mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

[RNN-T] unfreeze bucketing sampler hyperparameter #453

Closed mwawrzos closed 3 years ago

mwawrzos commented 3 years ago

We would like to unfreeze the bucketing sampler hyperparameter. Bucketing sampling trades training accuracy for throughput: with more buckets, the samples in a single batch have a narrower length distribution (less padding). However, the randomness of the sample order is also reduced, which hurts convergence and training stability.
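
To make the trade-off concrete, here is a minimal illustrative sketch of a bucketing sampler (not the reference implementation; the function name and batching details are assumptions). With `num_buckets=1` it degenerates to plain shuffling of the whole dataset; larger values confine shuffling to length-sorted buckets.

```python
import random

def bucketing_batches(sample_lengths, batch_size, num_buckets, seed=0):
    """Illustrative bucketing sampler: sort samples by length, split the sorted
    order into num_buckets buckets, shuffle only within each bucket, then form
    batches. More buckets -> more uniform lengths per batch (less padding),
    but less randomness in the sample order."""
    rng = random.Random(seed)
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i])
    bucket_size = (len(order) + num_buckets - 1) // num_buckets
    batches = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        rng.shuffle(bucket)                       # randomness is confined to the bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)                          # shuffle batch order across buckets
    return batches

lengths = list(range(100, 1100, 10))              # 100 synthetic utterance lengths
print(bucketing_batches(lengths, batch_size=8, num_buckets=1)[0])
```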

github-actions[bot] commented 3 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

qpjaada commented 3 years ago

@mwawrzos - for easier tracking, would you mind sharing the data here showing the change in epochs to converge with and without bucketing? And also, the corresponding batch-sizes used? thanks!

mwawrzos commented 3 years ago

@qpjaada - sure. Here are the epochs to converge for BS=2048 without bucketing: 56, 56, 54, 56, 57, 58, 55, 56, 60, 56, 58, 57, 53, 60, 57, 57, 55, 57, 56, 56 (average: 56.5, stdev: 1.701392618).

For the same batch size, the RCPs with bucketing look like this: 57, 58, 58, 56, 60, 63, 59, 59, 60, 59, 58, 58, 56, 57, 58, 61, 59, 57, 59, 58 (average: 58.5, stdev: 1.670171753).

As a histogram: [image: RNN-T convergence]
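
For reference, the averages and standard deviations above can be reproduced from the listed epoch counts with Python's standard library (they correspond to the sample standard deviation):

```python
import statistics

no_bucketing   = [56, 56, 54, 56, 57, 58, 55, 56, 60, 56,
                  58, 57, 53, 60, 57, 57, 55, 57, 56, 56]
with_bucketing = [57, 58, 58, 56, 60, 63, 59, 59, 60, 59,
                  58, 58, 56, 57, 58, 61, 59, 57, 59, 58]

for name, epochs in [("no bucketing", no_bucketing), ("bucketing", with_bucketing)]:
    print(name, statistics.mean(epochs), round(statistics.stdev(epochs), 3))
# no bucketing 56.5 1.701
# bucketing 58.5 1.67
```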

qpjaada commented 3 years ago

@mwawrzos - this change seems fine to us. It would be good to update the reference code to reflect a default of NUM_BUCKETS=1, so that any new participant observes the best convergence behavior with the reference: https://github.com/mlcommons/training/blob/master/rnn_speech_recognition/pytorch/scripts/train.sh#L31. I assume we should also update the RCPs for v1.1 here: https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/1.0.0/rcps_rnnt.json

mwawrzos commented 3 years ago

I opened the requested PRs:

mwawrzos commented 3 years ago

I was asked outside the thread if converging in fewer epochs is the main argument for this PR.

The other benefit is reduced variance in results. This is not visible in the results obtained with the reference, because those runs used gradient accumulation, which mitigates the issues that come with bucketing sampling. The lower variance is easier to see in our submission results: [image: submission_bucketing]
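
For readers who have not seen the mitigation mentioned above, here is a minimal sketch of gradient accumulation in PyTorch (placeholder model, data, and `accum_steps`, not the reference code). The intuition, presumably, is that averaging gradients over several consecutive micro-batches before each optimizer step makes the effective batch aggregate more samples, restoring some of the randomness lost to bucketing.

```python
import torch

model = torch.nn.Linear(16, 4)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4                                    # micro-batches per optimizer step

optimizer.zero_grad()
for step in range(32):
    x, y = torch.randn(8, 16), torch.randn(8, 4)   # stand-in for one micro-batch
    loss = loss_fn(model(x), y) / accum_steps      # scale so gradients average out
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # update with the accumulated gradient
        optimizer.zero_grad()
```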

johntran-nv commented 3 years ago

@emizan76, and others, any objections to merging this one? I think reduced variance is a good goal for all the references, so this sounds reasonable to me.

emizan76 commented 3 years ago

No objections here. I think we already approved in the logging repo.