@mwawrzos - for easier tracking, would you mind sharing the data here showing the change in epochs to converge with and without bucketing, and also the corresponding batch sizes used? Thanks!
@qpjaada - sure, here are the epoch counts for BS=2048 without bucketing: 56, 56, 54, 56, 57, 58, 55, 56, 60, 56, 58, 57, 53, 60, 57, 57, 55, 57, 56, 56 average: 56.5, stdev: 1.701392618
For the same batch size, the RCPs with bucketing look like this: 57, 58, 58, 56, 60, 63, 59, 59, 60, 59, 58, 58, 56, 57, 58, 61, 59, 57, 59, 58 average: 58.5, stdev: 1.670171753
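For convenience, the averages and sample standard deviations quoted above can be reproduced from the raw epoch counts with a few lines of Python:

```python
# Reproduce the summary statistics quoted above from the raw epoch counts.
from statistics import mean, stdev

no_bucketing   = [56, 56, 54, 56, 57, 58, 55, 56, 60, 56,
                  58, 57, 53, 60, 57, 57, 55, 57, 56, 56]
with_bucketing = [57, 58, 58, 56, 60, 63, 59, 59, 60, 59,
                  58, 58, 56, 57, 58, 61, 59, 57, 59, 58]

for name, epochs in (("no bucketing", no_bucketing), ("bucketing", with_bucketing)):
    # stdev() is the sample standard deviation, matching the values quoted above
    print(f"{name}: mean={mean(epochs)}, stdev={stdev(epochs):.9f}")
# no bucketing: mean=56.5, stdev=1.701392618
# bucketing: mean=58.5, stdev=1.670171753
```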
On a histogram, it looks like this:
@mwawrzos - this change seems fine to us. I think it would be good to update the reference code to reflect a default of NUM_BUCKETS=1
(so that any new participant can observe the best convergence behavior with the reference):
https://github.com/mlcommons/training/blob/master/rnn_speech_recognition/pytorch/scripts/train.sh#L31
I assume we should update the RCPs for v1.1 here:
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/1.0.0/rcps_rnnt.json
I opened the requested PRs:
I was asked outside the thread if converging in fewer epochs is the main argument for this PR.
The other benefit is reduced variance in the results. This is not visible in the results obtained with the reference, because those results were obtained with gradient accumulation, which mitigates the issues introduced by bucketing sampling. The lower variance is easier to notice in our submission results:
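To illustrate why gradient accumulation masks the effect, here is a generic sketch (not the reference training loop or any submission code; the model, optimizer, and batch sizes are placeholders). Each optimizer step accumulates gradients from several consecutive micro-batches, so even if every micro-batch is drawn from a single length bucket, the resulting update sees a mix of lengths much like an unbucketed batch:

```python
import torch

model = torch.nn.Linear(16, 1)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4                    # e.g. one global batch built from 4 micro-batches

micro_batches = [torch.randn(512, 16) for _ in range(accumulation_steps)]  # dummy data

optimizer.zero_grad()
for i, x in enumerate(micro_batches):
    loss = model(x).pow(2).mean()           # placeholder loss
    (loss / accumulation_steps).backward()  # gradients accumulate across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                    # one update sees samples from all micro-batches
        optimizer.zero_grad()
```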
@emizan76 , and others, any objections to merging this one? I think reduced variance is a good goal for all the references, so this sounds reasonable to me.
No objections here. I think we already approved in the logging repo.
We would like to unfreeze the bucketing sampler hyperparameter. Bucketing sampling trades training accuracy for performance: with more buckets, the samples in a single batch have a narrower length distribution, so there is less padding and higher throughput. However, batch composition is also less random, which hurts convergence and training stability.
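For context, a minimal illustration of the mechanism (a self-contained sketch, not the reference sampler; the function name and details are made up): samples are sorted by length, split into `num_buckets` buckets, shuffled within each bucket, and each batch is drawn from a single bucket. With `num_buckets=1` this reduces to plain random batching.

```python
import random

def bucketed_batches(sample_lengths, batch_size, num_buckets, seed=0):
    """Illustrative bucketing sampler (hypothetical, not the reference code).

    Sorts sample indices by length, splits them into num_buckets contiguous
    buckets, shuffles within each bucket, and yields batches that each come
    from a single bucket. num_buckets=1 is plain random batching.
    """
    rng = random.Random(seed)
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i])
    bucket_size = len(order) // num_buckets
    buckets = [order[b * bucket_size:(b + 1) * bucket_size] for b in range(num_buckets)]

    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # randomness is confined to a single bucket
        batches += [bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size)]
    rng.shuffle(batches)     # mix the order of batches across buckets
    return batches

# More buckets -> each batch spans a narrow length range (less padding, faster steps),
# but batch composition is less random; num_buckets=1 -> fully random batches.
lengths = random.Random(1).choices(range(100, 2000), k=4096)  # dummy utterance lengths
for nb in (1, 6):
    first = bucketed_batches(lengths, batch_size=32, num_buckets=nb)[0]
    spread = max(lengths[i] for i in first) - min(lengths[i] for i in first)
    print(f"num_buckets={nb}: within-batch length spread = {spread}")
```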