mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[RNN-T] bucketing sampler fix #460

Closed mwawrzos closed 3 years ago

mwawrzos commented 3 years ago

The bucket sampling algorithm described in the README removes random samples from the epoch to make the epoch size divisible by the batch size (point 4.ii).

The source has a bug where the samples are not removed but instead replaced by samples from the last bucket, which contains the longest samples: https://github.com/mlcommons/training/blob/8f7f74f88874ae85a58ddedd778c320739b37444/rnn_speech_recognition/pytorch/common/data/dali/sampler.py#L86

The bug has two impacts:

  1. A negative impact on performance: because the longest samples are mixed in with shorter samples, more padding is needed.
  2. An impact on sample randomness: the longest samples are chosen more often than others.

This PR aligns the code with the algorithm described in the README. An alternative solution would be to update the README instead, in which case submitters would need to follow the buggy algorithm.
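The intended behavior (removal rather than replacement) can be sketched as follows. This is a minimal illustration of README point 4.ii, not the actual sampler code; the function and parameter names are hypothetical:

```python
import random

def make_epoch_divisible(indices, global_batch_size, rng):
    # Hypothetical sketch of README point 4.ii: remove randomly chosen
    # samples so the epoch length becomes divisible by the batch size.
    # (The bug this PR fixes replaced those samples with entries from
    # the last bucket, which holds the longest utterances, instead of
    # dropping them.)
    indices = list(indices)
    remainder = len(indices) % global_batch_size
    if remainder:
        to_drop = set(rng.sample(range(len(indices)), remainder))
        indices = [x for i, x in enumerate(indices) if i not in to_drop]
    return indices

rng = random.Random(0)
epoch = make_epoch_divisible(range(10), global_batch_size=4, rng=rng)
print(len(epoch))  # 8: two samples removed, not replaced
```

Because the excess samples are dropped rather than swapped for long utterances, every remaining sample keeps a uniform chance of appearing, and batches are not padded up to the longest-bucket lengths.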

github-actions[bot] commented 3 years ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

johntran-nv commented 3 years ago

+emizan@google.com, could you please approve as well?