The bucket sampling algorithm described in the README removes random samples from the epoch to make its length divisible by the batch size (point 4.ii).
The source has a bug: the samples are not removed, but replaced by samples from the last bucket, which contains the longest samples: https://github.com/mlcommons/training/blob/8f7f74f88874ae85a58ddedd778c320739b37444/rnn_speech_recognition/pytorch/common/data/dali/sampler.py#L86

The bug has two impacts:
- A negative impact on performance: the longest samples are mixed in with shorter samples, so more padding is needed.
- It impacts sample randomness, as the longest samples are chosen more often.

This PR aligns the code with the algorithm described in the README. An alternative solution would be to update the README, and submitters would then need to follow the bugged algorithm.
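The intended behavior (README point 4.ii) can be sketched as follows. This is a minimal illustration with simplified assumptions, not the actual sampler code; the function name and epoch representation are hypothetical:

```python
import random

def make_epoch_divisible(epoch, batch_size, rng):
    """Behavior per the README (point 4.ii): drop randomly chosen
    samples so the epoch length is divisible by the batch size."""
    n_extra = len(epoch) % batch_size
    if n_extra == 0:
        return epoch
    drop = set(rng.sample(range(len(epoch)), n_extra))
    return [s for i, s in enumerate(epoch) if i not in drop]

# The buggy behavior, by contrast, overwrote the extra slots with
# samples drawn from the last (longest) bucket instead of dropping
# them, so the longest utterances were over-sampled and batches
# mixing them with short utterances required extra padding.

rng = random.Random(0)
epoch = list(range(10))  # 10 sample indices
out = make_epoch_divisible(epoch, batch_size=4, rng=rng)
print(len(out))  # 8: divisible by the batch size, nothing duplicated
```

Note that the corrected version never duplicates a sample: the output is a strict subset of the input epoch, which preserves sampling uniformity.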