Closed pzelasko closed 5 months ago
To illustrate this works on a toy example: let's use mini LibriSpeech (5h, ~1500 cuts) with DynamicBucketingSampler(cuts, ..., shuffle=True, shuffle_buffer_size=100, buffer_size=1000)
(deliberately small values relative to the dataset size). Testing across multiple settings of num_buckets = (2, 5, 10, 20, 30, 50)
and max_duration = (50, 100, 200, 400)
, we measure the Spearman's correlation between input and output list of cut IDs (input: as in manifest; output: flattened mini-batch CutSets yielded by the sampler).
We find the old implementation has a mean R of 0.78, the new implementation has a mean R of 0.17 -- means the sequence of cuts in the new implementation is much closer to truly shuffled, as it can use a 10x shuffling buffer capacity despite not having a separate buffer for that.
The failed tests are flaky -- merging.
I realized it's redundant to have two buffers, and we can instead use a single buffer to achieve both goals of shuffling and bucketing (and achieve them better). The basic idea is that we remove the shuffle buffer and let
DynamicBucketer
choose examples from the bucketing buffer at random without replacements (whereas previously, it just sampled the first N until hitting the constraint). Effectively, it's using the larger size of both buffers in the old implementation for both tasks while saving the space of the smaller buffer.