lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Merge shuffling and bucketing buffers in DynamicBucketingSampler #1276

Closed pzelasko closed 5 months ago

pzelasko commented 5 months ago

I realized it's redundant to have two buffers: a single buffer can serve both shuffling and bucketing, and do both better. The basic idea is to remove the shuffle buffer and let DynamicBucketer choose examples from the bucketing buffer at random, without replacement (previously, it just took the first N examples until hitting the constraint). Effectively, both tasks now use the larger of the two buffer sizes from the old implementation, while the memory of the smaller buffer is saved.
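For readers who want the idea in miniature, here is a rough sketch of drawing batches at random, without replacement, from a single buffer. It is not the code in this PR: the function name `single_buffer_batches`, the `(id, duration)` tuples, and the omission of buckets are all simplifying assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Item = Tuple[str, float]  # hypothetical (cut_id, duration_in_seconds) pair


def single_buffer_batches(
    items: Iterable[Item],
    buffer_size: int,
    max_duration: float,
    seed: int = 0,
) -> Iterator[List[Item]]:
    """Toy single-buffer sampler (illustrative only, no bucketing by duration)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer: List[Item] = []
    sentinel = object()
    while True:
        # Top the buffer up to its capacity from the input stream.
        while len(buffer) < buffer_size:
            nxt = next(source, sentinel)
            if nxt is sentinel:
                break
            buffer.append(nxt)
        if not buffer:
            return  # input exhausted and buffer drained
        batch: List[Item] = []
        total = 0.0
        while buffer:
            # Pick a random position instead of always taking the first item.
            idx = rng.randrange(len(buffer))
            item = buffer[idx]
            # Stop once adding the drawn item would exceed the constraint;
            # the item stays in the buffer for a later batch.
            if batch and total + item[1] > max_duration:
                break
            # Remove the drawn item in O(1) by swapping it with the last one.
            buffer[idx] = buffer[-1]
            buffer.pop()
            batch.append(item)
            total += item[1]
        yield batch
```

Because every drawn item is removed from the buffer, each example is emitted exactly once, and the same buffer provides both the randomness and the pool that batches are formed from.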

pzelasko commented 5 months ago

To illustrate that this works on a toy example: let's use mini LibriSpeech (5h, ~1500 cuts) with DynamicBucketingSampler(cuts, ..., shuffle=True, shuffle_buffer_size=100, buffer_size=1000) (deliberately small values relative to the dataset size). Testing across multiple settings of num_buckets = (2, 5, 10, 20, 30, 50) and max_duration = (50, 100, 200, 400), we measure Spearman's correlation between the input and output lists of cut IDs (input: as in the manifest; output: the flattened mini-batch CutSets yielded by the sampler).

We find that the old implementation has a mean R of 0.78 and the new implementation has a mean R of 0.17. This means the sequence of cuts in the new implementation is much closer to truly shuffled, as it can use a 10x larger shuffling buffer capacity despite not having a separate buffer for that.
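The measurement described above can be approximated with a short helper. This is a minimal sketch, assuming both lists contain the same unique cut IDs (so there are no tied ranks); the function name `spearman_order_correlation` is hypothetical, and the exact script behind the numbers above is not shown in this thread.

```python
from typing import Hashable, Sequence


def spearman_order_correlation(
    input_ids: Sequence[Hashable], output_ids: Sequence[Hashable]
) -> float:
    """Spearman's rho between the positions of the same IDs in two orderings.

    Assumes every ID is unique and appears in both sequences, which lets us
    use the closed form for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(input_ids)
    if n < 2:
        raise ValueError("need at least two IDs")
    if len(output_ids) != n or len(set(input_ids)) != n:
        raise ValueError("IDs must be unique and both lists the same length")
    if set(input_ids) != set(output_ids):
        raise ValueError("both lists must contain the same IDs")
    # Position of each ID in the input (manifest) order.
    input_pos = {cid: i for i, cid in enumerate(input_ids)}
    # Sum of squared differences between input and output positions.
    d2 = sum((input_pos[cid] - j) ** 2 for j, cid in enumerate(output_ids))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A value near 1 means the output largely preserves the manifest order, while a value near 0 means the two orderings are close to unrelated, which is what a well-shuffled output should look like.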

pzelasko commented 5 months ago

The failed tests are flaky -- merging.