lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

On a large GPU cluster, DynamicBucketingSampler.__next__ spends a lot of time #1399

Open shushanxingzhe opened 1 week ago

shushanxingzhe commented 1 week ago

@pzelasko When I use DynamicBucketingSampler on a 600-GPU cluster, the code at https://github.com/lhotse-speech/lhotse/blob/e2b149dc70b74532329e04dc1e6e6ff8ecc1cce9/lhotse/dataset/sampling/base.py#L297 wastes a lot of time, since looping over a world_size of 600 is slow. Could you please give me any advice on how to reduce the time spent there?

pzelasko commented 1 week ago

I suggest either moving to the Lhotse Shar format (see the tutorial in the examples directory), or sharding your manifest into a lot of small chunks and using CutSet.from_files with the random seed set to "trng", calling .repeat() on the CutSet (which makes it infinite), and then manually overriding the rank to 0 and the world size to 1 in the sampler on every GPU. Finally, you can wrap both the sampler and the dataset in IterableDatasetWrapper (with non-Shar data it may not be needed). This makes the order of data iteration different on each dataloading worker instead of trying to deduplicate batches across ranks. In practice it works just as well, but you need to count training steps instead of epochs. A rough sketch of this setup is shown below.
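A minimal sketch of that setup, assuming the manifest has already been split into many small chunks (the `manifests/cuts-*.jsonl.gz` paths, `max_duration`, and `num_workers` values are placeholders) and that your Lhotse version provides `CutSet.from_files`, `CutSet.repeat`, and `IterableDatasetWrapper` as described above; adjust names and arguments to your actual setup:

```python
from glob import glob

from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# Open the manifest chunks lazily; with seed="trng" each dataloading worker
# draws a fresh OS-level seed, so every worker iterates the chunks in a
# different random order. (Paths are hypothetical.)
cuts = CutSet.from_files(sorted(glob("manifests/cuts-*.jsonl.gz")), seed="trng")
cuts = cuts.repeat()  # infinite stream: count training steps, not epochs

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # placeholder batch budget in seconds
    shuffle=True,
    # Pretend to be a single-process job so the sampler does not loop over
    # all 600 ranks trying to deduplicate batches.
    rank=0,
    world_size=1,
)

dataset = K2SpeechRecognitionDataset()

# Wrapping sampler + dataset moves sampling into the dataloading workers;
# mainly needed for Shar data, optional otherwise.
dloader = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=4,
)
```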