lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

On a large GPU cluster, DynamicBucketingSampler.__next__ spends a lot of time #1399

Open shushanxingzhe opened 1 week ago

shushanxingzhe commented 1 week ago

@pzelasko When I use DynamicBucketingSampler on a 600-GPU cluster, the code at https://github.com/lhotse-speech/lhotse/blob/e2b149dc70b74532329e04dc1e6e6ff8ecc1cce9/lhotse/dataset/sampling/base.py#L297 wastes a lot of time, since looping over a world_size of 600 is slow. Could you please give me any advice on how to reduce the time spent there?

pzelasko commented 1 week ago

I suggest either moving to the Lhotse Shar format (see the tutorial in the examples directory), or sharding your manifest into a lot of small chunks and using CutSet.from_files with the random seed set to "trng", calling .repeat() on the CutSet (which makes it infinite), and then manually overriding the rank to 0 and the world size to 1 in the sampler on every GPU. Finally, you can wrap both the sampler and the dataset in IterableDatasetWrapper (with non-Shar data it may not be needed). This makes the order of data iteration different on each dataloading worker instead of trying to deduplicate batches across ranks. In practice it works just as well, but you need to count training steps instead of epochs. A rough sketch of this setup is shown below.
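A minimal sketch of that setup, assuming the manifest has already been split into many small chunks (the `manifests/cuts-*.jsonl.gz` paths, `max_duration`, and `num_workers` values are placeholders) and that your Lhotse version provides `CutSet.from_files`, `CutSet.repeat`, and `IterableDatasetWrapper` as described above; adjust names and arguments to your actual setup:

```python
from glob import glob

from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# Open the manifest chunks lazily; with seed="trng" each dataloading worker
# draws a fresh OS-level seed, so every worker iterates the chunks in a
# different random order. (Paths are hypothetical.)
cuts = CutSet.from_files(sorted(glob("manifests/cuts-*.jsonl.gz")), seed="trng")
cuts = cuts.repeat()  # infinite stream: count training steps, not epochs

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # placeholder batch budget in seconds
    shuffle=True,
    # Pretend to be a single-process job so the sampler does not loop over
    # all 600 ranks trying to deduplicate batches.
    rank=0,
    world_size=1,
)

dataset = K2SpeechRecognitionDataset()

# Wrapping sampler + dataset moves sampling into the dataloading workers;
# mainly needed for Shar data, optional otherwise.
dloader = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=4,
)
```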