shushanxingzhe opened 1 week ago
I suggest either moving to the Lhotse Shar format (see the tutorial in the examples directory), or sharding your manifest into a lot of small chunks and using CutSet.from_files with the random seed set to "trng", calling .repeat() on the CutSet (which makes it infinite), and then manually overriding rank to 0 and world size to 1 in the sampler on every GPU. Finally, you can wrap both the sampler and the dataset into IterableDatasetWrapper (though with non-Shar data it may not be needed). This makes the order of data iteration different on each dataloading worker instead of trying to deduplicate across ranks. In practice it works just as well, but you need to count training steps instead of epochs.
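The sharding step above can be sketched with the standard library alone. This is a toy helper, not Lhotse code: the `shard_manifest` name, the shard size, and the `cuts-NNNNNN.jsonl` naming scheme are all illustrative. After writing the shards, you would hand the resulting paths to the `CutSet.from_files` / `.repeat()` pattern described above.

```python
import json
import tempfile
from pathlib import Path


def shard_manifest(manifest_lines, out_dir, shard_size=1000):
    """Split a JSONL manifest into many small shard files.

    Hypothetical helper: the shard size and file naming are
    illustrative, not a Lhotse convention.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    shard, idx = [], 0
    for line in manifest_lines:
        shard.append(line)
        if len(shard) == shard_size:
            path = out_dir / f"cuts-{idx:06d}.jsonl"
            path.write_text("\n".join(shard) + "\n")
            paths.append(path)
            shard, idx = [], idx + 1
    if shard:  # write the final, possibly smaller, shard
        path = out_dir / f"cuts-{idx:06d}.jsonl"
        path.write_text("\n".join(shard) + "\n")
        paths.append(path)
    return paths


# Toy manifest of 2500 fake cut records -> 3 shards of at most 1000 lines.
lines = [json.dumps({"id": f"cut-{i}"}) for i in range(2500)]
with tempfile.TemporaryDirectory() as d:
    shards = shard_manifest(lines, d, shard_size=1000)
    print(len(shards))  # -> 3
```

The point of many small shards is that each dataloading worker can open them in a different random order, so no cross-worker coordination is needed.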
@pzelasko When I use DynamicBucketingSampler on a 600-GPU cluster, the code at https://github.com/lhotse-speech/lhotse/blob/e2b149dc70b74532329e04dc1e6e6ff8ecc1cce9/lhotse/dataset/sampling/base.py#L297 wastes a lot of time, since looping over a world_size of 600 is slow. Could you please give me any advice on how to reduce that time?
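If the hot spot is the per-rank deduplication (each rank drawing `world_size` batches and keeping only its own), the cost can be illustrated without Lhotse. The code below is a toy model of that behavior, not the actual sampler implementation, and the rank-override comparison assumes the rank=0 / world_size=1 pattern suggested above.

```python
import random


def batches(seed):
    """Infinite stream of toy 'batches', reshuffled each pass
    in a seed-dependent order (stand-in for a sampler's output)."""
    rng = random.Random(seed)
    data = list(range(10))
    while True:
        rng.shuffle(data)
        yield from data


def next_batch_dedup(stream, rank, world_size):
    """Toy model of per-rank deduplication: draw world_size batches,
    keep only the one at position `rank`, discard the rest.
    Each training step therefore costs O(world_size) draws."""
    drawn = [next(stream) for _ in range(world_size)]
    return drawn[rank]


# With world_size=600, every rank draws 600 batches per step and
# throws away 599 of them.
stream = batches(seed=0)
b_slow = next_batch_dedup(stream, rank=3, world_size=600)

# The suggested override (rank=0, world_size=1 on every GPU, each
# worker seeded independently, e.g. from system entropy like "trng")
# reduces this to a single draw per step:
own_stream = batches(seed=random.SystemRandom().randint(0, 2**32))
b_fast = next(own_stream)
```

In other words, with deduplication the per-step sampling work grows linearly with the number of GPUs, while with independently seeded infinite streams it stays constant; the trade-off is that epochs are no longer well defined and you count training steps instead.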