lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Enable seed randomization in dynamic samplers #1278

Closed pzelasko closed 5 months ago

pzelasko commented 5 months ago

This PR enables specifying seed="randomized" and seed="trng" for DynamicCutSampler and DynamicBucketingSampler.

Both options are intended for use with IterableDatasetWrapper and cause the samplers to iterate with a different random seed on each node and in each dataloading worker. Note that for bucketing, this de-synchronizes batch sizes across GPUs from the start of iteration (before this change, that happened anyway after a number of training steps, as observed in https://github.com/lhotse-speech/lhotse/discussions/857).
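To illustrate the idea of per-node, per-worker seeds, here is a minimal standalone sketch. It does not use lhotse and is not the implementation in this PR: `resolve_seed`, `shuffled_order`, and `base_seed` are hypothetical names, and the reading of "trng" as "draw from the OS entropy source" is an assumption made for the example.

```python
import random
import secrets


def resolve_seed(seed, rank=0, worker_id=0, base_seed=0):
    """Hypothetical helper: turn a seed spec into an integer seed."""
    if isinstance(seed, int):
        return seed  # fixed seed: identical on every rank and worker
    if seed == "randomized":
        # Deterministic given base_seed, but distinct per (rank, worker_id).
        return random.Random(f"{base_seed}-{rank}-{worker_id}").getrandbits(32)
    if seed == "trng":
        # Assumption: non-reproducible seed drawn from the OS entropy source.
        return secrets.randbits(32)
    raise ValueError(f"Unsupported seed: {seed!r}")


def shuffled_order(items, seed, rank, worker_id):
    """Shuffle a copy of `items` with the seed resolved for this worker."""
    rng = random.Random(resolve_seed(seed, rank=rank, worker_id=worker_id))
    order = list(items)
    rng.shuffle(order)
    return order


cut_ids = [f"cut-{i}" for i in range(100)]
fixed = [shuffled_order(cut_ids, 0, rank=r, worker_id=0) for r in range(2)]
varied = [shuffled_order(cut_ids, "randomized", rank=r, worker_id=0) for r in range(2)]
print(fixed[0] == fixed[1])    # an integer seed gives the same order on both ranks
print(varied[0] == varied[1])  # "randomized" is expected to give different orders
```

With an integer seed, every rank walks the data in the same order. With per-worker seeds the orders diverge immediately, which is why bucketed batch sizes stop lining up across GPUs from the first step.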

From now on, the sampler also attaches a custom field called dataloading_info to each cut: a dict with rank, world_size, and worker_id keys that helps diagnose dataloading issues.
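As a rough sketch of how this field could be used for diagnosis, the snippet below counts the cuts in a batch by the (rank, worker_id) pair that produced them. It uses stand-in objects rather than real lhotse cuts, and assumes the field can be read as an attribute named dataloading_info. `count_by_worker` is a hypothetical helper, not part of lhotse.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_worker(cuts):
    """Count cuts per (rank, worker_id) using their dataloading_info dicts."""
    counts = Counter()
    for cut in cuts:
        info = cut.dataloading_info
        counts[(info["rank"], info["worker_id"])] += 1
    return counts


# Stand-in cuts; real ones would come out of the sampler and dataloader.
batch = [
    SimpleNamespace(id="cut-0", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 0}),
    SimpleNamespace(id="cut-1", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 0}),
    SimpleNamespace(id="cut-2", dataloading_info={"rank": 0, "world_size": 2, "worker_id": 1}),
]

for (rank, worker_id), n in sorted(count_by_worker(batch).items()):
    print(f"rank={rank} worker_id={worker_id}: {n} cuts")
```

A tally like this makes it easy to spot a worker that contributes no data, or cuts that come from an unexpected rank.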

pzelasko commented 5 months ago

The failing test is flaky - merging