This PR enables specifying `seed="randomized"` and `seed="trng"` for `DynamicCutSampler` and `DynamicBucketingSampler`.
Both options are intended for use with `IterableDatasetWrapper` and cause the samplers to iterate with a different random seed on each node and in each dataloading worker. Note that for bucketing this de-synchronizes batch sizes across GPUs from the start of iteration (before this change, that happened anyway after a number of training steps, as observed in https://github.com/lhotse-speech/lhotse/discussions/857).
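
To make "a different random seed in each node and dataloading worker" concrete, here is a minimal standard-library sketch. It is not the implementation in this PR: the helper `derive_seed`, its `num_workers` argument, and the mixing rule are assumptions chosen only to show how distinct seeds lead to independent shuffles per `(rank, worker_id)` pair.

```python
import random


def derive_seed(base_seed: int, rank: int, worker_id: int, num_workers: int) -> int:
    """Hypothetical helper: combine a base seed with the process coordinates."""
    # Assumed mixing rule; it yields a unique value per (rank, worker_id)
    # as long as 0 <= worker_id < num_workers.
    return base_seed + rank * num_workers + worker_id


def shuffled(items, seed: int):
    """Return a copy of `items` shuffled with a private RNG seeded by `seed`."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


# Simulate 2 nodes (ranks) with 2 dataloading workers each.
world_size, num_workers = 2, 2
seeds = {
    (rank, worker_id): derive_seed(0, rank, worker_id, num_workers)
    for rank in range(world_size)
    for worker_id in range(num_workers)
}

# Each simulated worker shuffles the same data with its own seed.
orders = {key: shuffled(range(10), seed) for key, seed in seeds.items()}

for key, order in orders.items():
    print(key, seeds[key], order)
```

Because every simulated worker draws items in its own order, the contents of a bucketed batch can differ between workers, which is consistent with the batch-size de-synchronization noted above.
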
From now on, the sampler also attaches a custom field called `dataloading_info` to each cut. It is a dict with the keys `rank`, `world_size`, and `worker_id`, which help diagnose dataloading behavior.
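
As a sketch of how this field might be used for diagnosis, the snippet below counts cuts per `(rank, worker_id)`. It works on hand-written stand-in dicts with the three keys named above rather than on real cuts; the function name `summarize_dataloading` and the sample values are hypothetical.

```python
from collections import Counter


def summarize_dataloading(infos):
    """Count how many cuts came from each (rank, worker_id) pair."""
    return Counter((info["rank"], info["worker_id"]) for info in infos)


# Stand-ins for the dicts attached to the cuts of a few batches (made-up values).
example_infos = [
    {"rank": 0, "world_size": 2, "worker_id": 0},
    {"rank": 0, "world_size": 2, "worker_id": 1},
    {"rank": 1, "world_size": 2, "worker_id": 0},
    {"rank": 0, "world_size": 2, "worker_id": 0},
]

counts = summarize_dataloading(example_infos)
for (rank, worker_id), n in sorted(counts.items()):
    print(f"rank={rank} worker_id={worker_id}: {n} cut(s)")
```

With real data, the dicts would be read from each cut's `dataloading_info` field; a heavily skewed count would hint that some ranks or workers are contributing more cuts than others.
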