This change is intended to prevent training/validation loops from hanging in multi-GPU setups when samplers are configured with `drop_last=False`. With `drop_last=True` and a large number of GPUs (`world_size`), up to `world_size - 1` mini-batches can be discarded, which is not acceptable for validation data, as it is typically small.
The default (`drop_last=False`) will now redistribute the data across the mini-batches intended for each rank (since each rank has access to all ranks' mini-batches in the current step anyway) in a way that is consistent across ranks, yielding a partial mini-batch on every rank. When the number of available examples is less than `world_size`, we duplicate examples to cover the difference, ensuring each rank gets at least a 1-element mini-batch. Note: this duplication is consistent with PyTorch's `DistributedSampler` behavior; we missed it when creating Lhotse samplers.
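To illustrate the idea, here is a minimal sketch of the redistribution logic. This is a hypothetical helper, not Lhotse's actual implementation: `redistribute` and its signature are made up for illustration; it assumes every rank sees the same flat list of examples for the current step and derives its own slice deterministically.

```python
def redistribute(examples, world_size):
    """Split the examples gathered for one step across all ranks.

    Hypothetical sketch: if there are fewer examples than ranks,
    duplicate examples (cycling from the start) so every rank
    receives at least one, mirroring the padding behavior of
    PyTorch's DistributedSampler.
    """
    if len(examples) < world_size:
        # Duplicate examples to cover the difference.
        examples = [examples[i % len(examples)] for i in range(world_size)]
    # Deterministic round-robin split: every rank computes the same
    # partition, so no rank runs out of mini-batches before the others
    # and collective ops (e.g. all_reduce) never hang.
    return [examples[rank::world_size] for rank in range(world_size)]
```

For example, 5 examples on 4 ranks yield partial mini-batches `[[0, 4], [1], [2], [3]]` rather than discarding the trailing example, and 2 examples on 4 ranks are duplicated to give each rank a 1-element mini-batch.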