lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Last mini-batch redistribution in distributed samplers #1277

Closed pzelasko closed 5 months ago

pzelasko commented 5 months ago

This change is intended to prevent training/validation loops from hanging in multi-GPU setups when samplers are configured with drop_last=False. On the other hand, with drop_last=True and a large number of GPUs (world_size), up to world_size - 1 mini-batches are discarded, which is not acceptable for validation data as it is typically small.
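
To make the trade-off concrete, here is a small illustration (not Lhotse code) of how many mini-batches drop_last=True throws away; the function name is made up for this example:

```python
# drop_last=True keeps only full groups of world_size mini-batches,
# so the leftover (num_batches % world_size) mini-batches are discarded,
# i.e. up to world_size - 1 of them.

def batches_discarded(num_batches: int, world_size: int) -> int:
    return num_batches % world_size

print(batches_discarded(num_batches=10, world_size=8))  # 2 mini-batches dropped
print(batches_discarded(num_batches=7, world_size=8))   # all 7 dropped: the whole validation set may vanish
```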

The default (drop_last=False) will now redistribute the data across the mini-batches intended for each rank (since each rank has access to all ranks' mini-batches in the current step anyway) in a way that is consistent across ranks, yielding a partial mini-batch on every rank. When the number of available examples is smaller than world_size, we duplicate examples to cover the difference, ensuring each rank gets at least a 1-element mini-batch. Note: this duplication is consistent with PyTorch's DistributedSampler behavior; we missed it when creating Lhotse samplers.
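
A minimal sketch of the redistribution idea, assuming each rank can see the full set of leftover examples for the current step and only needs to pick its own share deterministically; the helper below is illustrative and is not Lhotse's actual implementation:

```python
def redistribute_last_batch(examples, world_size, rank):
    """Split the leftover examples so every rank receives a non-empty partial
    mini-batch; duplicate examples when there are fewer than world_size."""
    if len(examples) < world_size:
        # Wrap around to duplicate examples so every rank gets at least one item,
        # mirroring PyTorch DistributedSampler's padding behavior.
        examples = [examples[i % len(examples)] for i in range(world_size)]
    # Deterministic round-robin assignment: rank r takes items r, r + world_size, ...
    return examples[rank::world_size]


if __name__ == "__main__":
    leftover = ["a", "b", "c", "d", "e"]  # 5 leftover examples, 4 GPUs
    for rank in range(4):
        print(rank, redistribute_last_batch(leftover, world_size=4, rank=rank))
    # rank 0 -> ['a', 'e'], ranks 1-3 -> one example each: every rank still steps,
    # so collective operations do not hang on the final partial mini-batch.
```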