Closed — lhovon closed 1 year ago
LGTM, good catch. I'll run some RCP checks to verify whether convergence is affected.
Opened https://github.com/mlcommons/logging/pull/310
I don't think convergence is affected much, but the 3-4% I mentioned at the end is possible. Closing. Thanks!
approved in Training WG 05/04/2023.
The `DistributedSampler` seed should be the same across all workers; otherwise the data split is not mutually exclusive. https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler
Indeed, printing the indices read by each worker shows overlap (around 20% to 30%!).
It looks like the shuffling seeds might have been meant for this, but were not used. I took the first one to initialize the sampler. The seed is then updated for each epoch by the call to `sampler.set_epoch()` already present in training.py.
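A minimal sketch of the issue (not the actual training.py code; the toy dataset and `rank_indices` helper are illustrative only): `DistributedSampler` shuffles with its own seed, so if each rank constructs it with a different seed, the per-rank index sets can overlap instead of partitioning the dataset.

```python
from torch.utils.data import Dataset, DistributedSampler

class ToyDataset(Dataset):
    """Trivial dataset of 10 integer items, for illustration only."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return i

def rank_indices(seed, rank, world_size=2, epoch=0):
    # Passing num_replicas and rank explicitly avoids needing an
    # initialized process group for this demonstration.
    sampler = DistributedSampler(
        ToyDataset(), num_replicas=world_size, rank=rank,
        shuffle=True, seed=seed)
    # set_epoch() reseeds the shuffle each epoch (generator uses seed + epoch),
    # which is why one shared base seed is enough.
    sampler.set_epoch(epoch)
    return list(sampler)

# Same seed on every rank -> disjoint index sets that cover the dataset.
shared = [set(rank_indices(seed=1234, rank=r)) for r in range(2)]
assert shared[0].isdisjoint(shared[1])
assert shared[0] | shared[1] == set(range(10))

# Different per-rank seeds (the bug) -> each rank shuffles with its own
# permutation, so the splits can overlap.
buggy = [set(rank_indices(seed=1234 + r, rank=r)) for r in range(2)]
print("overlap with per-rank seeds:", sorted(buggy[0] & buggy[1]))
```

With a shared seed, each rank takes a disjoint stride of the same permutation; with per-rank seeds, each rank strides a different permutation, which is where the observed 20-30% overlap comes from.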