mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

UNET3D - Change DistributedSampler seed to be same across all workers #625

Closed — lhovon closed this issue 1 year ago

lhovon commented 1 year ago

The DistributedSampler seed should be the same across all workers; otherwise the data split across ranks is not exclusive. See https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler

Indeed, printing out the indices read by each worker shows overlap (around 20% to 30%!).
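For illustration only (this is not code from the PR), a minimal sketch of how the overlap can be reproduced: if each rank constructs its DistributedSampler with a different seed, each rank permutes the dataset independently and the per-rank index subsets overlap. The dataset size and world size below are arbitrary placeholders.

```python
# Hypothetical reproduction (not from the PR): per-rank seeds make the
# per-rank index subsets overlap instead of forming an exclusive split.
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(168))   # placeholder dataset; size is illustrative
world_size = 8

def rank_indices(rank, seed):
    # shuffle=True with a different seed per rank -> independent permutations
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                 shuffle=True, seed=seed)
    return set(iter(sampler))

# Buggy configuration: a different seed on each rank.
per_rank = [rank_indices(r, seed=r) for r in range(world_size)]
union = set().union(*per_rank)
total = sum(len(s) for s in per_rank)
print(f"{total - len(union)} of {total} assigned samples are duplicates")
```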

It looks like the shuffling seeds might have been meant for this but were not used. I took the first one to initialize the sampler; the seed is then updated each epoch by the call to sampler.set_epoch() already present in training.py. A sketch of the resulting setup is shown below.
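A minimal sketch of the fix, assuming a shared seed value read from the benchmark flags (the function and parameter names here are illustrative, not the exact UNET3D code): every rank passes the same seed to DistributedSampler, and the existing sampler.set_epoch() call in training.py advances the shuffle identically on all ranks each epoch.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_train_loader(dataset, rank, world_size, seed, batch_size):
    # `seed` must be the same value on every rank for the split to be exclusive.
    sampler = DistributedSampler(dataset,
                                 num_replicas=world_size,
                                 rank=rank,
                                 shuffle=True,
                                 seed=seed)          # identical on all ranks
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler

# Training loop (the set_epoch call is already present in training.py):
# for epoch in range(num_epochs):
#     sampler.set_epoch(epoch)  # reshuffles consistently across ranks per epoch
#     for batch in loader:
#         ...
```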

github-actions[bot] commented 1 year ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

mmarcinkiewicz commented 1 year ago

LGTM, good catch. I'll run some RCP checks to verify whether convergence is affected.

mmarcinkiewicz commented 1 year ago

Opened https://github.com/mlcommons/logging/pull/310

I don't think convergence is affected that much, but the 3-4% I mentioned at the end is possible. Closing. Thanks!

nv-rborkar commented 1 year ago

Approved in Training WG 05/04/2023.