mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

WIP: Image segmentation single-node multi-gpu #553

Closed lhovon closed 1 year ago

lhovon commented 2 years ago

Hi,

I made these changes to use multiple GPUs on a single machine for experiments I'm running for the MLCommons Storage WG.

How to use the current implementation for single-node multi-GPU training is not documented, and it seems to be multi-node oriented, e.g. it requires the RANK environment variable to be defined. I followed this guide to develop these changes: https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51, and validated them by training on 8 GPUs until the model converged to a mean Dice score of 0.908.
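
For context, a minimal sketch of the single-node multi-GPU pattern that tutorial describes, using torch.multiprocessing.spawn and DistributedDataParallel; the worker function and the model here are illustrative placeholders, not the actual diff in this PR:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(local_rank, world_size):
    # One process per GPU; every process joins the same process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Stand-in for the benchmark's 3D U-Net; wrapping it in DDP makes
    # gradients get all-reduced across the GPUs after each backward pass.
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... build the DataLoader with a DistributedSampler and run the
    # usual training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```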

I would like to know whether these changes are useful and, if not, how to use the current implementation for single-node multi-GPU training. If they are useful, we can discuss how best to integrate them. I tried to make as few changes as possible and to gate them behind a switch. I have also tested with 1-8 GPUs, but have not done comprehensive regression testing of the other run modes.

Thanks

github-actions[bot] commented 2 years ago

MLCommons CLA bot:
Thank you for your submission, we really appreciate it. We ask that you sign our MLCommons CLA and be a member before we can accept your contribution. If you are interested in membership, please contact membership@mlcommons.org .
0 out of 1 committers have signed the MLCommons CLA.
❌ @jovonho
You can retrigger this bot by commenting `recheck` in this pull request.

johntran-nv commented 1 year ago

@mmarcinkiewicz could you review, please?

mmarcinkiewicz commented 1 year ago

Hi @jovonho. I'm a bit confused about why this PR is needed. The code works fine on a single-node, Docker-based system, simply by invoking:

python -m torch.distributed.launch --nproc_per_node=8 main.py --data_dir /data --epochs 10000 --start_eval_at 100 --eval_every 20 [...]
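
torch.distributed.launch (and its successor, torchrun) starts one worker process per GPU and exports RANK, WORLD_SIZE and, in recent PyTorch versions, LOCAL_RANK for each of them, so the reference implementation's requirement that RANK be defined is met without code changes. A launched script typically reads these back along the lines of the sketch below; this is illustrative, not the exact code in main.py:

```python
import os
import torch
import torch.distributed as dist

# The launcher sets these for every worker it spawns. Older versions of
# torch.distributed.launch pass the local rank as a --local_rank argument
# instead of the LOCAL_RANK environment variable.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ.get("LOCAL_RANK", rank % torch.cuda.device_count()))

dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
```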

What errors did you encounter?

lhovon commented 1 year ago

Hi Michał, you're right. The conversation was locked, so I could not let you know sooner; I did not know about this method at the time. You can close this!

mmarcinkiewicz commented 1 year ago

Gotcha, thank you!