Closed: lhovon closed this pull request 1 year ago
MLCommons CLA bot:
Thank you for your submission, we really appreciate it. We ask that you sign our MLCommons CLA and be a member before we can accept your contribution. If you are interested in membership, please contact membership@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
✗ @jovonho
You can retrigger this bot by commenting recheck in this Pull Request
@mmarcinkiewicz could you review, please?
Hi @jovonho. I'm a bit confused why this PR is needed. The code works fine in a single-node docker-based system, just by invoking:
python -m torch.distributed.launch --nproc_per_node=8 main.py --data_dir /data --epochs 10000 --start_eval_at 100 --eval_every 20 [...]
What errors did you encounter?
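For context (not part of the original thread): `torch.distributed.launch` spawns one worker process per GPU and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` into each worker's environment, which is why no extra multi-GPU code path is needed. In recent PyTorch releases, `torch.distributed.launch` is deprecated in favor of `torchrun`; a roughly equivalent invocation (assuming the script reads `LOCAL_RANK` from the environment rather than from a `--local_rank` argument) would be:

```shell
# Sketch of the equivalent torchrun invocation (flag handling may
# differ: torchrun passes LOCAL_RANK via the environment, not argv).
torchrun --nproc_per_node=8 main.py \
    --data_dir /data \
    --epochs 10000 \
    --start_eval_at 100 \
    --eval_every 20
```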
Hi Michał, you're right. I didn't know about that launch method at the time, and the conversation was locked so I couldn't warn you. You can close this!
Gotcha, thank you!
Hi,
I made these changes to use multiple GPUs on a single machine for experiments I'm running for the MLCommons Storage WG.
How to use the current implementation for single-node multi-GPU training is not documented, and the code seems multi-node oriented, e.g. it requires the RANK environment variable to be defined. I developed these changes by following this guide: https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51, and validated them by training on 8 GPUs until the model converged to a mean Dice score of 0.908.
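As an aside, the DistributedDataParallel tutorials linked above typically resolve each worker's identity from the environment variables that the launcher exports, falling back to a single-process configuration when they are absent. A minimal, illustrative sketch of that pattern (the helper name is hypothetical, not from the PR):

```python
import os

def resolve_dist_env(default_world_size=1):
    """Resolve this process's distributed identity from the environment.

    torch.distributed.launch / torchrun export RANK, LOCAL_RANK and
    WORLD_SIZE for each spawned worker. When they are absent (plain
    `python main.py`), fall back to a single-process configuration.
    """
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", default_world_size))
    return rank, local_rank, world_size

# Single-process fallback when no launcher set the variables:
print(resolve_dist_env())  # -> (0, 0, 1) if RANK etc. are unset
```

In a real script, `local_rank` would then select the CUDA device (e.g. `torch.device(f"cuda:{local_rank}")`) before wrapping the model in DistributedDataParallel.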
I would like to know whether these changes are useful, and if not, how to use the current implementation for single-node multi-GPU training. If they are useful, we can discuss how best to integrate them. I tried to make the smallest possible set of changes and gated them behind a switch. I have tested with 1-8 GPUs, but have not done comprehensive regression testing of the other run modes.
Thanks