Closed forever208 closed 2 years ago
Instead of using mpiexec -n 16,
the following setting works on 2 nodes with 16 GPUs in total:
Then launch with mpirun python script/image_train.py
Another issue people might run into: remember to change the parameter GPUS_PER_NODE
in the script guided_diffusion/dist_util.py
if your cluster does not have 8 GPUs per node.
The default value set by the author is 8.
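To illustrate why GPUS_PER_NODE matters, here is a minimal sketch. It assumes each process picks its local GPU as its MPI rank modulo GPUS_PER_NODE; that is my reading of dist_util.py, not something confirmed in this thread. The helper names local_device_index and invalid_ranks are hypothetical and do not exist in the repo.

```python
from typing import List

# Illustrative only: mirrors the rank-to-GPU mapping that GPUS_PER_NODE is
# assumed to control. This is not the repository's code.


def local_device_index(rank: int, gpus_per_node: int) -> int:
    """Return the local GPU index assumed for a global MPI rank."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return rank % gpus_per_node


def invalid_ranks(world_size: int, configured: int, physical: int) -> List[int]:
    """List ranks whose assumed device index does not exist on a node
    that physically has `physical` GPUs."""
    return [
        rank
        for rank in range(world_size)
        if local_device_index(rank, configured) >= physical
    ]


if __name__ == "__main__":
    # Default of 8 on a node that only has 4 GPUs: ranks 4..7 point at
    # devices that are not there.
    print(invalid_ranks(world_size=8, configured=8, physical=4))  # [4, 5, 6, 7]
    # After setting the constant to 4, every rank maps to a real device.
    print(invalid_ranks(world_size=8, configured=4, physical=4))  # []
```

So on a 4-GPU node the default of 8 would leave half of the ranks without a valid device, which is why the constant has to match the hardware.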
Hi, I am having some trouble with mpiexec in this repo. I installed the libopenmpi package in a Docker image based on Ubuntu and am trying to run with multiple GPUs. However, when I run mpiexec -n 2 python script/image_train.py, training does not start and the process hangs somewhere. I wonder whether I need a specific setting in the Docker image. Can you help me with this?
Hi, I tried your mpirun method on 2 nodes with 16 GPUs. However, an NCCL error occurs when I do this.
  File "scripts/image_train.py", line 59, in main
    batch_size_sample=args.batch_size_sample,
  File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 75, in __init__
    self._load_and_sync_parameters()
  File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 130, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/dist_util.py", line 83, in sync_params
    dist.broadcast(p, 0)
  File "/mnt/cache/anaconda3/envs/improved_diffusion/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
If the number of GPUs is greater than 8 (each node has 8 GPUs), I have to train across several nodes.
In this case, running
mpiexec -n 16 python script/image_train.py
doesn't work. It fails with an NCCL error.
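One possible explanation, not confirmed anywhere in this thread: if all 16 processes land on a single 8-GPU node, or if ranks are spread across nodes in an unexpected order, two ranks can end up claiming the same GPU. The sketch below checks a given rank placement for that. It assumes the device is chosen as rank modulo GPUS_PER_NODE, and shared_gpus and the host names are hypothetical, made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative only: assumes each rank uses GPU index (rank % gpus_per_node)
# on whatever host the launcher placed it on.


def shared_gpus(
    hosts_by_rank: List[str], gpus_per_node: int
) -> Dict[Tuple[str, int], List[int]]:
    """Return {(host, device): [ranks]} for every device claimed by
    more than one rank under the assumed modulo mapping."""
    claims: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for rank, host in enumerate(hosts_by_rank):
        claims[(host, rank % gpus_per_node)].append(rank)
    return {key: ranks for key, ranks in claims.items() if len(ranks) > 1}


if __name__ == "__main__":
    # All 16 ranks on one 8-GPU node: every GPU is claimed twice.
    print(len(shared_gpus(["node0"] * 16, 8)))  # 8

    # Ranks 0..7 on node0 and 8..15 on node1: no GPU is shared.
    print(len(shared_gpus(["node0"] * 8 + ["node1"] * 8, 8)))  # 0

    # Ranks alternating between the two nodes: collisions again.
    print(len(shared_gpus(["node0", "node1"] * 8, 8)))  # 8
```

If the check reports shared devices for your placement, it may be worth looking at how the launcher distributes ranks across nodes before digging into NCCL itself.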