openai / guided-diffusion


training across multiple nodes does not work #22

Closed · forever208 closed this issue 2 years ago

forever208 commented 2 years ago

If the number of GPUs is greater than 8 (each node has 8 GPUs), then I have to train across several nodes.

In this case, running `mpiexec -n 16 python scripts/image_train.py` doesn't work.

It fails with an NCCL error.

forever208 commented 2 years ago

Instead of using `mpiexec -n 16`, the following Slurm settings work on 2 nodes with 16 GPUs:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8    # 8 GPUs for each node
```

Then launch with `mpirun python scripts/image_train.py`.
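
For anyone new to Slurm, a complete job script assembled from those settings might look like the sketch below. Everything outside the `#SBATCH` directives and the `mpirun` line (the job name, the environment-activation step) is a hypothetical placeholder for your cluster's setup, not something from this thread:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:8              # 8 GPUs for each node
#SBATCH --job-name=guided-diff    # hypothetical job name

# Activate your Python environment here (site-specific, hypothetical):
# source ~/envs/diffusion/bin/activate

# With the directives above, a Slurm-aware MPI launches 16 ranks
# (2 nodes x 8 tasks per node) without needing an explicit -n flag.
mpirun python scripts/image_train.py
```

Submit it with `sbatch <script name>`; `mpirun` then inherits the 2x8 task layout from Slurm.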

forever208 commented 2 years ago

Another issue people might hit: remember to change the parameter `GPUS_PER_NODE` in `guided_diffusion/dist_util.py` if your cluster does not have 8 GPUs per node.

The default value is 8, set by the author.
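
For context, the sketch below paraphrases how that constant pins each MPI rank to a GPU (based on `guided_diffusion/dist_util.py`; check your copy for the exact code, since details may vary between versions):

```python
# Sketch of the rank-to-GPU mapping in guided_diffusion/dist_util.py
# (paraphrased; not a verbatim copy of the file).
import os

from mpi4py import MPI

GPUS_PER_NODE = 8  # change this to match your cluster's GPUs per node


def pin_rank_to_gpu():
    # Each MPI rank is restricted to one GPU: its rank modulo the GPUs
    # per node. With GPUS_PER_NODE = 8 on a 4-GPU node, ranks 4-7 would
    # point at devices 4-7, which do not exist.
    local_gpu = MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_gpu)
```

Because the rank is taken modulo `GPUS_PER_NODE`, a value larger than the real GPU count points some ranks at devices that don't exist, while a smaller value stacks several ranks on the same device.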

furkan-celik commented 2 years ago

Hi, I am having some trouble with mpiexec and this repo. I installed the libopenmpi package in a Docker image based on Ubuntu and am trying to run with multiple GPUs. However, when I run `mpiexec -n 2 python scripts/image_train.py`, it doesn't start training and gets stuck somewhere. I wonder whether I need a specific setting in the Docker image. Can you help me with this?

Germany321 commented 1 year ago

Hi, I tried your mpirun method with 2 nodes and 16 GPUs. However, an NCCL error occurs when I do this:

File "scripts/image_train.py", line 59, in main batch_size_sample=args.batch_size_sample, File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 75, in __init__ self._load_and_sync_parameters() File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/train_util.py", line 130, in _load_and_sync_parameters dist_util.sync_params(self.model.parameters()) File "/mnt/lustre/data/research/workspace/guided-diffusion-main/guided_diffusion/dist_util.py", line 83, in sync_params dist.broadcast(p, 0) File "/mnt/cache/anaconda3/envs/improved_diffusion/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast work = default_pg.broadcast([tensor], opts) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).