openai / guided-diffusion


unable to train/sample using mpiexec on multiple GPUs #12

Open aksy1999 opened 2 years ago

aksy1999 commented 2 years ago

Thanks for providing the code implementation.

I am able to train and use the model on 1 GPU, but I am having issues when using multiple GPUs.

I am creating multiple processes using mpiexec as suggested in the repo (I tried mpiexec from both OpenMPI and MPICH and had the same issue with both).

Issue: For both sampling and training, the processes are created and the models load onto the GPUs, but I am not able to sample/train. I see no progress at all (it seems like a deadlock).

A) Below is an example of the command I am running for inference/sampling (as suggested in this repo, openai/guided-diffusion):

mpiexec -n 8 python classifier_sample.py --attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --classifier_scale 1.0 --classifier_path "models/256x256_classifier.pt" --model_path "models/256x256_diffusion.pt" --batch_size 1 --num_samples 4 --timestep_respacing 250

Problem A: The program stops at line 93 of classifier_sample.py, i.e., all_images.extend([sample.cpu().numpy() for sample in gathered_samples])
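To narrow this down, the gather pattern behind that line can be isolated into a throwaway script (a sketch of mine, not code from the repo; it only mimics the all_gather that line 93 feeds into, and uses torch.distributed.run instead of mpiexec for launching):

```python
# nccl_allgather_repro.py -- minimal check of the gather pattern that line 93 depends on.
# Launch with: python -m torch.distributed.run --nproc_per_node=8 nccl_allgather_repro.py
import os

import torch as th
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    th.cuda.set_device(local_rank)

    # Stand-in for one batch of generated samples on this rank.
    sample = th.full((1, 3, 256, 256), float(dist.get_rank()), device="cuda")

    # Same pattern as the sampling script: gather every rank's samples on every rank.
    gathered_samples = [th.zeros_like(sample) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered_samples, sample)  # the collective my real run never gets past

    all_images = [s.cpu().numpy() for s in gathered_samples]
    print(f"rank {dist.get_rank()}: gathered {len(all_images)} batches", flush=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```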

B) Below is an example of the command I am running for training (as suggested in the parent repo, openai/improved-diffusion):

mpiexec -n 8 python image_train.py --data_dir ./data_dir --image_size 256 --class_cond False --learn_sigma True --num_channels 256 --num_res_blocks 2 --num_head_channels 64 --attention_resolutions 32,16,8 --dropout 0.1 --diffusion_steps 1000 --noise_schedule linear --use_checkpoint True --use_scale_shift_norm True --resblock_updown True --use_fp16 True --use_new_attention_order True --lr 1e-4 --batch_size 32

Problem B: The program stops in the TrainLoop __init__ function, where DistributedDataParallel (DDP) is constructed, i.e., self.ddp_model = DDP( self.model, device_ids=[dist_util.dev()], output_device=dist_util.dev(), broadcast_buffers=False, bucket_cap_mb=128, find_unused_parameters=False,)
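For reference, the DDP constructor itself broadcasts the model parameters from rank 0 to the other ranks, so this is the first NCCL collective the training script hits. A toy sketch that isolates just that construction step (my own stand-in model, not the repo's UNet; launched with torch.distributed.run for simplicity):

```python
# ddp_init_repro.py -- check whether plain DDP construction hangs on this machine.
# Launch with: python -m torch.distributed.run --nproc_per_node=8 ddp_init_repro.py
import os

import torch as th
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    th.cuda.set_device(local_rank)

    model = th.nn.Linear(16, 16).cuda()  # toy stand-in for the diffusion UNet

    # The constructor broadcasts parameters from rank 0 to all ranks,
    # so a broken NCCL transport already deadlocks here.
    ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    print(f"rank {dist.get_rank()}: DDP constructed OK", flush=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```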

I have waited approximately 24 hours to see whether the code would make progress, but it did not. I have also tried other ways to create the processes, such as python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 and multiprocess.spawn, but they did not work either.

With this issue:

A) If possible, could you please provide the version details of all the dependencies, such as PyTorch, CUDA, cuDNN, Python, OpenMPI/MPICH, mpi4py, and so on? My problem may be due to a dependency version incompatibility.

I also built PyTorch from source with CUDA 11.2 and had the same issue.

B) Do you have any suggestions/insights for training? Did you observe any such behavior? Could you please suggest a training strategy for an ablation study?

Below are the dependency versions I am currently using (the issue is reproducible with these versions):

conda 4.10.3
Python 3.9.7
PyTorch 1.9.1 (py3.9_cuda11.1_cudnn8.0.5_0)
cudatoolkit 11.1.74
mpich 3.4.2
mpi4py 3.1.1
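(For completeness, a small snippet that collects these versions from inside the environment, including the NCCL build bundled with PyTorch; it is generic and not specific to this repo:)

```python
# print_versions.py -- collect the library versions relevant to this issue.
import sys

import torch
import mpi4py
from mpi4py import MPI

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL (bundled with torch):", torch.cuda.nccl.version())
print("mpi4py:", mpi4py.__version__)
print("MPI library:", MPI.Get_library_version().splitlines()[0])
```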

I will be happy to provide any other details related to the dependencies I am using.

guillaumejs2403 commented 2 years ago

Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 (check here). Try export NCCL_P2P_DISABLE=1 before launching; it worked for me.

python 3.8.11
pytorch 1.9.1
cudatoolkit 10.2.89
mpi4py 3.0.3
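If it is easier than exporting it in the shell, the same flag can also be set at the top of the entry script, as long as this happens before the NCCL process group is created (a sketch, not code from this repo):

```python
# Put this at the very top of image_train.py / classifier_sample.py,
# before dist_util / torch.distributed creates the NCCL communicator.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # work around the NCCL 2.7.8 peer-to-peer deadlock
```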

Kai-0515 commented 2 years ago

thx so much!!!

JiamingLiu-Jeremy commented 2 years ago

> Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 (check here). Try export NCCL_P2P_DISABLE=1 before launching; it worked for me.
>
> python 3.8.11
> pytorch 1.9.1
> cudatoolkit 10.2.89
> mpi4py 3.0.3

Thanks for the suggestions. Have you faced this problem when resuming training on multiple GPUs, e.g. when loading the checkpoint, optimizer state, and so forth?

vie131313 commented 4 months ago

> Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 (check here). Try export NCCL_P2P_DISABLE=1 before launching; it worked for me.
>
> python 3.8.11
> pytorch 1.9.1
> cudatoolkit 10.2.89
> mpi4py 3.0.3
>
> Thanks for the suggestions. Have you faced this problem when resuming training on multiple GPUs, e.g. when loading the checkpoint, optimizer state, and so forth?

I found that this code can only resume training with a single process; it does not work with multiple processes when I resume. Do you know why? Thanks very much!