huaaaliu / RGBX_Semantic_Segmentation


Stuck during dist.init_process_group(backend="nccl",rank=rank, world_size=self.world_size, init_method='env://') #43

Closed · zifuwan closed this issue 6 months ago

zifuwan commented 8 months ago

Hi, has anyone successfully launched the distributed training process? When I run CUDA_VISIBLE_DEVICES="2,3" python -m torch.distributed.launch --nproc_per_node=2 --node_rank=0 train.py -d 2,3, the program always gets stuck at dist.init_process_group(backend="nccl", rank=rank, world_size=self.world_size, init_method='env://').

Could you help me with this issue?

Thanks.
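For context, this is roughly how env:// initialization is expected to work under torch.distributed.launch. This is a minimal sketch, not the repo's actual train.py, and it assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE and passes --local_rank to each process:

```python
# Minimal sketch (assumed, not the repo's exact train.py) of env:// init when
# started via `python -m torch.distributed.launch --nproc_per_node=2 ...`.
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args, _ = parser.parse_known_args()

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Pin each process to its own GPU *before* creating the process group;
# if every rank ends up on the same device, NCCL init can hang.
torch.cuda.set_device(args.local_rank)

dist.init_process_group(backend="nccl", rank=rank,
                        world_size=world_size, init_method="env://")
print(f"rank {rank}/{world_size} initialized on cuda:{args.local_rank}")
dist.destroy_process_group()
```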

jamycheung commented 6 months ago

Hi, thank you for your interest.

Regarding the DDP training process, we are able to run it on two or four GPUs with the provided script. Maybe you could check the configuration of your machine. Otherwise, I also suggest having a look at our new implementation in the DELIVER repository.

Let me know if you can make it or not.
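One way to check the machine setup is a standalone NCCL sanity test that is independent of this repo. This is my own sketch: if it also hangs at init or all_reduce, the problem is the NCCL/driver configuration rather than the training code.

```python
# Standalone NCCL sanity check (a sketch, independent of this repo).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)  # every rank should end up with the value world_size
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```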

LeonSakura commented 4 months ago

Hello, I ran into this problem as well. Have you solved it? Two-GPU training gets stuck right after the PyTorch version is printed, while single-GPU training works fine.

zifuwan commented 4 months ago

Hi, I don't quite remember how I solved this issue, but you can try adding NCCL_P2P_DISABLE=1 at the front of the command. It would look something like: NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="0,1,2,3" python -m torch.distributed.launch --nproc_per_node=4 --master_port 29502 train.py -p 29502 -d 0,1,2,3 -n "dataset_name"
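If editing the launch command is inconvenient, the same workaround can also be applied from inside the script, assuming it runs before the process group is created (a sketch, not the repo's code):

```python
# Sketch: disable NCCL peer-to-peer transfers from within the script.
# This must run before dist.init_process_group, since NCCL reads the
# variable when it builds its communicators. NCCL_DEBUG=INFO is optional
# and only makes a possible hang easier to diagnose.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ.setdefault("NCCL_DEBUG", "INFO")
```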