Closed — zifuwan closed this issue 6 months ago
Hi, thank you for your interest.
Regarding the DDP training process, we are able to run it on two or four GPUs with the provided script. You might want to check the configuration on your machine. Otherwise, I would also suggest taking a look at our new implementation in the DELIVER repository.
Let me know whether it works for you.
Hi, I ran into the same problem. Have you solved it? Two-GPU training gets stuck right after printing the PyTorch version, while single-GPU training works fine.
Hi, I don't quite remember how I solved this issue, but you can try adding NCCL_P2P_DISABLE=1 in front of the launch command. That would look something like: NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="0,1,2,3" python -m torch.distributed.launch --nproc_per_node=4 --master_port 29502 train.py -p 29502 -d 0,1,2,3 -n "dataset_name"
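In case it helps with debugging, here is a minimal NCCL smoke test, separate from train.py, that can be launched the same way. The script name nccl_check.py and the two-GPU setup are just assumptions for illustration. If it hangs without NCCL_P2P_DISABLE=1 but completes with it, the problem is the peer-to-peer path between the GPUs rather than the training code.

```python
# nccl_check.py -- minimal NCCL smoke test (a sketch, not part of this repo).
# Launch it the same way as train.py, e.g.:
#   NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="0,1" \
#     python -m torch.distributed.launch --nproc_per_node=2 nccl_check.py
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# older torch.distributed.launch passes --local_rank; newer versions set LOCAL_RANK instead
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # pin this process to its own GPU before init
dist.init_process_group(backend="nccl", init_method="env://")

# a single all_reduce forces inter-GPU communication; if this returns, NCCL works
t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce sum = {t.item()}")

dist.destroy_process_group()
```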
Hi, has anyone successfully launched the distributed training process? When I test with
CUDA_VISIBLE_DEVICES="2,3" python -m torch.distributed.launch --nproc_per_node=2 --node_rank=0 train.py -d 2,3
the program always gets stuck at dist.init_process_group(backend="nccl", rank=rank, world_size=self.world_size, init_method='env://'). Could you help me with this issue?
Thanks.
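For anyone hitting the same hang at init_process_group: with init_method='env://', the rendezvous is driven entirely by environment variables that torch.distributed.launch is supposed to set in each worker. A quick sketch for checking what each worker actually receives is below; the script name env_check.py is just an assumption for illustration. If those variables look fine, rerunning with NCCL_DEBUG=INFO and the NCCL_P2P_DISABLE=1 workaround mentioned above are the next things worth trying.

```python
# env_check.py -- print the rendezvous variables that init_method='env://' depends on.
# Run it through the same launcher to see what each worker process receives, e.g.:
#   CUDA_VISIBLE_DEVICES="2,3" python -m torch.distributed.launch --nproc_per_node=2 env_check.py
import os
import sys

print("argv:", sys.argv)  # older launchers pass --local_rank here instead of setting LOCAL_RANK
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK",
            "NCCL_P2P_DISABLE", "NCCL_DEBUG"):
    print(var, "=", os.environ.get(var))
```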