Closed · zaiquanyang closed this issue 2 weeks ago

When I try to launch DDP training on two GPUs, it seems that local_rank is not passed to each process correctly: I checked, and local_rank is 0 in both processes.
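As a hypothetical probe (not part of the original report), one way to see what the launcher actually hands each worker is to print the command-line arguments and the rank-related environment variables at the top of the entry script:

```python
# probe_rank.py -- hypothetical diagnostic; launch it the same way as train_dist_mod.py, e.g.
#   CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 probe_rank.py
import os
import sys

# Depending on the PyTorch version, the launcher passes --local_rank on the command line,
# sets the LOCAL_RANK environment variable, or both.
rank_env = {k: os.environ.get(k) for k in ("LOCAL_RANK", "RANK", "WORLD_SIZE")}
print(f"pid={os.getpid()} argv={sys.argv[1:]} env={rank_env}")
```

If both workers report an empty argv and only the environment variables are set, an argparse-defined --local_rank argument keeps its default value, which would explain seeing 0 in both processes.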
Hello, did you change CUDA_VISIBLE_DEVICES=0,1,2,3 to 0,1 and --nproc_per_node 4 to --nproc_per_node 2 in the .sh file?
Yes:

```bash
TORCH_DISTRIBUTED_DEBUG=INFO PYTHONPATH="$PWD" CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node 2 \
    --master_port 4446 \
    train_dist_mod.py \
    --num_decoder_layers 6 \
    --use_color \
    --debug \
    --weight_decay 0.0005 \
    --data_root "${DATA_ROOT}" \
    --val_freq 1 --batch_size 4 --save_freq 1 --print_freq 500 \
    --num_workers 4 \
    --lr_backbone=2e-3 \
    --lr=2e-4 \
    --dataset scanrefer \
    --test_dataset scanrefer \
    --detect_intermediate \
    --joint_det \
    --use_soft_token_loss \
    --use_contrastive_align \
    --log_dir "${DATA_ROOT}/output/logs/" \
    --lr_decay_epochs 50 75 \
    --pp_checkpoint "${DATA_ROOT}/checkpoints/gf_detector_l6o256.pth" \
    --butd \
    --self_attend \
    --augment_det \
    --max_epoch 100 \
    --model MCLN \
    --exp MCLN
```
I am not sure whether it is caused by a different torch version (mine is 2.0.0). For now, I have switched to reading local_rank from the environment variable. In any case, this is the first time I've encountered this problem.
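For reference, here is a minimal sketch of that workaround, assuming the script defines a --local_rank argparse argument the way scripts written for the old torch.distributed.launch usually do; the fallback covers launchers such as torchrun (or launch running in --use_env mode) that set the LOCAL_RANK environment variable instead of passing --local_rank on the command line:

```python
import argparse
import os

import torch

parser = argparse.ArgumentParser()
# Older launchers pass --local_rank on the command line; newer ones may not.
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

# Fall back to the LOCAL_RANK environment variable set by torchrun
# (and by torch.distributed.launch when it runs in --use_env mode).
if args.local_rank < 0:
    args.local_rank = int(os.environ.get("LOCAL_RANK", 0))

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl")
print(f"rank {torch.distributed.get_rank()} -> local_rank {args.local_rank}")
```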
Maybe it's caused by a different torch version; neither I nor others who use my version have run into this problem.
Thanks for your reply!