Hi, I have run into a problem. I have a single server with 8 GPUs, running Ubuntu 16.04, PyTorch 1.4, and CUDA 10.0.
I get an error when I launch training with the following commands:
CUDA_VISIBLE_DEVICES=0 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
--checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
--cuda --sent_weight 0.25 | tee log/$id-0 &
CUDA_VISIBLE_DEVICES=1 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
--checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
--cuda --sent_weight 0.25 | tee log/$id-1 &
CUDA_VISIBLE_DEVICES=2 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
--checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
--cuda --sent_weight 0.25 | tee log/$id-2 &
CUDA_VISIBLE_DEVICES=3 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
--checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
--cuda --sent_weight 0.25 | tee log/$id-3
ValueError: Error initializing torch.distributed using file:// rendezvous: rank parameter missing
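For context, my understanding (an assumption on my part, not something from my training script) is that the file:// rendezvous cannot infer the process rank on its own, so rank and world_size have to be supplied explicitly, either as query parameters on the URL (e.g. ?rank=0&world_size=4) or as arguments to init_process_group. A minimal sketch of what I believe the initialization would look like, using the same shared file path as in the commands above:

import torch.distributed as dist

# Sketch only, assuming the file:// init method and the NCCL backend;
# with file:// the rank is not inferred, so rank and world_size must be
# passed explicitly (here rank would be 0..3, one value per process).
dist.init_process_group(
    backend="nccl",
    init_method="file:///data/luwantong/nonexistent_file",
    rank=0,
    world_size=4,
)

Is this the intended way to pass the rank here, or does scripts/train.py expect a separate flag for it?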