salesforce / densecap

BSD 3-Clause "New" or "Revised" License

rank parameter missing #40

Open tuyunbin opened 4 years ago

tuyunbin commented 4 years ago

Hi, I've run into a problem. I have a single server with 8 GPUs, running Ubuntu 16.04, PyTorch 1.4, and CUDA 10.0. I get an error when I launch training with the following command:

```
CUDA_VISIBLE_DEVICES=0 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
    --checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
    --cuda --sent_weight 0.25 | tee log/$id-0 &
CUDA_VISIBLE_DEVICES=1 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
    --checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
    --cuda --sent_weight 0.25 | tee log/$id-1 &
CUDA_VISIBLE_DEVICES=2 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
    --checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
    --cuda --sent_weight 0.25 | tee log/$id-2 &
CUDA_VISIBLE_DEVICES=3 python3 scripts/train.py --dist_url 'file:///data/luwantong/nonexistent_file' --cfgs_file cfgs/yc2.yml \
    --checkpoint_path ./checkpoint/$id --batch_size 14 --world_size 4 \
    --cuda --sent_weight 0.25 | tee log/$id-3
```

It fails with:

```
ValueError: Error initializing torch.distributed using file:// rendezvous: rank parameter missing
```

LuoweiZhou commented 4 years ago

The code is only compatible with an old version of PyTorch (0.4.0), which uses a different, older distributed data parallel package.
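For reference, the error itself comes from newer PyTorch versions requiring `rank` (and `world_size`) to be passed explicitly to `init_process_group` when using a `file://` rendezvous, since the file store cannot assign ranks automatically. A minimal single-process sketch of an initialization that newer PyTorch accepts (assuming the `gloo` backend is available; the shared-file path here is just an illustration):

```python
import os
import tempfile

import torch.distributed as dist

# A file:// rendezvous needs a path on a filesystem shared by all workers.
# Hypothetical path for illustration only.
rendezvous = "file://" + os.path.join(tempfile.gettempdir(), "ddp_rendezvous")

# On PyTorch >= 1.x, omitting rank/world_size with a file:// init_method
# raises "ValueError: ... rank parameter missing". Passing them explicitly
# (here as a trivial single-process group) avoids the error.
dist.init_process_group(
    backend="gloo",
    init_method=rendezvous,
    world_size=1,
    rank=0,
)

assert dist.get_rank() == 0
dist.destroy_process_group()
```

In a real multi-process launch, each worker would pass its own `rank` (0 through `world_size - 1`) while sharing the same rendezvous file.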