Closed: ZhongXiaoFang closed this issue 5 years ago
@megvii-wzc @fenglinglwb
Are all prerequisites satisfied? If yes, could you show me the training command?
@fenglinglwb thank you for your answer
I have the same issue. All requirements are installed, with Python 3.7.3.
Error occurs with: python -m torch.distributed.launch --nproc_per_node=1 train.py
or with
python train.py
Solved, opening a PR :)
Hello, thank you for this well-done work. I ran into a problem when running the training command as given. The output looks like this:
Traceback (most recent call last):
  File "train.py", line 117, in <module>
    main()
  File "train.py", line 53, in main
    data_loader = get_train_loader(cfg, num_gpu=num_gpu, is_dist=True)
  File "/home/zhong/MSPN-master/lib/utils/dataloader.py", line 31, in get_train_loader
    sampler = torch_samplers.DistributedSampler(dataset, shuffle=is_shuffle)
  File "/home/zhong/MSPN-master/cvpack/dataset/torch_samplers/distributed.py", line 29, in __init__
    num_replicas = dist.get_world_size()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 584, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 200, in _get_group_size
    _check_default_pg()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1
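The assertion is raised because `dist.get_world_size()` is called before any default process group exists. The thread does not show what the PR changed, so the following is only a minimal sketch of one common way to avoid the error: initialise the group from the environment variables exported by `torch.distributed.launch`, and fall back to a single process otherwise. The helper names `launcher_env_present` and `ensure_process_group` are hypothetical and not part of MSPN, and the `nccl` default assumes GPU training.

```python
import os

# Variables read by init_method="env://"; torch.distributed.launch exports
# them for every worker process it spawns.
REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def launcher_env_present(env=None):
    """Return True if every variable needed for env:// rendezvous is set."""
    env = os.environ if env is None else env
    return all(name in env for name in REQUIRED_ENV)


def ensure_process_group(backend="nccl"):
    """Initialise the default process group once and return the world size.

    Hypothetical helper (an assumption, not MSPN code). Returns 1 when the
    script was started with plain `python train.py`, in which case the
    caller should not build a DistributedSampler.
    """
    # Imported lazily so the environment check above works without PyTorch.
    import torch.distributed as dist

    if not dist.is_available() or not launcher_env_present():
        return 1  # single-process run, no process group to create
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size()
```

Under these assumptions, `ensure_process_group()` would be called in `main()` before `get_train_loader(...)`, and `is_dist` would be set to whether the returned world size is greater than 1. Pass `backend="gloo"` for a CPU-only run.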