megvii-research / MSPN

Multi-Stage Pose Network

About a training question #4

Closed ZhongXiaoFang closed 5 years ago

ZhongXiaoFang commented 5 years ago

Hello, thank you for this excellent work. I ran into a problem when running the training command as given:

Traceback (most recent call last):
  File "train.py", line 117, in <module>
    main()
  File "train.py", line 53, in main
    data_loader = get_train_loader(cfg, num_gpu=num_gpu, is_dist=True)
  File "/home/zhong/MSPN-master/lib/utils/dataloader.py", line 31, in get_train_loader
    sampler = torch_samplers.DistributedSampler(dataset, shuffle=is_shuffle)
  File "/home/zhong/MSPN-master/cvpack/dataset/torch_samplers/distributed.py", line 29, in __init__
    num_replicas = dist.get_world_size()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 584, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 200, in _get_group_size
    _check_default_pg()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1
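
The assertion is raised because dist.get_world_size() is called before a default process group exists. Below is a minimal sketch of a defensive guard, assuming it is acceptable to treat an uninitialized run as a single process. The helper names is_dist_ready, get_world_size_safe, and get_rank_safe are hypothetical and are not part of MSPN or cvpack, and this is not necessarily how the repository resolved the issue.

```python
try:
    import torch.distributed as dist
except ImportError:
    # Assumption: allow the sketch to be imported on a machine without PyTorch.
    dist = None


def is_dist_ready():
    """Return True only if torch.distributed is usable and a default group exists."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def get_world_size_safe():
    """World size of the default group, or 1 when no group has been initialized."""
    return dist.get_world_size() if is_dist_ready() else 1


def get_rank_safe():
    """Rank within the default group, or 0 when no group has been initialized."""
    return dist.get_rank() if is_dist_ready() else 0
```

A sampler could call get_world_size_safe() instead of dist.get_world_size(), so that a plain single-process run does not hit the assertion. A real multi-GPU run still needs the process group to be initialized beforehand.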

ZhongXiaoFang commented 5 years ago

@megvii-wzc @fenglinglwb

fenglinglwb commented 5 years ago

Are all prerequisites satisfied? If yes, could you show me the training command?

ZhongXiaoFang commented 5 years ago

@fenglinglwb Thank you for your answer.

achigeor commented 5 years ago

I have the same issue. All requirements are installed, with Python 3.7.3.

Error occurs with: python -m torch.distributed.launch --nproc_per_node=1 train.py or with python train.py
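
Both commands end in the same assertion, which suggests the process group is never created before the sampler is built. Below is a sketch of one way to prepare for that step, assuming the env:// init method of torch.distributed. The helper ensure_dist_env is hypothetical, and the default address and port are illustrative values only.

```python
import os


def ensure_dist_env(rank=0, world_size=1, addr="127.0.0.1", port=29500):
    """Fill in the variables that init_method="env://" reads.

    Values already exported by a launcher such as torch.distributed.launch
    are kept; only missing ones receive the single-process defaults.
    """
    defaults = {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in defaults}


# Assumed usage with PyTorch installed, before the data loader is created:
#   import torch.distributed as dist
#   ensure_dist_env()
#   dist.init_process_group(backend="nccl", init_method="env://")
# The "gloo" backend can be used instead of "nccl" on machines without GPUs.
```

Using setdefault means the helper does nothing when the launcher has already set these variables, and otherwise lets python train.py run as a one-process group.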

achigeor commented 5 years ago

Solved, opening a PR :)