jackroos / VL-BERT

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
MIT License

subprocess.CalledProcessError: Command xxx returned non-zero exit status 1. #41

Closed. gm0616 closed this issue 3 years ago

gm0616 commented 4 years ago

When I try to train on the VCR dataset with the command ./scripts/dist_run_single.sh 1 vcr/train_end2end.py ./cfgs/vcr/base_q2a_4x16G_fp32.yaml ./, I get the following error:

Traceback (most recent call last):
  File "vcr/train_end2end.py", line 59, in <module>
    main()
  File "vcr/train_end2end.py", line 53, in main
    rank, model = train_net(args, config)
  File "/gruntdata/guimin.gm/vlbert/vcr/../vcr/function/train.py", line 87, in train_net
    group_name='mtorch')
  File "/home/guimin.gm/miniconda3/envs/pt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/guimin.gm/miniconda3/envs/pt/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "./scripts/launch.py", line 200, in <module>
    main()
  File "./scripts/launch.py", line 196, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/guimin.gm/miniconda3/envs/pt/bin/python', '-u', 'vcr/train_end2end.py', '--cfg', './cfgs/vcr/base_q2a_4x16G_fp32.yaml', '--model-dir', './', '--dist']' returned non-zero exit status 1.

I haven't found a solution. Can anybody help me? Thanks a lot.

liulijie-2020 commented 4 years ago

The same problem happens when I try distributed training. Why not try non-distributed training?

jackroos commented 4 years ago

It seems the port is already in use by another program. Could you try changing the port in: https://github.com/jackroos/VL-BERT/blob/4373674cbf2bcd6c09a2c26abfdb6705b870e3be/scripts/launch.py#L134
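
To pick a port that is not already taken, something like the following might help. This is only a minimal sketch using the Python standard library; `find_free_port` and `port_in_use` are hypothetical helper names, not functions from this repository, and the port can still be grabbed by another process between the check and the actual launch.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused on `host`.

    Hypothetical helper, not part of VL-BERT.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails, i.e. the port is taken.

    Hypothetical helper, not part of VL-BERT.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "Address already in use"
            return True
        return False


if __name__ == "__main__":
    print("free port:", find_free_port())
```

The printed port could then be used as the master port in the line of scripts/launch.py linked above.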

manibharathy1 commented 2 years ago

Can anyone help me solve this? I got the following error while running training:

AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 -m launch_ddp --config configs/dist-training-config.yaml"
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/launch_ddp.py", line 42, in <module>
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=joint_cmd)
subprocess.CalledProcessError: Command 'python -m torch.distributed.launch --nnodes 1 --node_rank 0 --nproc_per_node 1 --master_addr algo-1 --master_port 55555 /opt/ml/code/train.py --config configs/dist-training-config.yaml' returned non-zero exit status 1., exit code: 1
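
The CalledProcessError above only says that the child command exited with status 1; the actual cause is whatever the child process printed before it exited. A minimal sketch of how a launcher could surface that output is shown here. `run_and_report` is a hypothetical helper, not code from launch_ddp.py or this repository, and it buffers the child's output in memory, so it is meant for debugging short runs rather than long training jobs.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run `cmd`; if it fails, print the child's stderr, then raise.

    Hypothetical helper for debugging, not part of any launcher in this thread.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    if result.returncode != 0:
        # Show the child's own traceback, which holds the real error
        sys.stderr.write(result.stderr)
        raise subprocess.CalledProcessError(
            returncode=result.returncode,
            cmd=cmd,
            output=result.stdout,
            stderr=result.stderr,
        )
    return result.stdout
```

Calling it with the failing command as a list of arguments should print the child's traceback before the CalledProcessError is raised.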