gm0616 closed this issue 3 years ago
I hit the same problem when trying distributed training. Why not try non-distributed training?
It seems the port is already in use by another program. Could you try changing the port in: https://github.com/jackroos/VL-BERT/blob/4373674cbf2bcd6c09a2c26abfdb6705b870e3be/scripts/launch.py#L134
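If you are not sure which port to pick, a quick check like the one below may help. It is only a sketch using the Python standard library: the helper names `port_in_use` and `find_free_port` are made up for illustration and are not part of VL-BERT, and it assumes the launcher runs on the same machine you run the check on.

```python
import socket
from contextlib import closing


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on (host, port)."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        # connect_ex returns 0 when the connection succeeds,
        # which means another process is listening on that port.
        return s.connect_ex((host, port)) == 0


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port and return its number."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        # Binding to port 0 lets the OS choose any free port.
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print("suggested free port:", find_free_port())
```

Note that the port is released as soon as the helper returns, so another process could still grab it before training starts. If that happens, just run it again and use the new number in the launch script.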
Can anyone help me solve this? I got the following error while running training:
AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.6 -m launch_ddp --config configs/dist-training-config.yaml"
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/launch_ddp.py", line 42, in
When I try training on the VCR dataset with the command
./scripts/dist_run_single.sh 1 vcr/train_end2end.py ./cfgs/vcr/base_q2a_4x16G_fp32.yaml ./
I got an error like this: I haven't found a solution. Can anybody help me? Thanks a lot.