ERROR:torch.distributed.elastic.multiprocessing.api:failed

wadoodbaig commented 1 year ago

Thanks for the great repository! When I tried to run bash ../../tools/dist_run.sh ../../tools/data/custom_2d_skeleton.py 4 --video-list custom_list.list --out custom_annos.pkl in diving48_example.ipynb to create annotations, the following error occurred:

/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

FutureWarning, WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 7787) of binary: /home/ubuntu/miniconda3/envs/aiguard/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../tools/data/custom_2d_skeleton.py FAILED

Failures:
[1]:
  time      : 2022-10-31_14:41:01
  host      : ip-172-31-8-38.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 7788)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7788
[2]:
  time      : 2022-10-31_14:41:01
  host      : ip-172-31-8-38.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 7789)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7789
[3]:
  time      : 2022-10-31_14:41:01
  host      : ip-172-31-8-38.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -11 (pid: 7790)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7790

Root Cause (first observed failure):
[0]:
  time      : 2022-10-31_14:41:01
  host      : ip-172-31-8-38.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 7787)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7787

System requirements: python=3.8, torch=1.11, mmcv-full=1.5.0, mmdet=2.24.0, mmpose=0.29.0, GPU: Tesla T4, Ubuntu 20.04

@kennymckormick, kindly check this error and guide me if you can.

yuchen-ji commented 1 year ago

I'm facing the same problem. Have you fixed it?

666tua commented 1 year ago

I'm facing the same problem. Have you fixed it?

kennymckormick commented 1 year ago

Hi wadoodbaig, according to the command you ran, you are trying to use 4 GPUs for skeleton extraction. First, check that this node actually has 4 GPUs. Also check that the paths in custom_list.list are correct relative to your current working directory.
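Both checks can be scripted before launching the distributed job. The following is only a minimal sketch: the helper names visible_gpu_count and missing_video_paths are hypothetical (they are not part of pyskl), and it assumes that each non-empty line of custom_list.list is either a video path or a video path followed by a label, separated by whitespace.

```python
from pathlib import Path


def visible_gpu_count():
    """Return the number of CUDA devices PyTorch can see, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def missing_video_paths(list_file):
    """Return the video paths listed in list_file that do not exist on disk.

    Assumption: each non-empty line is "<video_path>" or "<video_path> <label>".
    Relative paths are resolved against the current working directory, so run
    this from the same directory you launch dist_run.sh from.
    """
    missing = []
    for line in Path(list_file).read_text().splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not Path(fields[0]).exists():
            missing.append(fields[0])
    return missing
```

If visible_gpu_count() returns fewer than 4, pass that smaller number to dist_run.sh instead of 4. If missing_video_paths("custom_list.list") returns a non-empty list, fix those entries first. Note that this sketch splits on whitespace, so it will misreport paths that contain spaces.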

kennymckormick commented 1 year ago

I recently ran into this problem as well. My guess is that a newer version of gcc was used to compile the OpenMMLab codebases, which leads to these errors. I have now fixed it by using a very specific conda environment for this project. Please follow the new installation guide to reinstall pyskl and see whether the problem is resolved. Sorry for the late fix.
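To see which step crashes before and after reinstalling, one option is to run a single Python process with faulthandler enabled, so that a Python-level traceback is dumped when the process receives SIGSEGV. The sketch below is an illustration rather than part of pyskl: describe_exit and run_with_faulthandler are hypothetical helper names, and it relies only on the standard library.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by
        # the signal with that number (for example, -11 is signal 11).
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_with_faulthandler(script_args):
    """Run the current Python interpreter with faulthandler enabled on script_args."""
    # -X faulthandler makes the interpreter print the Python traceback
    # to stderr when a fatal signal such as SIGSEGV is received.
    proc = subprocess.run([sys.executable, "-X", "faulthandler", *script_args])
    return describe_exit(proc.returncode)
```

For example, run_with_faulthandler(["-c", "import mmcv, mmdet, mmpose"]) would show whether simply importing the compiled OpenMMLab packages is enough to trigger the crash. That particular command is only a suggested starting point, not a confirmed reproduction of this issue.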

kennymckormick / pyskl

ERROR:torch.distributed.elastic.multiprocessing.api:failed #98