kennymckormick / pyskl

A toolbox for skeleton-based action recognition.
Apache License 2.0
959 stars, 182 forks

ERROR:torch.distributed.elastic.multiprocessing.api:failed #98

Open wadoodbaig opened 1 year ago

wadoodbaig commented 1 year ago

Thanks for the great repository! When I tried to run bash ../../tools/dist_run.sh ../../tools/data/custom_2d_skeleton.py 4 --video-list custom_list.list --out custom_annos.pkl in diving48_example.ipynb to create the annotations, the following error occurred:

/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 7787) of binary: /home/ubuntu/miniconda3/envs/aiguard/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../tools/data/custom_2d_skeleton.py FAILED

Failures:
[1]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 1 (local_rank: 1)
  exitcode : -11 (pid: 7788)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7788
[2]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 2 (local_rank: 2)
  exitcode : -11 (pid: 7789)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7789
[3]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 3 (local_rank: 3)
  exitcode : -11 (pid: 7790)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7790

Root Cause (first observed failure):
[0]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 7787)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7787

System environment: python=3.8, torch=1.11, mmcv-full=1.5.0, mmdet=2.24.0, mmpose=0.29.0, GPU: Tesla T4, OS: Ubuntu 20.04

@kennymckormick, could you please take a look at this error and advise?

yuchen-ji commented 1 year ago

I'm facing the same problem. Have you fixed it?

666tua commented 1 year ago

I'm facing the same problem. Have you fixed it?

kennymckormick commented 1 year ago

Hi wadoodbaig, according to the command you ran, you are trying to use 4 GPUs for skeleton extraction. One thing to check is whether this node actually has 4 GPUs. You should also check that the paths in custom_list.list are correct relative to your current working directory.
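Both checks can be scripted before launching the job. The following is a minimal sketch, not part of pyskl: the helper names missing_videos and visible_gpu_count are hypothetical, and it assumes each non-empty line of custom_list.list starts with a video path (optionally followed by a label) and that the paths contain no spaces.

```python
import os


def missing_videos(list_file, base_dir="."):
    """Return the video paths in list_file that do not exist under base_dir.

    Assumption: the first whitespace-separated field on each line is the
    video path; any remaining field (e.g. a label) is ignored.
    """
    missing = []
    with open(list_file) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            path = parts[0]
            # os.path.join leaves absolute paths unchanged
            if not os.path.isfile(os.path.join(base_dir, path)):
                missing.append(path)
    return missing


def visible_gpu_count():
    """Return the number of CUDA devices PyTorch can see, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()
```

Calling missing_videos("custom_list.list") from the directory where dist_run.sh is launched should return an empty list, and visible_gpu_count() should be at least as large as the GPU count passed to dist_run.sh (4 in the command above).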

kennymckormick commented 1 year ago

I recently ran into this problem as well. My guess is that a newer version of gcc was used to compile the OpenMMLab codebases, which leads to these errors. I have fixed it by pinning a very specific conda environment for this project. Please follow the new installation guide to reinstall pyskl and see whether the problem is resolved. Sorry for the late fix.
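When comparing an existing environment against the installation guide, it can help to record which versions are actually installed. A minimal sketch, assuming Python 3.8 or newer (for importlib.metadata) and the distribution names listed in the original report; the helper name installed_versions is hypothetical and not part of pyskl.

```python
from importlib import metadata


def installed_versions(packages=("torch", "mmcv-full", "mmdet", "mmpose")):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this before and after reinstalling shows whether the environment actually changed; it only reports versions and does not detect which compiler the packages were built with.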