Open wadoodbaig opened 1 year ago
Thanks for the great repository! When I tried to run bash ../../tools/dist_run.sh ../../tools/data/custom_2d_skeleton.py 4 --video-list custom_list.list --out custom_annos.pkl in diving48_example.ipynb to create annotations, the following error occurred:
/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 7787) of binary: /home/ubuntu/miniconda3/envs/aiguard/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/aiguard/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../../tools/data/custom_2d_skeleton.py FAILED
Failures:
[1]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 1 (local_rank: 1)
  exitcode : -11 (pid: 7788)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7788
[2]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 2 (local_rank: 2)
  exitcode : -11 (pid: 7789)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7789
[3]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 3 (local_rank: 3)
  exitcode : -11 (pid: 7790)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7790
Root Cause (first observed failure):
[0]:
  time : 2022-10-31_14:41:01
  host : ip-172-31-8-38.ec2.internal
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 7787)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 7787
System requirements: python=3.8, torch=1.11, mmcv-full=1.5.0, mmdet=2.24.0, mmpose=0.29.0, GPU: Tesla T4, Ubuntu 20.04
@kennymckormick kindly check this error and guide me if you can..!!
I am facing the same problem. Have you fixed it?
Hi wadoodbaig, according to the command you ran, you are trying to use 4 GPUs for skeleton extraction. One thing to check is whether you actually have 4 GPUs on this node. You should also check that the paths in custom_list.list are correct relative to your current working directory.
I recently ran into this problem as well. My guess is that the open-mmlab codebases were compiled with a newer version of gcc, which leads to these errors. I have now fixed it by using a very specific conda environment for this project. Please follow the new installation guide to reinstall pyskl and see whether the problem is resolved. Sorry for the late fix.
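To compare your environment against the one in the installation guide after reinstalling, a small version report may help. This is a sketch only: the package list is taken from the versions quoted in the original post, not from the installation guide, and installed_versions is a hypothetical helper. It needs Python 3.8 or newer for importlib.metadata.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward

# Distribution names as quoted in the original post (an assumption; adjust
# to whatever the installation guide actually pins).
PACKAGES = ["torch", "mmcv-full", "mmdet", "mmpose"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in installed_versions().items():
        print(name, version or "not installed")
```

Run it inside the activated conda environment, and paste the output into the thread if the segmentation fault persists.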