kennymckormick / pyskl

A toolbox for skeleton-based action recognition.
Apache License 2.0
972 stars, 186 forks

Extracting 2D Poses Using diving48_example.ipynb #45

Open Gigi-Al opened 2 years ago

Gigi-Al commented 2 years ago

Hello,

Thank you for sharing this great repository. I want to train and test the model on other datasets. However, when I use diving48_example.ipynb to extract pose data from Diving48 following your instructions, I encounter this error:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 234913) of binary: /usr/bin/python3.8
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/galinezh/pyskl/tools/data/custom_2d_skeleton.py FAILED

Failures:
[1]:
  time : 2022-06-13_23:04:55
  host : coe54000151lws.dyn.uncc.edu
  rank : 1 (local_rank: 1)
  exitcode : -11 (pid: 234914)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234914
[2]:
  time : 2022-06-13_23:04:55
  host : coe54000151lws.dyn.uncc.edu
  rank : 2 (local_rank: 2)
  exitcode : -11 (pid: 234915)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234915
[3]:
  time : 2022-06-13_23:04:55
  host : coe54000151lws.dyn.uncc.edu
  rank : 3 (local_rank: 3)
  exitcode : -11 (pid: 234916)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234916

Root Cause (first observed failure):
[0]:
  time : 2022-06-13_23:04:55
  host : coe54000151lws.dyn.uncc.edu
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 234913)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234913

Can anybody help me solve this issue? Thanks!
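
For readers decoding the log above: a negative exit code from a child process means it was killed by the signal with that number, so `exitcode: -11` corresponds to signal 11 (SIGSEGV), i.e. a crash in native code rather than a Python exception. The snippet below is a minimal sketch of that mapping using only the standard library; the helper name `describe_exitcode` is hypothetical and not part of pyskl or PyTorch.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a child-process exit code into a readable description.

    Hypothetical helper for illustration; not part of pyskl or PyTorch.
    """
    if exitcode < 0:
        # A negative code means the child was killed by signal number -exitcode.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"killed by unknown signal {-exitcode}"
    return "success" if exitcode == 0 else f"exited with status {exitcode}"


# The exit code reported for every rank in the log above.
print(describe_exitcode(-11))
```

Because the crash happens below the Python level, the launcher traceback does not show where it occurred. Setting the environment variable `PYTHONFAULTHANDLER=1` before launching makes each Python process dump its own stack on a segmentation fault, which may help narrow down the failing call.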

kennymckormick commented 2 years ago

Sorry, I could not figure out what the problem is. It seems to be a PyTorch problem.

Gigi-Al commented 2 years ago

Thanks for the support. Will you add another script for extracting 2D keypoints from custom datasets? Or add the extracted Toyota Smarthome dataset to your repo?

kennymckormick commented 2 years ago

> Thanks for the support. Will you add another script for extracting 2D keypoints from custom datasets? Or add the extracted Toyota Smarthome dataset to your repo?

I currently have no such plan, but I will ping you once I upload the extracted Toyota Smarthome 2D poses.

Thoye commented 2 years ago

I have the same problem.

cici203 commented 2 years ago

I have the same problem.

xxxnhb commented 2 years ago

I have the same problem.

myeongjun-ds commented 2 years ago

I also have the same problem. The weird thing is that I can train the model with the training script (dist_train.sh), but when I use dist_run.sh the error above occurs.

samouha commented 1 year ago

Same problem! Are there any suggestions or solutions? I have also tried running the script for extracting 2D keypoints of custom datasets directly (without dist_run.sh), but that doesn't work either!

kennymckormick commented 1 year ago

I recently ran into this problem as well. My guess is that a newer version of gcc was used to compile the open-mmlab codebases, which leads to some errors. I have now fixed it by using a very specific conda environment for this project. Please follow the new installation guide to reinstall pyskl and see whether the problem is fixed.
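
When comparing an existing environment against the one from the installation guide, it can help to list the installed versions of the relevant packages first. The following is a minimal sketch using only the standard library; the package list is an assumption about what pyskl depends on, and the helper name `package_versions` is hypothetical.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; adjust to whatever the installation guide specifies.
PACKAGES = ["torch", "torchvision", "mmcv-full", "mmdet", "mmpose"]

print("python", platform.python_version())
for name, version in package_versions(PACKAGES).items():
    print(name, version or "not installed")
```

Posting this output alongside the error makes it easier to tell whether a mismatch between prebuilt binaries is a plausible cause.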

LijiaDong1220 commented 1 year ago

> I also have the same problem. The weird thing is that I can train the model with the training script (dist_train.sh), but when I use dist_run.sh the error above occurs.

Don't use dist_run.sh; that seems to work. The problem may be caused by running in distributed mode.
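
For anyone trying this: if the script still initializes torch.distributed from environment variables when started without the launcher (an assumption about the script, not confirmed in this thread), a single-process run needs the variables that the `env://` init method reads. The sketch below only builds such an environment; the helper name `single_process_dist_env` and the default port are illustrative choices, not pyskl settings.

```python
import os


def single_process_dist_env(port: int = 29500) -> dict:
    """Build an environment describing a one-process 'distributed' run.

    Hypothetical helper: it sets the variables read by torch.distributed's
    env:// initialization, plus LOCAL_RANK, which the launcher normally sets.
    """
    env = dict(os.environ)
    env.update(
        {
            "MASTER_ADDR": "127.0.0.1",  # rendezvous on the local machine
            "MASTER_PORT": str(port),    # any free port
            "RANK": "0",                 # this is the only process
            "WORLD_SIZE": "1",           # total number of processes
            "LOCAL_RANK": "0",           # first (and only) worker on this node
        }
    )
    return env


env = single_process_dist_env()
print({k: env[k] for k in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with whatever command you normally use to start the extraction script.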