Open Gigi-Al opened 2 years ago
Sorry, I failed to figure out what the problem is. It seems to be a PyTorch problem.
Thanks for the support. Will you add any other file for extracting 2D keypoints of custom datasets? Or add extracted Toyota Smarthome dataset to your repo?
Currently I do not have such a plan. But I will ping you once I upload the extracted Toyota Smarthome 2D poses.
I have the same problem.
I also have the same problem. The weird thing is that I can train the model using the training script (I mean I can train using dist_train.sh). But when I use dist_run.sh, the above error message occurs.
Same problem! Are there any suggestions or solutions? I have also tried running the script for extracting 2D keypoints of custom datasets directly (without dist_run.sh), but that doesn't work either!
Recently I also ran into this problem. I guess the potential reason is that a new version of gcc was used to compile the OpenMMLab codebases, which leads to some errors. I have now fixed it by using a very specific conda environment for this project. Please follow the new installation guide to reinstall pyskl and see if the problem has been fixed.
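To compare a failing environment against the one described in the installation guide, it can help to list the installed versions of the relevant packages first. A minimal sketch follows; the package names in PACKAGES are an assumption about what this project depends on (adjust them to match the guide), and report_versions is a hypothetical helper, not part of pyskl.

```python
from importlib import metadata

# Assumed package names; edit this tuple to match the installation guide.
PACKAGES = ("torch", "torchvision", "mmcv-full", "mmdet", "mmpose")


def report_versions(packages=PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output together with the gcc version makes it easier to tell whether a mismatch with the guide's environment is the likely cause.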
Don't use 'dist_run.sh'; without it, it seems to work... This problem may be caused by the distributed operation.
Hello,
Thank you for sharing this great repository. I want to train and test the model on other datasets; however, when I use diving48_example.ipynb to extract pose data from Diving48 following your instructions, I encounter this error:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 234913) of binary: /usr/bin/python3.8
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/galinezh/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/galinezh/pyskl/tools/data/custom_2d_skeleton.py FAILED
Failures:
[1]:
  time      : 2022-06-13_23:04:55
  host      : coe54000151lws.dyn.uncc.edu
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 234914)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234914
[2]:
  time      : 2022-06-13_23:04:55
  host      : coe54000151lws.dyn.uncc.edu
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 234915)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234915
[3]:
  time      : 2022-06-13_23:04:55
  host      : coe54000151lws.dyn.uncc.edu
  rank      : 3 (local_rank: 3)
  exitcode  : -11 (pid: 234916)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234916
Root Cause (first observed failure):
[0]:
  time      : 2022-06-13_23:04:55
  host      : coe54000151lws.dyn.uncc.edu
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 234913)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 234913
Can anybody help me solve this issue? Thanks!
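A note for anyone reading the log: a negative exit code from a worker means it was killed by a signal, so exitcode -11 is signal 11 (SIGSEGV), which points to a crash in native code rather than a Python exception. The sketch below decodes such an exit code and shows Python's built-in faulthandler module dumping the Python-level stack when a child process segfaults. describe_exit_code is a hypothetical helper, and the child process sends itself SIGSEGV only to simulate the crash (POSIX only); it is not what custom_2d_skeleton.py does.

```python
import signal
import subprocess
import sys


def describe_exit_code(code):
    """Translate a subprocess return code into a readable description."""
    if code < 0:
        # On POSIX, a return code of -N means "killed by signal N".
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


# Child program: enable faulthandler, then simulate a native crash.
CHILD = (
    "import faulthandler, os, signal\n"
    "faulthandler.enable()\n"
    "os.kill(os.getpid(), signal.SIGSEGV)\n"
)

if __name__ == "__main__":
    proc = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    print(describe_exit_code(proc.returncode))
    # faulthandler writes the Python stack of the crashing process to stderr.
    print(proc.stderr)
```

The same dump can be requested without editing any code by setting the environment variable PYTHONFAULTHANDLER=1 before launching, which should also reach the worker processes as long as the launcher passes its environment on to them. The resulting stack usually shows which import or call was active when the segfault happened.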