kennymckormick / pyskl

A toolbox for skeleton-based action recognition.
Apache License 2.0

error when training on diving48 #58

Open bingnoi opened 2 years ago

bingnoi commented 2 years ago
```
2022-07-12 09:06:11,803 - pyskl - INFO - workflow: [('train', 1)], max: 24 epochs
2022-07-12 09:06:11,804 - pyskl - INFO - Checkpoints will be saved to /content/pyskl/work_dirs/posec3d/slowonly_r50_diving48/joint by HardDiskBackend.
tcmalloc: large alloc 1421410304 bytes == 0x16513c000 @  0x7f96e32b4615 0x592b76 0x4df71e 0x59394f 0x5957cf 0x595b69 0x4e7b1f 0x4ebeeb 0x44f8bc 0x4e9074 0x4ebe42 0x4ec608 0x4eb932 0x4ec55d 0x4e9074 0x4ebe42 0x4ec55d 0x4e9074 0x4ebe42 0x44f841 0x4ec608 0x4e9074 0x4ebe42 0x55e1fa 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x593dd7
tcmalloc: large alloc 1421410304 bytes == 0x7f94e97fe000 @  0x7f96e32b4615 0x592b76 0x4df71e 0x59394f 0x5957cf 0x595b69 0x4e7b1f 0x4ebeeb 0x44f8bc 0x4e9074 0x4ebe42 0x4ec608 0x4eb932 0x4ec55d 0x4e9074 0x4ebe42 0x4ec55d 0x4e9074 0x4ebe42 0x44f841 0x4ec608 0x4e9074 0x4ebe42 0x55e1fa 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x593dd7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 634) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
tools/train.py FAILED
```

Training on Colab (Tesla T4).
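
For readers hitting the same failure: `exitcode: -9` means the training child process was killed with SIGKILL, which on Colab is most often the kernel's out-of-memory killer; the `tcmalloc: large alloc` lines (about 1.4 GB each) likewise point to host RAM running out while the annotations are loaded. A quick way to see how much RAM the VM actually has is sketched below (`psutil` is generally preinstalled on Colab; the snippet is illustrative and not from the original report):

```python
# Illustrative check (not part of the original report): print host RAM before
# launching distributed training. If "available" is not comfortably above the
# ~1.4 GB allocations shown in the tcmalloc lines, an OOM kill is plausible.
import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 2**30:.1f} GiB, available: {mem.available / 2**30:.1f} GiB")
```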

kennymckormick commented 2 years ago

Oh, I'm not sure what this problem could be; maybe you can try running the experiment on another machine.
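
A hedged note for anyone who lands here with the same crash: if the SIGKILL is indeed an out-of-memory kill, shrinking the dataloader footprint in the config may be enough to get past it. The sketch below assumes the diving48 config follows pyskl's usual mmcv-style layout (`videos_per_gpu` and `workers_per_gpu` are the standard pyskl keys; the values shown are illustrative, not the repository defaults):

```python
# Sketch of a memory-friendlier dataloader section for the diving48 config.
# Keys mirror pyskl's standard mmcv-style config layout; values are illustrative.
data = dict(
    videos_per_gpu=4,     # smaller per-GPU batch -> smaller host-side buffers
    workers_per_gpu=1,    # each extra worker process adds its own memory overhead
    test_dataloader=dict(videos_per_gpu=1),
)
```

Reducing `workers_per_gpu` tends to have the largest effect on host RAM, since each dataloader worker is a separate process with its own memory footprint.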