Open bingnoi opened 2 years ago
2022-07-12 09:06:11,803 - pyskl - INFO - workflow: [('train', 1)], max: 24 epochs
2022-07-12 09:06:11,804 - pyskl - INFO - Checkpoints will be saved to /content/pyskl/work_dirs/posec3d/slowonly_r50_diving48/joint by HardDiskBackend.
tcmalloc: large alloc 1421410304 bytes == 0x16513c000 @ 0x7f96e32b4615 0x592b76 0x4df71e 0x59394f 0x5957cf 0x595b69 0x4e7b1f 0x4ebeeb 0x44f8bc 0x4e9074 0x4ebe42 0x4ec608 0x4eb932 0x4ec55d 0x4e9074 0x4ebe42 0x4ec55d 0x4e9074 0x4ebe42 0x44f841 0x4ec608 0x4e9074 0x4ebe42 0x55e1fa 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x593dd7
tcmalloc: large alloc 1421410304 bytes == 0x7f94e97fe000 @ 0x7f96e32b4615 0x592b76 0x4df71e 0x59394f 0x5957cf 0x595b69 0x4e7b1f 0x4ebeeb 0x44f8bc 0x4e9074 0x4ebe42 0x4ec608 0x4eb932 0x4ec55d 0x4e9074 0x4ebe42 0x4ec55d 0x4e9074 0x4ebe42 0x44f841 0x4ec608 0x4e9074 0x4ebe42 0x55e1fa 0x59afff 0x515655 0x549576 0x593fce 0x548ae9 0x51566f 0x593dd7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 634) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
tools/train.py FAILED
I am training on Colab (tesla-v4).
I'm not sure what is causing this problem. Maybe you can try running the experiment on another machine.
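For anyone hitting the same log: a negative exit code such as `exitcode: -9` means the worker was terminated by that signal number, and signal 9 is SIGKILL. On Linux this is often (though not always) the kernel's out-of-memory killer, which would be consistent with the two `tcmalloc: large alloc` lines of about 1.4 GB each, but nothing in this thread confirms it. The sketch below is only a diagnostic aid using the Python standard library. The helper names `describe_exit_code` and `available_memory_gib` are made up for illustration and are not part of pyskl or PyTorch.

```python
import os
import signal


def describe_exit_code(exit_code: int) -> str:
    """Turn a subprocess-style exit code into a readable description."""
    if exit_code < 0:
        # A negative code means the process was terminated by signal -exit_code.
        try:
            name = signal.Signals(-exit_code).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-exit_code}"
        return f"terminated by {name}"
    if exit_code == 0:
        return "exited normally"
    return f"exited with status {exit_code}"


def available_memory_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024**2
    raise ValueError("MemAvailable not found")


if __name__ == "__main__":
    # The exit code reported in the log above.
    print(describe_exit_code(-9))

    # Linux only: show how much host RAM is free before launching training.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            print(f"{available_memory_gib(f.read()):.1f} GiB available")
```

If the available memory turns out to be small compared with the dataset annotation file being loaded, a machine with more host RAM may help, in line with the suggestion above to try another machine.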