Thank you for your work! I am trying to run the stage-2 code and hit the failure below. The process exits with code -9 (SIGKILL), which, according to the documentation, points to insufficient memory. I have already reduced SOLVER.IMS_PER_BATCH to 4, but it still fails. Could you please advise on how to resolve this?
saved triplet length cov. 5334
100%|███████████████████████████████████████████████████████████████████████████████████| 25/25 [00:02<00:00, 9.97it/s]
100%|██████████████████████████████████████████████████████████████████████████| 28523/28523 [00:00<00:00, 60861.11it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2155) of binary: /opt/conda/envs/DRM/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/DRM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/DRM/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/DRM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/relation_train_net.py FAILED
---
Failures:
<NO_OTHER_FAILURES>
---
Root Cause (first observed failure):
[0]:
time : 2024-11-01_15:22:43
host : ts-8b1d810992dce6c10192e66e47bc1511-launcher.ts-8b1d810992dce6c10192e66e47bc1511-launcher.map-tlab-group1-chongqing-3.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 2155)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2155
============================================================
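For context on diagnosing this: exit code -9 means the process received SIGKILL, which on Linux is typically the kernel OOM killer reclaiming host RAM (not GPU memory), so lowering the GPU batch size alone may not help. A quick sanity check is to look at available host memory right before launching training. A minimal sketch, assuming a Linux host where /proc/meminfo is readable:

```python
def available_ram_gb(path="/proc/meminfo"):
    """Return available host RAM in GiB, or None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                # /proc/meminfo reports values in kB, e.g. "MemAvailable: 12345678 kB"
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / 1024**2  # kB -> GiB
    except OSError:
        pass
    return None

gb = available_ram_gb()
if gb is not None:
    print(f"Available host RAM: {gb:.1f} GiB")
```

If host RAM is indeed the bottleneck, reducing the number of dataloader worker processes (e.g. via a DATALOADER.NUM_WORKERS config key, if this project exposes one) often helps more than shrinking IMS_PER_BATCH, since each worker holds its own copy of the dataset structures; checking `dmesg` for "Killed process" entries can confirm the OOM killer was involved.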