Closed ErikKrauter closed 7 months ago
After tinkering with this issue some more, I have noticed that this behavior is not caused by the ReplayMemory class; it also occurs in other scenarios. When the training is executed on a GPU that does not support CUDA, the script exits with an error, but the SLURM job hangs. When I terminate the SLURM job with Control+C, the following is printed. This again shows that the process is stuck waiting for communication. It seems parallel_runner.py does not handle errors correctly, leading to endless waits on inter-process communication.
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Error in atexit._run_exitfuncs:
Process Worker-1:
Traceback (most recent call last):
File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 113, in run
op, args, kwargs = self.worker_pipe.recv()
File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
I see. Unfortunately I don't think ManiSkill2 supports non-CUDA GPUs currently, but I think future versions of ManiSkill will.
Thank you for your reply. I think my additional comment caused some confusion. In my original issue I was using CUDA GPUs. Just for testing purposes I also tried training on non-CUDA GPUs, to see how errors are handled and whether the issue persists.
Is there a way to handle errors gracefully so that parallel_runner.py is not caught in an endless loop waiting for some inter-process communication?
Currently, if an error occurs, the run function in parallel_runner.py causes the process to hang indefinitely, which prevents the SLURM job from terminating. This blocks compute resources and prevents the next job from being scheduled.
I fixed the original issue by simply commenting out the assert statement here. Now no AssertionError is raised, the Python script terminates without errors, and the SLURM job does not hang. However, this does not address the root cause. I still understand neither how the error message from my original issue can be explained nor how it should be fixed:
Exception ignored in: <function ReplayMemory.__del__ at 0x7f5d4d905b80>
Traceback (most recent call last):
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 257, in __del__
self.close()
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 254, in close
self.file_loader.close()
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/file/cache_utils.py", line 494, in close
self.worker.call('close')
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 157, in call
self._send_info([self.CALL, [func_name, args], kwargs])
File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 145, in _send_info
assert self.item_in_pipe.value in [0]
AssertionError:
Yes, the fix should be fine. The reason for using this assert is to make sure there are no jobs still running when a new job is sent. But the file loader is asynchronous and is always trying to load files, so when we try to close the file-loading object, a loading job may still be running... The best fix is probably to skip this assertion when sending the close signal.
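A minimal sketch of what "skip the assertion for the close signal" could look like, assuming a proxy that tracks in-flight jobs with a shared counter as the quoted traceback suggests. The class and attribute names (`WorkerProxy`, `item_in_pipe`, `CALL`, `EXIT`) mirror the snippets quoted above but are otherwise illustrative, not ManiSkill2-Learn's actual API.

```python
import multiprocessing as mp


class WorkerProxy:
    """Illustrative proxy: enforce "no job in flight" only for normal calls."""
    CALL, EXIT = 0, 1

    def __init__(self, pipe):
        self.pipe = pipe
        # Shared counter of messages sent but not yet consumed by the worker.
        self.item_in_pipe = mp.Value("i", 0)

    def _send_info(self, info, closing=False):
        if not closing:
            # Regular jobs must wait until the previous one was consumed.
            assert self.item_in_pipe.value == 0, "previous job still in pipe"
        with self.item_in_pipe.get_lock():
            self.item_in_pipe.value += 1
        self.pipe.send(info)

    def call(self, func_name, *args, **kwargs):
        self._send_info([self.CALL, [func_name, args], kwargs])

    def close(self):
        # Bypass the assert: an async file-loading job may still be running,
        # and close() must go through regardless.
        self._send_info([self.EXIT, [], {}], closing=True)


if __name__ == "__main__":
    parent, child = mp.Pipe()
    proxy = WorkerProxy(parent)
    proxy.call("load")   # item_in_pipe -> 1
    proxy.close()        # would raise AssertionError without closing=True
    print(proxy.item_in_pipe.value)  # 2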
I train a PPO+DAPG agent on 5 GPUs in parallel, with 3 processes per GPU for rollouts. I use SLURM as the cluster management system to schedule sbatch jobs.
The issue I encounter is that the sbatch job never terminates even if the training is done. The error log contains the following messages:
To debug the issue I tried running the same training with only one GPU (still 3 processes for rollout) in SLURM interactive debug mode. The issue persists (same error message as above). When I manually terminate the job with Control+C the following is printed to the terminal:
Even if I only use one GPU with only one process for rollouts, I observe exactly the same behavior.
It seems to me that the problem is likely related to the ReplayMemory class and its interaction with the Worker class in parallel_runner.py. This interaction fails during cleanup, potentially due to resource management or IPC handling within these classes. The fact that the process hangs waiting for pipe communication (recv()) suggests a deadlock or synchronization issue. It is possible that the cleanup process is not executed correctly, leaving a worker process indefinitely waiting for a signal or data.
The training itself works; only the termination does not. Do you have any advice on how to fix the issue, or can you propose a workaround to make the SLURM job terminate after the training completes?