haosulab / ManiSkill2-Learn


Training exits with AssertionError: assert self.item_in_pipe.value in [0] #22

Closed ErikKrauter closed 7 months ago

ErikKrauter commented 8 months ago

I train a PPO+DAPG agent on 5 GPUs in parallel, with 3 rollout processes per GPU. I use SLURM as the cluster management system and schedule the runs as sbatch jobs.

The issue I encounter is that the sbatch job never terminates, even after training has finished. The error log contains the following messages:

Exception ignored in: <function ReplayMemory.__del__ at 0x7f5d4d905b80>
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 257, in __del__
    self.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 254, in close
    self.file_loader.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/file/cache_utils.py", line 494, in close
    self.worker.call('close')
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 157, in call
    self._send_info([self.CALL, [func_name, args], kwargs])
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 145, in _send_info
    assert self.item_in_pipe.value in [0]
AssertionError:

To debug the issue I tried running the same training with only one GPU (still 3 processes for rollout) in SLURM interactive debug mode. The issue persists (same error message as above). When I manually terminate the job with Control+C the following is printed to the terminal:

Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 113, in run
    op, args, kwargs = self.worker_pipe.recv()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/util.py", line 224, in _call_
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/heap.py", line 278, in free
AttributeError: 'NoneType' object has no attribute 'getpid'
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/util.py", line 224, in _call_
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/heap.py", line 278, in free
AttributeError: 'NoneType' object has no attribute 'getpid'
Exception ignored in: <Finalize object, dead>

Even if I only use one GPU with a single rollout process, I observe the exact same behavior.

TurnFaucet-v0-train - (train_rl.py:402) - INFO - 2023-11-27,15:42:21 - Save checkpoint at final step 100. The model will be saved at Experiments/Debug2/models/model_final.ckpt.
Exception ignored in: <function ReplayMemory.__del__ at 0x7f3e76e4fdc0>
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 257, in _del_
    self.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 254, in close
    self.file_loader.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/file/cache_utils.py", line 494, in close
    self.worker.call('close')
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 157, in call
    self._send_info([self.CALL, [func_name, args], kwargs])
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 145, in _send_info
    assert self.item_in_pipe.value in [0]
AssertionError:
TurnFaucet-v0-train - (run_rl.py:451) - INFO - 2023-11-27,15:42:36 - Close evaluator object
TurnFaucet-v0-train - (run_rl.py:454) - INFO - 2023-11-27,15:42:36 - Close rollout object
TurnFaucet-v0-train - (run_rl.py:457) - INFO - 2023-11-27,15:42:36 - Delete replay buffer
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
Process Worker-1:
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 113, in run
    op, args, kwargs = self.worker_pipe.recv()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

It seems to me that the problem is related to the ReplayMemory class and its interaction with the Worker class in parallel_runner.py: this interaction fails during cleanup, potentially due to resource management or IPC handling within these classes. The fact that the process hangs waiting on pipe communication (recv()) suggests a deadlock or synchronization issue. It is possible that the cleanup is not executed correctly, leaving a worker process waiting indefinitely for a signal or data.
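
To make that failure mode concrete, here is a minimal, self-contained sketch of the pattern described above; all names are hypothetical and this is not the actual ManiSkill2-Learn code. A parent tracks in-flight messages with a shared counter and asserts that it is zero before sending; if close() is triggered (for example from __del__) while the worker is still busy, the assertion fires exactly as in the log:

import multiprocessing as mp
import time

CALL, EXIT = 0, 1

def worker_loop(pipe, item_in_pipe):
    # Drain commands from the parent; only mark a message as done after the
    # (possibly slow) job has finished, mirroring an asynchronous file loader.
    while True:
        op, payload = pipe.recv()          # blocks until the parent sends something
        if op == CALL:
            time.sleep(0.5)                # simulate a slow asynchronous file load
        with item_in_pipe.get_lock():
            item_in_pipe.value -= 1
        if op == EXIT:
            break

class ParentHandle:
    def __init__(self):
        self.pipe, child_pipe = mp.Pipe()
        self.item_in_pipe = mp.Value('i', 0)
        self.proc = mp.Process(target=worker_loop,
                               args=(child_pipe, self.item_in_pipe))
        self.proc.start()

    def send(self, op, payload=None):
        # Mirrors the failing assertion: if the worker has not finished the
        # previous message, the counter is not 0 and the assert fires.
        assert self.item_in_pipe.value in [0]
        with self.item_in_pipe.get_lock():
            self.item_in_pipe.value += 1
        self.pipe.send((op, payload))

if __name__ == "__main__":
    h = ParentHandle()
    h.send(CALL, "load_file")       # worker is now busy for ~0.5 s
    try:
        h.send(EXIT)                # fails like close() called from __del__
    except AssertionError:
        print("AssertionError: worker still has an item in the pipe")
        time.sleep(1.0)             # let the worker drain the pipe
        h.send(EXIT)                # counter is back to 0, EXIT goes through
    h.proc.join()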

The training itself still works; only the termination is broken. Do you have any advice on how to fix the issue, or can you propose a workaround to make the SLURM job terminate once training has completed?
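
For reference, one blunt workaround sketch (not something ManiSkill2-Learn provides; finish_and_exit is a hypothetical helper): once the final checkpoint is safely on disk, terminate the interpreter without running atexit handlers or waiting on child processes, so a stuck worker cannot keep the SLURM job alive:

import os
import sys

def finish_and_exit(exit_code: int = 0):
    # Flush whatever should still appear in the logs, then terminate the
    # process immediately. os._exit() skips atexit handlers and __del__
    # cleanup and does not wait for child processes, so only call it after
    # the final checkpoint has been written.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(exit_code)

# e.g. as the last statement of the training entry point:
# finish_and_exit(0)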

ErikKrauter commented 8 months ago

After tinkering with this issue some more, I have noticed that the behavior is not caused by the ReplayMemory class; it also occurs in other scenarios. When I run the training on a GPU that does not support CUDA, the script exits with an error, but the SLURM job still hangs. It seems that parallel_runner.py does not handle errors correctly, leaving worker processes blocked in waits that never return (see the sketch of a more defensive worker loop after the traceback below). When I terminate the SLURM job with Control+C, the following is printed, which again shows that the process is stuck waiting for communication:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Error in atexit._run_exitfuncs:
Process Worker-1:
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 113, in run
    op, args, kwargs = self.worker_pipe.recv()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/scratch_net/got/ekrauter/conda_envs/lab_mani_skill2/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
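
For reference, here is the sketch of a more defensive worker receive loop mentioned above. It assumes, based on the traceback, that the worker receives (op, args, kwargs) tuples over a multiprocessing pipe; the names and the EXIT sentinel are hypothetical, and this is not the actual parallel_runner.py code. Polling with a timeout and checking whether the parent is still alive avoids blocking forever in recv():

import multiprocessing as mp

EXIT = "exit"

def defensive_worker_loop(pipe):
    # Poll the pipe with a timeout instead of blocking forever in recv(), and
    # stop if the parent process has died without sending an exit signal.
    parent = mp.parent_process()            # Python 3.8+; None in the main process
    while True:
        if pipe.poll(timeout=1.0):          # wait at most 1 s for a message
            op, args, kwargs = pipe.recv()
            if op == EXIT:
                break
            # ... dispatch op to the actual work here ...
        elif parent is not None and not parent.is_alive():
            break                           # parent is gone; don't wait forever
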
xuanlinli17 commented 8 months ago

I see. Unfortunately, I don't think ManiSkill2 currently supports non-CUDA GPUs, but I think future versions of ManiSkill will.

ErikKrauter commented 8 months ago

Thank you for your reply. I think my additional comment caused some confusion: in my original issue I was using CUDA GPUs. I only tried training on non-CUDA GPUs as a test, to see how errors are handled and whether the issue persists.

Is there a way to handle errors gracefully so that parallel_runner.py does not get caught in an endless loop waiting for inter-process communication?

Currently, if an error occurs, the run function in parallel_runner.py leaves the process hanging indefinitely, which prevents the SLURM job from terminating. This blocks compute resources and prevents the next job from being scheduled.
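
One possible direction (an assumption, not something parallel_runner.py currently does): start helper processes as daemons so the interpreter does not wait for them at shutdown. Note that daemon processes are terminated abruptly when the parent exits, so any pending work in the worker would be lost. A minimal sketch with a hypothetical helper:

import multiprocessing as mp

def start_daemon_worker(target, *args):
    # Daemon children are terminated automatically when the parent exits, so
    # a crash in the main training process cannot leave a worker blocking the
    # SLURM job.
    proc = mp.Process(target=target, args=args, daemon=True)
    proc.start()
    return proc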

I worked around the original issue by simply commenting out the assert statement here. Now no AssertionError is raised, the Python script terminates without errors, and the SLURM job does not hang. However, this does not address the root cause: I still do not understand how the error message from my original issue can be explained, nor how it should properly be fixed:

Exception ignored in: <function ReplayMemory.__del__ at 0x7f5d4d905b80>
Traceback (most recent call last):
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 257, in __del__
    self.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/env/replay_buffer.py", line 254, in close
    self.file_loader.close()
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/file/cache_utils.py", line 494, in close
    self.worker.call('close')
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 157, in call
    self._send_info([self.CALL, [func_name, args], kwargs])
  File "/scratch_net/got/ekrauter/Masterthesis/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 145, in _send_info
    assert self.item_in_pipe.value in [0]
AssertionError: 
lz1oceani commented 8 months ago

Yes, the fix should be fine. The reason for using this assert is to make sure there are no running jobs when a new job is sent. But the file loader is asynchronous and is always trying to load files, so when we try to close the file-loading object, a loading job may still be running... The best fix is probably to avoid this assertion when sending the close signal.
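
A sketch of that idea with hypothetical names and a free-function signature (not a patch against the actual _send_info in parallel_runner.py): tolerate an in-flight item when the message being sent is the exit/close signal, waiting briefly for the asynchronous loader to drain instead of asserting that the pipe is empty:

import time

def send_info_tolerant(pipe, item_in_pipe, item, is_exit=False, drain_timeout=5.0):
    # Send `item` through `pipe` while tracking in-flight messages in the
    # shared counter `item_in_pipe`. For the exit/close signal, wait briefly
    # for the asynchronous worker to finish its current job instead of
    # asserting that the pipe is already empty.
    if is_exit:
        deadline = time.time() + drain_timeout
        while item_in_pipe.value != 0 and time.time() < deadline:
            time.sleep(0.01)
    else:
        assert item_in_pipe.value in [0]
    with item_in_pipe.get_lock():
        item_in_pipe.value += 1
    pipe.send(item)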