real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

EOF Error in "async_vector_env.py" #36

Closed by jehanyang 7 months ago

jehanyang commented 7 months ago

Hi, I am trying to run the command in the README.md for Reproducing Simulation Benchmark Results:



============= Initialized Observation Utils with Obs Spec =============

using obs modality: low_dim with keys: ['agent_pos']
using obs modality: rgb with keys: ['image']
using obs modality: depth with keys: []
using obs modality: scan with keys: []
/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
[2023-12-02 23:07:29,677][diffusion_policy.model.diffusion.conditional_unet1d][INFO] - number of parameters: 2.515119e+08
Diffusion params: 2.515119e+08
Vision params: 1.119709e+07
pygame 2.1.2 (SDL 2.0.16, Python 3.9.15)
Hello from the pygame community. https://www.pygame.org/contribute.html
wandb: Currently logged in as: jehanyang (jehan_testcrew). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.3
wandb: Run data is saved locally in /home/projectimit/diffusion_project/diffusion_policy/data/outputs/2023.12.02/23.07.27_train_diffusion_unet_hybrid_pusht_image/wandb/run-20231202_230734-1g8u9a71
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2023.01.16-20.20.06_train_diffusion_unet_hybrid_pusht_image
wandb: ⭐️ View project at https://wandb.ai/jehan_testcrew/diffusion_policy_debug
wandb: 🚀 View run at https://wandb.ai/jehan_testcrew/diffusion_policy_debug/runs/1g8u9a71
Process Worker<AsyncVectorEnv>-55:
Killed
(robodiff) projectimit@RCHI-CPU-4:~/diffusion_project/diffusion_policy$ Traceback (most recent call last):
  File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 622, in _worker_shared_memory
    command, data = pipe.recv()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 669, in _worker_shared_memory
    pipe.send((None, False))
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process Worker<AsyncVectorEnv>-54:
Traceback (most recent call last):
  File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 622, in _worker_shared_memory
    command, data = pipe.recv()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError

The above block repeats about 50 times.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/projectimit/diffusion_project/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 669, in _worker_shared_memory
    pipe.send((None, False))
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread MsgRouterThr:
Traceback (most recent call last):
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
    msg = self._read_message()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/site-packages/wandb/sdk/interface/router_queue.py", line 36, in _read_message
    msg = self._response_queue.get(timeout=1)
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/queues.py", line 117, in get
    res = self._recv_bytes()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 217, in recv_bytes
    self._check_closed()
  File "/home/projectimit/miniforge3/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 141, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

jehanyang commented 7 months ago

This seems to have been caused by a lack of RAM. I restarted my computer, saw that around 8 GB of RAM had been freed, and ran the training again. Training has now reached the 185th epoch, whereas it previously only got to the 50th.
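For anyone who wants to check free memory before launching a run, here is a minimal sketch that reads `MemAvailable` from `/proc/meminfo` (Linux only). The helper names `parse_meminfo` and `available_gib` and the `MIN_FREE_GIB` threshold are assumptions made up for illustration. They are not part of diffusion_policy, and the threshold is not a measured requirement.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Most fields are reported by the kernel in kB.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text: str) -> float:
    """Return the MemAvailable field converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # /proc/meminfo only exists on Linux
        free = available_gib(meminfo.read_text())
        # Hypothetical threshold; tune it for your own machine and config.
        MIN_FREE_GIB = 16.0
        status = "ok" if free >= MIN_FREE_GIB else "low"
        print(f"MemAvailable: {free:.1f} GiB ({status})")
```

Running this right before training gives a rough idea of whether the machine is already short on memory, which is what the restart appeared to fix here.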

zxfever commented 2 days ago

Hello, I'm facing the same error. Have you fixed it? Was it caused by a lack of RAM? How much RAM is needed?

jehanyang commented 6 hours ago

I don't remember how much RAM was used, but my solution, as stated previously, was to restart the computer. I'm not sure exactly what allowed it to work after restarting, but I did note that 8 GB more RAM was available afterwards.
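One way to tell whether a process was killed from outside rather than crashing on its own is to look at its exit code. The bare `Killed` line in the log above is what a shell typically prints when a process receives SIGKILL, and the Linux out-of-memory killer is a common sender of that signal, though not the only possible one. The sketch below relies on the POSIX convention that `subprocess` reports a child terminated by signal N as return code -N (`multiprocessing.Process.exitcode` follows the same convention). The function name `describe_exit` is hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a POSIX child return code into a readable description."""
    if returncode < 0:
        # Negative values mean the child was terminated by that signal.
        sig = signal.Signals(-returncode)
        return f"terminated by {sig.name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Demo: start a child that sends SIGKILL to itself, then inspect it.
    child = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    result = subprocess.run([sys.executable, "-c", child])
    print(describe_exit(result.returncode))
```

If the exit code does point to SIGKILL, the kernel log (for example via `dmesg`) usually records whether the out-of-memory killer was responsible.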