real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

diffusion_policy crashes when executing the reset_async function in async_vector_env.py from the gym project #42

Open tll1945-eng opened 9 months ago

tll1945-eng commented 9 months ago

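I rented a V100 on Alibaba Cloud. After installing diffusion_policy there, I followed the project's instructions under "Running for a single seed" and ran python train.py --config-dir=. --config-name=image_pusht_diffusion_policycnn.yaml training.seed=42 training.device=cuda:0 hydra.run.dir='data/outputs/${now:%Y.%m.%d}/${now:%H.%M.%S}${name}_${task_name}'. The program trains normally for only one round. Once that round finishes and the program calls the reset_async function in async_vector_env.py from the gym project, it crashes. Is some statement in the run(self, policy: BaseImagePolicy) function in pusht_image_runner.py written incorrectly, causing the failure, so that the source code has to be modified before it runs properly? Or is a single V100 simply unable to run diffusion_policy, which would cause the failure described above?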

Lijinzh commented 6 months ago

I get the same error on a 4090 GPU. The computer crashes after training for up to 2000 epochs, and I have to reset it to get it running again.

abcdsaltfish commented 1 month ago

I encountered this problem too.


Exception ignored in:
Traceback (most recent call last):
  File "/root/mambaforge/envs/robodiff/lib/python3.9/site-packages/gym/vector/vector_env.py", line 139, in __del__
    self.close(terminate=True)
  File "/root/mambaforge/envs/robodiff/lib/python3.9/site-packages/gym/vector/vector_env.py", line 121, in close
    self.close_extras(**kwargs)
  File "/opt/data/private/diffusionp/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 327, in close_extras
    function(timeout)
  File "/opt/data/private/diffusionp/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 290, in step_wait
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
  File "/opt/data/private/diffusionp/diffusion_policy/diffusion_policy/gym_util/async_vector_env.py", line 290, in
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
  File "/root/mambaforge/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/root/mambaforge/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/root/mambaforge/envs/robodiff/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:

My GPU information is as follows


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:B1:00.0 Off |                  N/A |
| 37%   37C    P8    24W / 350W |      0MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

abcdsaltfish commented 1 month ago

Solved. Just request more RAM, e.g. 48 GB.
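A quick way to check how much host memory is actually available before launching training is sketched below. This is only a sketch that assumes a Linux machine exposing /proc/meminfo. The function names mem_available_gib and has_enough_memory are hypothetical and not part of diffusion_policy, and the 48 GiB default simply mirrors the figure reported in this comment rather than any documented requirement.

```python
import os


def mem_available_gib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable entry not found")


def has_enough_memory(meminfo_text, required_gib=48.0):
    """Hypothetical helper: True if available host RAM meets the threshold.

    The 48 GiB default is taken from this thread, not from the project docs.
    """
    return mem_available_gib(meminfo_text) >= required_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        text = f.read()
    print(f"MemAvailable: {mem_available_gib(text):.1f} GiB")
    print("enough for the run:", has_enough_memory(text))
```

Note that this checks host RAM, not GPU memory. The nvidia-smi output above shows GPU memory only, so it would not reveal a host-memory shortage.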

This is the same as issue 36. See EOF Error in "async_vector_env.py" #36.
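For readers wondering why a memory shortage would surface as an EOFError: in Python's multiprocessing, Connection.recv() raises EOFError when the other end of the pipe has been closed and there is nothing left to read. If a worker process dies (for example, if the operating system kills it for using too much memory, which is an assumption here that would be consistent with the fix reported in this thread), its end of the pipe is closed and the parent's pipe.recv() fails in exactly this way. The minimal sketch below reproduces the mechanism by closing the worker end by hand. The function name recv_after_peer_closed is hypothetical, and no real worker process or gym code is involved.

```python
import multiprocessing as mp


def recv_after_peer_closed():
    """Show what the parent sees when the worker's end of a pipe is gone."""
    parent_conn, worker_conn = mp.Pipe()
    # Closing the worker end by hand stands in for a worker process that
    # died: when a process exits, the OS closes its end of the pipe.
    worker_conn.close()
    try:
        parent_conn.recv()
    except EOFError:
        outcome = "EOFError"
    else:
        outcome = "received"
    finally:
        parent_conn.close()
    return outcome


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

The EOFError itself says only that the worker is gone, not why it died, so the underlying cause has to be found elsewhere, such as in system logs or memory monitoring.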