irom-lab / dppo

Official implementation of Diffusion Policy Policy Optimization, arXiv 2024
MIT License

Crashes for Unet finetuning #4

Open edwardjjj opened 4 days ago

edwardjjj commented 4 days ago

Hi @allenzren. I'm trying to reproduce the fine-tuning results of the UNet diffusion policy on Furniture-Bench, and I get these crashes after around 30 iterations.

Error executing job with overrides: ['train.val_freq=50', 'train.n_train_itr=500']
Traceback (most recent call last):
  File "script/train.py", line 87, in main
    agent.run()
  File "/home/edward/projects/dppo/agent/finetune/train_ppo_diffusion_agent.py", line 126, in run
    obs_venv, reward_venv, done_venv, info_venv = self.venv.step(
  File "/home/edward/projects/dppo/env/gym_utils/wrapper/furniture.py", line 124, in step
    obs, sparse_reward, dense_reward, done, info = self._inner_step(action)
  File "/home/edward/projects/dppo/env/gym_utils/wrapper/furniture.py", line 147, in _inner_step
    obs, reward, done, info = self.env.step(action_chunk[:, i, :])
  File "/home/edward/anaconda3/envs/dppo/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/edward/projects/dppo/furniture-bench/furniture_bench/envs/furniture_rl_sim_env.py", line 2017, in step
    obs = self.get_observation()
  File "/home/edward/projects/dppo/furniture-bench/furniture_bench/envs/furniture_rl_sim_env.py", line 1293, in get_observation
    robot_state = self._read_robot_state()
  File "/home/edward/projects/dppo/furniture-bench/furniture_bench/envs/furniture_rl_sim_env.py", line 1160, in _read_robot_state
    ee_pos, ee_quat = self.get_ee_pose()
  File "/home/edward/projects/dppo/furniture-bench/furniture_bench/envs/furniture_rl_sim_env.py", line 1211, in get_ee_pose
    hand_pos = self.rb_states[self.ee_idxs, :3]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Do you have some idea what might have gone wrong? Thank you very much.

allenzren commented 4 days ago

Hi @edwardjjj, I actually haven't seen this error before. The only issue I am aware of has to do with NaN observations in Furniture-Bench, but that error looks different from this one. Is the GPU running out of memory?

Can you share which config you are running? I can try looking into it, but I doubt I can reproduce the error. In the meantime, I suggest re-running with CUDA_LAUNCH_BLOCKING=1 to see whether the same error shows up.
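
A minimal sketch of how that could be done (illustrative only, not part of the repo; the variables have to be set before torch or isaacgym create the CUDA context, e.g. at the very top of script/train.py or in the shell before launching):

```python
# Illustrative only: make CUDA errors surface at the failing call and get the
# full Hydra traceback. Must run before torch / isaacgym initialize CUDA.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches
os.environ["HYDRA_FULL_ERROR"] = "1"      # complete Hydra stack trace
```

With synchronous launches the traceback should point at the kernel that actually faults, instead of whatever call happened to observe the error later.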

allenzren commented 4 days ago

@ankile, Lars, any idea about this error?

edwardjjj commented 2 days ago

Thank you for getting back to me. After further investigation, we found that these crashes always happen on the same node, so it is possibly a driver-version issue. I'll share my findings after more tests.

allenzren commented 1 day ago

I see, let me know if you find anything. I would be curious to know. Thanks!

ankile commented 1 day ago

Hi @edwardjjj and @allenzren, I've seen this type of error in the past, but I haven't found a reliable way to reproduce it. However, it does seem to happen more frequently with a very large number of parallel environments (e.g., 2048 envs vs. 1024), so I'm guessing it's an issue of overflowing buffers or some other kind of memory problem.
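
One way to probe that hypothesis would be a bounds check right before the indexing that appears at the bottom of the traceback. This is a hedged sketch only: `rb_states` and `ee_idxs` are taken from the traceback and assumed to be CUDA tensors, and the helper name is made up for illustration.

```python
import torch


def check_ee_indices(rb_states: torch.Tensor, ee_idxs) -> None:
    """Hypothetical helper: fail loudly if ee_idxs would index outside rb_states.

    Intended to be called just before `hand_pos = self.rb_states[self.ee_idxs, :3]`
    so a delayed illegal-memory-access turns into an immediate, readable error.
    """
    idx = torch.as_tensor(ee_idxs, device=rb_states.device, dtype=torch.long)
    n_bodies = rb_states.shape[0]
    if idx.numel() == 0:
        raise RuntimeError("ee_idxs is empty")
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= n_bodies:
        raise RuntimeError(
            f"ee_idxs out of range: min={lo}, max={hi}, "
            f"but rb_states only has {n_bodies} rows"
        )
```

If the check never fires while the crash still occurs, the corruption is probably happening earlier (e.g. inside the simulator's own buffers) rather than at this indexing step.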