Open edwardjjj opened 4 days ago
Hi @edwardjjj, I actually haven't seen this error before. The only issue I am aware of has to do with NaN observations in Furniture-Bench, but that error looks different from this one. Is the GPU out of memory?
Can you share which config you are running? I can try looking into it but I doubt that I can reproduce the error. Meanwhile I suggest you re-run it with CUDA_LAUNCH_BLOCKING=1 and see if the same error shows up.
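In case it helps, here is a minimal sketch of one way to set CUDA_LAUNCH_BLOCKING=1 for a single run without changing the shell environment. The helper name `run_with_blocking_launches` is made up for illustration, and the child command below is only a stand-in for whatever training command you normally use. With the variable set, CUDA kernel launches run synchronously, so the traceback should point at the call that actually failed.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run `cmd` in a child process with CUDA_LAUNCH_BLOCKING=1 set.

    `cmd` is a list of arguments, e.g. the usual training command.
    The parent environment is copied, not modified.
    """
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return subprocess.run(cmd, env=env, check=False)


# Stand-in child command: it only echoes the variable so we can see
# that the setting reached the child process.
demo_cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))",
]
result = run_with_blocking_launches(demo_cmd)
```

Setting the variable in the environment of the launched process (rather than inside the training script) avoids any question of whether it was set before CUDA was initialized. Expect the run to be slower while it is enabled.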
@ankile, Lars, any idea about this error?
Thank you for getting back to me. After further investigation, we found that these crashes always happen on the same node. So it is possibly a driver version issue. I'll share my findings after more tests.
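For comparing nodes, something like the sketch below could be logged at the start of each run. It is only an illustration: `node_fingerprint` is a hypothetical helper, and it assumes `nvidia-smi` is on the PATH of the compute node (it reports `None` for the driver otherwise).

```python
import shutil
import socket
import subprocess


def node_fingerprint():
    """Return the hostname and, if nvidia-smi is available, the driver version(s)."""
    info = {"host": socket.gethostname(), "driver": None}
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                check=False,
                timeout=30,
            )
        except (subprocess.TimeoutExpired, OSError):
            # Leave the driver as None if the query could not be run.
            return info
        if result.returncode == 0:
            # One line per GPU; collapse duplicates.
            info["driver"] = sorted(set(result.stdout.split()))
    return info


print(node_fingerprint())
```

If the crashing node reports a different driver version from the nodes that run cleanly, that would support the driver hypothesis.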
I see, let me know if you find anything. I would be curious to know. Thanks!
Hi @edwardjjj and @allenzren, I've seen this type of error in the past, but I've not found any reliable way to reproduce it. However, it does seem to happen more frequently when using a very large number of parallel environments, e.g., 2048 envs vs. 1024, so I'm guessing it's an issue of overflowing buffers or some other type of memory issue.
Hi @allenzren. I'm trying to reproduce the fine-tuning results of the UNet diffusion policy on Furniture-Bench. I got these crashes after around 30 iterations.
Do you have some idea what might have gone wrong? Thank you very much.