jasonkyuyim / se3_diffusion

Implementation for SE(3) diffusion model with application to protein backbone generation
https://arxiv.org/abs/2302.02277
MIT License

Unexpected NaN Error Encountered When Training with NVIDIA GeForce RTX 4090 GPU #23

Closed: Z-MU-Z closed this issue 4 months ago

Z-MU-Z commented 1 year ago

I encountered an issue while training on a specific GPU (NVIDIA GeForce RTX 4090); when I use the same environment on a 3090, training runs smoothly without any errors. During training on the 4090, I received the following error message:

[2023-06-08 17:50:11,717][main][INFO] - [1]: total_loss=21.1373 rot_loss=3.0088 trans_loss=4.7970 bb_atom_loss=12.6966 dist_mat_loss=0.6349 examples_per_step=11.0000 res_length=162.0000, steps/sec=263.64957
Error executing job with overrides: []
Traceback (most recent call last):
...
Exception: NaN encountered

Z-MU-Z commented 1 year ago

During repeated tests, I also observed significant variation in the loss values during the first iteration. Here are the loss values for reference:

[2023-06-08 17:46:05,132][main][INFO] - [1]: total_loss=5423033104869406.0000 rot_loss=5423033104869354.0000 trans_loss=3.3963 bb_atom_loss=39.0931 dist_mat_loss=9.6093 examples_per_step=3.0000 res_length=291.0000, steps/sec=5.74526

[2023-06-08 17:50:11,717][main][INFO] - [1]: total_loss=21.1373 rot_loss=3.0088 trans_loss=4.7970 bb_atom_loss=12.6966 dist_mat_loss=0.6349 examples_per_step=11.0000 res_length=162.0000, steps/sec=263.64957

[2023-06-08 17:55:54,079][main][INFO] - [1]: total_loss=7.0063 rot_loss=2.9029 trans_loss=4.1034 bb_atom_loss=0.0000 dist_mat_loss=0.0000 examples_per_step=4.0000 res_length=268.0000, steps/sec=235.68194

However, regardless of the initial loss variation, the training process consistently encounters the error on the second iteration.
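Given that rot_loss is the term that explodes in the first log above, one way to localize the failure is to check each loss component for finiteness before it is folded into total_loss. This is only a sketch and assumes the training step has the individual loss tensors in scope; the names below are illustrative, not necessarily the repo's actual variables.

```python
import torch

def assert_finite_losses(loss_terms: dict, step: int) -> None:
    """Raise as soon as any individual loss component goes non-finite."""
    for name, value in loss_terms.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"step {step}: non-finite {name} = {value}")

# Illustrative usage inside the training step, before summing into total_loss:
# assert_finite_losses({"rot_loss": rot_loss, "trans_loss": trans_loss,
#                       "bb_atom_loss": bb_atom_loss, "dist_mat_loss": dist_mat_loss}, step)
```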

jasonkyuyim commented 1 year ago

Hi,

Regarding the NaNs with different GPUs, I don't know what's happening and can't offer help. I train with A6000 or A100 on my servers, so it's likely something weird with CUDA. I would expect significant variance in the loss on the first iteration: the batch sizes can be pretty small, so you may get a bad batch on the first step. As long as training stabilizes it should be fine. What error are you running into on the second iteration? Is it the same NaN error, i.e. something tied to the GPU device?

Z-MU-Z commented 1 year ago

I found that after the first backward pass of the loss, many .grad tensors became NaN, but I don't know the specific reason yet.
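One way to chase this down, sketched here under the assumption of a standard PyTorch training loop (not taken from the repo's code): enable autograd anomaly detection so the backward pass reports the forward op that produced the first NaN, and then list which parameters end up with non-finite gradients.

```python
import torch

# Anomaly detection records forward tracebacks so backward can point at the op
# that first produced a NaN. It slows training a lot; enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def report_nan_grads(model: torch.nn.Module) -> None:
    """Call right after loss.backward() to list parameters with non-finite gradients."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite grad in {name}")
```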

jasonkyuyim commented 1 year ago

Hi, we just posted an update that affects the rotation score learning. Please take a look at our README. I'm not sure whether it's related to your problem, but it might help.

amorehead commented 1 year ago

@Z-MU-Z, I personally have not encountered an error like this while reproducing the results for this model. The issue might be related to your local CUDA version or (more specifically) your GPU. Are your local CUDA drivers up to date?
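As a quick sanity check when comparing the 3090 and 4090 machines (a generic PyTorch snippet, not from this thread), it can help to confirm that the PyTorch build, its CUDA toolkit, and cuDNN are recent enough to fully support the 4090 (Ada, compute capability 8.9):

```python
import torch

# Dump the build/runtime information that most often differs between two machines.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```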

jasonkyuyim commented 4 months ago

Closing due to inactivity.