Closed: maobenz closed this issue 8 months ago.
Could you please provide some more details, such as your specific settings, device information, and so on?
Thanks a lot!
I tried different resolutions of the BDD images, but the step_loss is always NaN. I use a single video clip from BDD and split it into frames that are fed into the model. I have tried both an RTX 3090 and an A100.
When I use the fp32 model, the step loss is not NaN, but the fp16 model's loss is still NaN. In the last block of the upsample_block, the values of query @ key.transpose(-1, -2) grow too large for fp16 and the result becomes NaN.
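In case it helps others, here is a minimal sketch of why this happens (the shapes and magnitudes are made up for illustration, not taken from the actual model): fp16 tops out at roughly 65504, so once the attention logits exceed that range the values overflow to inf, and the softmax then turns inf into NaN:

```python
import torch

# Hypothetical query/key tensors with large activations, as can happen
# deep in the upsample blocks (scale chosen only to trigger the overflow).
q = torch.randn(1, 8, 64, 64) * 60.0
k = torch.randn(1, 8, 64, 64) * 60.0

logits = q @ k.transpose(-1, -2)          # computed in fp32: large but finite
print(logits.abs().max())                 # typically > 65504 here

# fp16 can only represent values up to ~65504, so casting (or computing the
# matmul directly in half precision) overflows to inf ...
logits_fp16 = logits.half()
print(torch.isinf(logits_fp16).any())     # tensor(True)

# ... and softmax over a row containing +inf produces NaN (inf - inf).
# The upcast to float is only so this snippet also runs on CPU; the inf
# values survive the cast.
probs = logits_fp16.float().softmax(dim=-1)
print(torch.isnan(probs).any())           # tensor(True)
```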
My model ID is "stabilityai/stable-video-diffusion-img2vid-xt", but when I tried other model IDs it also didn't work.
My torch version is 1.13.1+cu116 and my diffusers version is 0.25.0. Even when I feed all zeros as input, the loss is still NaN.
OK, I have found the issue: the torch version should be 2.0.1 rather than 1.13.1. After changing the PyTorch version, the problem was solved.
Ah, I see. To be honest, I can't say why changing the PyTorch version would cause this issue. 😢
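One plausible explanation (my own assumption, not verified in this thread): diffusers picks its attention implementation based on the PyTorch version. On torch >= 2.0 it defaults to AttnProcessor2_0, which calls torch.nn.functional.scaled_dot_product_attention and can use fused kernels that keep the softmax statistics in fp32; on torch 1.13 that function does not exist, so diffusers falls back to an explicit fp16 matmul + softmax that can overflow exactly as shown above. A quick sanity check of which path your UNet is on:

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

# Model id taken from this thread; fp16 to match the failing setup.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="unet",
    torch_dtype=torch.float16,
)

# On torch >= 2.0 this should report AttnProcessor2_0; on torch 1.13
# scaled_dot_product_attention is missing and diffusers uses the plain
# matmul/softmax attention processor instead.
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))
print({type(p).__name__ for p in unet.attn_processors.values()})
```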
I upgraded PyTorch to 2.1.2 but still have this problem; I can only train in bf16. Any solutions?
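Not a fix, but a note on why bf16 works where fp16 fails: bf16 keeps the same 8-bit exponent as fp32, so the oversized attention logits stay finite; it only has fewer mantissa bits. If bf16 is acceptable for your training, a minimal autocast step looks like this (model, optimizer, loss_fn, and the batch keys are placeholders, not code from this repo):

```python
import torch

def training_step(model, batch, optimizer, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # bf16 autocast: activations run in bfloat16, whose fp32-sized exponent
    # range avoids the inf/NaN overflow seen with fp16 attention logits.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        pred = model(batch["inputs"])           # hypothetical batch keys
        loss = loss_fn(pred, batch["targets"])
    loss.backward()   # no GradScaler needed for bf16, unlike fp16
    optimizer.step()
    return loss.detach()
```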
Hi, did you find any solution? I hit a similar problem: the loss is NaN.
Hello, thanks for your brilliant work! When I run the code, the step loss always equals NaN when I use the BDD dataset. After carefully checking the code, I found that the output of the last block of the upsample_block is NaN. I use the fp16 model and follow the pipeline. Could anyone tell me what the reason is?
Thanks a lot!
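For anyone still chasing this, here is a small debugging sketch (not from this repo; the names are generic) that registers forward hooks on every submodule and reports the first one whose output goes non-finite. Since child modules finish their forward before their parents, the first hit is the deepest offender, which is how you can pin the NaN to the last upsample block:

```python
import torch

def find_first_nonfinite(model):
    """Attach forward hooks that report the first module whose output
    contains NaN or Inf, then detach all hooks."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if (torch.is_tensor(t) and t.is_floating_point()
                        and not torch.isfinite(t).all()):
                    print(f"first non-finite output: {name} "
                          f"({type(module).__name__})")
                    for h in handles:   # stop reporting after the first hit
                        h.remove()
                    return
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Hypothetical usage: attach once, then run a single forward/training step
# and read off the offending module name.
# handles = find_first_nonfinite(unet)
# ... run one step ...
```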