rohitgandikota / sliders

Concept Sliders for Precise Control of Diffusion Models
https://sliders.baulab.info
MIT License

NaN loss at fp16 #30

Open AI-Casanova opened 9 months ago

AI-Casanova commented 9 months ago

Attempting to train on a Colab T4, which requires fp16 precision.

All scripts report NaN loss after the first Network update, no matter the hyperparameters I choose.

Any advice for future experimentation?
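For context, the standard first mitigation for fp16 NaNs during training is dynamic loss scaling via `torch.cuda.amp`. A minimal sketch below with a stand-in model, optimizer, and data (not this repo's training loop):

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(16, 1).to(device)   # stand-in for the slider network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for step in range(10):
    x = torch.randn(8, 16, device=device)
    target = torch.randn(8, 1, device=device)
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in fp16 under autocast
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)         # unscales grads; skips the step if they are inf/NaN
    scaler.update()                # adapts the scale factor over time
```

`scaler.step` skips the optimizer update whenever the unscaled gradients contain inf/NaN, so one bad step doesn't poison the weights.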

rohitgandikota commented 9 months ago

Huh, weird. But is bfloat16 working for you?

AI-Casanova commented 9 months ago

The T4 does not support bfloat16 natively.
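For reference, PyTorch can confirm this at runtime:

```python
import torch

# Returns False on a T4 (compute capability 7.5); True on Ampere and newer.
print(torch.cuda.is_bf16_supported())
```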

Will try float32 when I get back to a workstation.

It is quite odd: I get finite loss readings on the first iteration, but as soon as the network kicks in, both the loss and the latents are all NaN.
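A small sketch of how one might localize where the NaNs first appear (the helper name is mine, not from the repo):

```python
import torch

# Raise at the first backward op that produces NaN/inf, with a traceback
# pointing at the offending layer (slow; debugging only).
torch.autograd.set_detect_anomaly(True)

def report_nan(name: str, t: torch.Tensor) -> None:
    # Call this on the loss / latents each iteration to see where NaNs start.
    if torch.isnan(t).any() or torch.isinf(t).any():
        print(f"{name}: contains NaN/inf (dtype={t.dtype})")
```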

rohitgandikota commented 9 months ago

https://huggingface.co/stabilityai/sdxl-vae/discussions/6

Found this issue on the Stability AI repo. It seems to be an issue with fp16 support in diffusers; even I could not get it working.
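One workaround I have seen sketched (not from this repo; the model id and prompt below are just illustrative) is to run the denoising loop in fp16 but upcast the VAE to float32 for the final decode:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Stop before decoding so the decode can be done in full precision manually
latents = pipe(prompt="a photo of a dog", output_type="latent").images

pipe.vae.to(torch.float32)  # the stock SDXL VAE overflows in fp16
with torch.no_grad():
    image = pipe.vae.decode(
        latents.to(torch.float32) / pipe.vae.config.scaling_factor
    ).sample  # tensor in [-1, 1]; postprocessing to PIL omitted
```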

AI-Casanova commented 9 months ago

I tried side-loading the fp16-fix VAE (which I also use for Kohya LoRA training), but even that was failing.

rohitgandikota commented 9 months ago

SDXL in regular inference seems to be doing it for me as well (simple inference, no sliders). I am not sure what the reason is.

Essentially the first step looks good, but after that I am just seeing black images. I looked at the tensors; they are all NaN values.

AI-Casanova commented 9 months ago

The VAE that came packaged with SDXL has activations that are too large for fp16.

People who do fp16 inference use the SDXL 0.9 VAE or this rescaled one: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
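A minimal sketch of swapping it into an SDXL pipeline with diffusers (standard usage, as far as I know):

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# The rescaled VAE keeps activations within fp16 range, so no upcasting is needed.
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
image = pipe(prompt="a photo of a dog").images[0]
```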

loboere commented 6 months ago

I am trying to train on a Colab T4, but I am getting this error. I also have this problem, but I'm trying to train on SD 1.4. Did anyone solve it?

GalDude33 commented 1 month ago

It works when running with float32 precision.