We are training a 3D diffusion model on medical data using this library and are facing some instabilities during training. In more detail: after 80+ hours of training, image quality stagnates or even decreases, the sampled volumes are still noisy, and the loss curve no longer decreases. On the assumption that the model is stuck in a local minimum, we started increasing the learning rate whenever we encounter this behavior. Has anyone run into similar instabilities, and/or do you have ideas on how to deal with them? We are also experimenting with the batch size (max 2 due to 3D memory constraints), the channel size (96 or lower, again due to 3D), the U-Net architecture itself, training with or without learned variance, and different learning rates (currently starting at 1e-6). Any suggestions or ideas are very much appreciated!
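To make the plateau-triggered learning-rate change concrete, here is a minimal, framework-agnostic sketch of the scheduling logic we mean. All names, defaults, and the boost factor are illustrative and not part of the library; note that the more common convention (e.g. PyTorch's `ReduceLROnPlateau`) is to *decrease* the learning rate on a plateau with a factor < 1, whereas we are currently increasing it to try to escape the suspected local minimum:

```python
class PlateauLRBooster:
    """Toy sketch of a plateau-triggered learning-rate change.

    Watches the training loss; if it fails to improve by at least
    `min_delta` for `patience` consecutive checks, the learning rate
    is multiplied by `factor`. With factor > 1 this matches our
    "increase LR on plateau" experiment; factor < 1 would give the
    usual ReduceLROnPlateau-style decay. All values are illustrative.
    """

    def __init__(self, lr=1e-6, factor=2.0, patience=5, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")   # best loss seen so far
        self.stale = 0             # checks since the last improvement

    def step(self, loss):
        """Record one loss measurement and return the (possibly updated) LR."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr
```

In a real training loop the returned value would be written into the optimizer's parameter groups after each validation check; the class itself only holds the plateau-detection bookkeeping.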