guided diffusion super resolution network training is diverging

Hello everyone,

I am working with guided diffusion. I would like to reproduce the results of the repository for the 64->256 super resolution network.

My issue is that the upsampled images look good for the first 5000 iterations, but then the loss rapidly increases and the output is only noise from 6000 iterations on.

The difference between my trained model and the authors provided model, is that I don't do any conditioning, so I removed the classifier. The training batch in this way only provides the "low_res" image, and not the output from the classifier "y".

What do you think is wrong here? do you have some hints for debugging?

Please have a look at the produced samples and loss at: https://i.imgur.com/upuzHPg.png (I couldn't embed the image on)

EDIT: my batch size is 3, whereas it seems the original repo has batch size 256. This could be a culprit. However, I have no idea which kind of GPU can load a batch size of 256. As a second question. Which GPU can handle such a massive batch size of 256? is it a multi GPU configuration?

My goal is to produce a solid baseline to work on.

Thanks in advance to anyone who helps,

Stefano

openai / guided-diffusion

guided diffusion super resolution network training is diverging #73