Open benjamin-bertram opened 1 year ago
Have you solved the problem? I've got the same one.
I also found degradation in image quality after finetuning on the same dataset (I'm using LSUN horse 256 resolution)
I am also facing a similar problem!
Same here ...
Hello, I have found that predicting the target ($x_0$) instead of the noise ($\epsilon$) dramatically reduces the phenomenon. Have you tried setting predict_xstart to true?
Looking forward to your feedback, Stefano
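For reference, the two parameterizations are related by the standard DDPM forward identity, so $x_0$ can always be recovered from a noise prediction; a minimal NumPy sketch (variable names are illustrative, not guided-diffusion code):

```python
import numpy as np

def xstart_from_eps(x_t, eps, alpha_bar_t):
    """Recover x_0 from a noise prediction by inverting the forward identity
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

# Round trip: noising x_0 and inverting recovers it exactly.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 8, 8))
eps = rng.normal(size=(3, 8, 8))
a_bar = 0.5
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
print(np.allclose(xstart_from_eps(x_t, eps, a_bar), x0))  # True
```

The identity holds exactly for the true noise, but during sampling `eps` is a network prediction, so any error in it is amplified by the `1/sqrt(alpha_bar_t)` factor at noisy timesteps.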
@zengxianyu how do you finetune? Thanks
Hi, for me, only predicting the mean (instead of the mean+variance) by setting learn_sigma=False solved the problem.
Just training longer worked for me.
@stsavian @Walleeeda
For me, training longer and predict_xstart=True have not solved the problem (I am using the LSUN Church Outdoor dataset). I am training with learn_sigma=False now, although I had saved it for last, since the paper shows that predicting the variance should help.
Update: none of the solutions suggested here works for me; I always get weird tints.
Are there any additional tricks to use while sampling from models that have been trained with predict_xstart=True? Currently, the samples are just pitch-black images. It is also worth mentioning that the loss $q_0 \ll q_3$ in this case (which is reversed in the default case of predict_xstart=False).
Same here; I am using the LSUN bedroom model.
Hi, I have solved the problem (technically, @stsavian's idea, but I will try to put forth my observations).
TL;DR: the solution is to predict $x_0$ (predict_xstart=True) while also trying out several hyperparameters (notably image_size, num_channels, num_head_channels). Also, for me, rescale_learned_sigmas=False worked better.
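As a concrete illustration, the flags above map onto guided-diffusion's training script roughly as follows (a sketch: the paths and numeric values are placeholders, not a recommendation, and should be tuned per dataset):

```shell
# Hypothetical invocation of guided-diffusion's training script;
# flag names follow the repo's defaults, values are illustrative.
python scripts/image_train.py \
    --data_dir /path/to/dataset \
    --image_size 64 \
    --num_channels 128 \
    --num_head_channels 64 \
    --predict_xstart True \
    --rescale_learned_sigmas False
```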
Some prior context: my custom dataset has a black background, with the content in different colours (imagine the MNIST dataset, but with three-channel images and digits in different colours).
The sampling process calls q_posterior_mean_variance(), which requires $x_0$ (i.e. $x_{\mathrm{start}}$). The default training setting predicts $x_0$ from the predicted noise (see here), which is not that accurate (I observed that, from noise, it predicts a uniform background for $x_0$ but cannot predict the exact background colour). However, this default setting might work well on a dataset with enough background diversity, when trained for longer with proper hyperparameter settings.
Another training setting (predict_xstart=True) attempts to predict $x_0$ directly instead of the noise, and is hence better at estimating $x_0$ during sampling. However, there may be training instability (limited model expressivity and NaN loss). For me, with incorrect hyperparameter settings, it was a complete collapse into all-black samples with no content.
Sometimes in training I get weird color schemes in my pictures, while the original data has no tints at all. Is there a reason for this, and how can I avoid it?
Original data is like: [image]
And the output is: [image]