openai / guided-diffusion


Does max_beta=0.999 in cosine schedule make any sense? #42

Open unrealwill opened 2 years ago

unrealwill commented 2 years ago

Hello,

https://github.com/openai/guided-diffusion/blob/8fb3ad9197f16bbc40620447b2742e13458d2831/guided_diffusion/gaussian_diffusion.py#L36-L45

I understand this is a faithful implementation of the paper https://arxiv.org/pdf/2102.09672.pdf, but I don't see how max_beta=0.999 could not be a bug.
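For context (paraphrasing the linked lines from memory, so treat the exact code as my reconstruction rather than a verbatim quote), the cosine schedule is built roughly like this:

```python
import math
import numpy as np

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    # beta[i] = 1 - alpha_bar((i+1)/T) / alpha_bar(i/T), clipped at max_beta
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

# cosine alpha_bar from the Improved Diffusion paper, with offset s = 0.008
cosine_alpha_bar = lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

betas = betas_for_alpha_bar(1000, cosine_alpha_bar)
print(betas[-1])  # 0.999: the final step runs into the clip
```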

In my personal loose implementation of this paper, I had to set max_beta = 0.02, which is the end point of the linear schedule, to get working results.

In equation (13) of the Improved Diffusion paper, mu(x_t, t) = 1/sqrt(alpha[t]) * ( x_t - beta[t] / sqrt(1 - alphabar[t]) * eps_theta(x_t, t) )

At the start of the reverse diffusion process, when t = T:

x_t is Normal(0,1),
eps aims to be Normal(0,1),
beta[t] = clipped_value = 0.999,
alpha[t] = 1 - beta[t] = 0.001,
1/sqrt(alpha[t]) ~ 31.6,
alphabar[t] ~ 0 because the forward process has forgotten the initial x0,
beta[t] / sqrt(1 - alphabar[t]) ~ 1

This means that variance(mu(x_t, t)) ~ 30, which in turn means that the variance of x[t-1] is ~ 30.

In equation (9) of the paper, the neural network inputs are constructed as x_t = sqrt(alphabar[t])*x0 + sqrt(1-alphabar[t])*eps, which has variance roughly ~ 1.

This means that all the sampling will be done from samples with variance ~30, while the network was trained on samples with variance around 1. Even if the model normalizes its input internally, this skews the scale of the predicted variance, and the diffusion process ends up dominated by the first few steps because the network will predict a variance ~30 times too small.
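To make the numbers above concrete (the alphabar value here is just an illustrative stand-in for "essentially zero"):

```python
import numpy as np

beta_T = 0.999                 # the clipped value at the final step
alpha_T = 1.0 - beta_T         # 0.001
alphabar_T = 1e-20             # essentially 0: x0 has been forgotten by t = T

print(1.0 / np.sqrt(alpha_T))              # ~31.6
print(beta_T / np.sqrt(1.0 - alphabar_T))  # ~1.0

# so mu(x_T, T) ~ 31.6 * (x_T - eps_hat), i.e. tens of times larger than the
# unit-variance inputs the network sees during training
```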

In my personal loose implementation I decided to predict the noise (Ho-style) instead of predicting mu, as you seem to have chosen here, and therefore I am much more sensitive to this bug.

But even when predicting mu directly: if you predict mu correctly, this means you will get out of the training zone during the diffusion process (which you seem to mitigate with (dubious?) clipping), and if you predict it incorrectly because its weight is low (by sheer luck?), it just adds noise to the training process.

In the paper you explain that max_beta should be < 1 to avoid singularities, but can you clarify the reasoning behind max_beta = 0.999 rather than some other value in the range [0.02, 0.999]?

Thanks

singwang-cn commented 1 year ago

I also noticed that x_(t-1) can grow to a very large value, causing generation to fail, because of 1/sqrt(alpha[t]) ~ 31.6 (both DDPM and DDIM generation suffer from the same problem). A temporary workaround is to skip the last 20-40 time steps in generation. I am still looking for a proper solution to this problem.
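In case it is useful, this is roughly what I mean by skipping those steps (a sketch with a placeholder reverse_step, not the actual sampler code; I am skipping the highest-t steps, where beta hits the 0.999 clip):

```python
import torch

def reverse_step(x, t):
    # stand-in for the actual DDPM/DDIM reverse step, only here to make the sketch runnable
    return x

num_timesteps = 1000
skip = 30                        # somewhere in the 20-40 range
x = torch.randn(8, 3, 64, 64)    # the usual N(0, I) starting noise

# start the reverse loop below T - 1, so the steps where beta is clipped at
# 0.999 (and 1/sqrt(alpha[t]) ~ 31.6) are never executed
for t in reversed(range(num_timesteps - skip)):
    x = reverse_step(x, t)
```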

MaxxP0 commented 1 year ago

Has this issue been fixed?

stsavian commented 1 year ago

@unrealwill sorry for my possibly trivial question, when you say:

In my personal loose implementation I have decided to use the prediction of the noise (Ho-style) instead of the prediction of mu as you seem to have chosen here, and therefore I am much more sensitive to this bug.

What do you mean when you say they are trying to predict mu (as in eq. 11 of Improved Diffusion)? According to the code, it seems to me that they are predicting either the noise (epsilon) or x_start (x_0). This can be seen here (line 410 of script_util.py):
model_mean_type=(gd.ModelMeanType.EPSILON if not predict_xstart else gd.ModelMeanType.START_X)

Am I missing something? Does it make sense to you?

Thanks, Stefano

unrealwill commented 1 year ago

@stsavian I was just saying that the bug manifests itself more evidently when one is trying to predict epsilon instead of x_0. But even if the code converges when predicting x_0 despite the bug, as far as I understand it should still hinder training performance, because it adds noise in an uncontrolled fashion.
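To be precise about the terminology: whichever of epsilon or x_0 the network outputs, the sampler still has to turn it into the posterior mean mu of equation (13) before drawing x_{t-1}. A rough sketch of that conversion as I understand it (names and signature are mine, not the repo's):

```python
def predicted_mu(x_t, t, model_out, predict_xstart,
                 sqrt_alphabar, sqrt_one_minus_alphabar,
                 posterior_mean_coef1, posterior_mean_coef2):
    """Turn the network output (epsilon or x_0) into the posterior mean mu."""
    if predict_xstart:
        x0_hat = model_out
    else:
        # invert x_t = sqrt(alphabar)*x0 + sqrt(1-alphabar)*eps  (equation 9)
        x0_hat = (x_t - sqrt_one_minus_alphabar[t] * model_out) / sqrt_alphabar[t]
    # (the repo can also clip x0_hat, the mitigation mentioned above; omitted here)
    # posterior mean of q(x_{t-1} | x_t, x0_hat), equations (11)-(13)
    return posterior_mean_coef1[t] * x0_hat + posterior_mean_coef2[t] * x_t
```

Algebraically this is the same mu as in equation (13), so the large 1/sqrt(alpha[t]) factor at the final step is there whichever target the network was trained on; predicting epsilon just makes the blow-up show up immediately.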

stsavian commented 1 year ago

@unrealwill thanks for your kind reply! Indeed I am having some problems getting the model to converge. I would like to explain my experiments to you in more detail. With my settings, I found that predicting the noise always gives me some wrong estimations in areas with uniform backgrounds, as in https://github.com/openai/guided-diffusion/issues/81; predicting x_0 instead seems to lead to better sampling. After seeing your issue I tried to compare the linear schedule and the cosine schedule, to see if performance changes. To me, both schedules lead to the same performance, so I wonder whether the problem really is max_beta=0.999 with the cosine schedule. I think there might be a complex interplay between the noise schedule, the number of sampling steps, and the type of data.

My data is a matrix of floats. Normalizing the data (x_0) to 1 (max(x_0) = 1) seems to reduce the phenomenon when used in conjunction with predicting the target, whereas dividing by the dataset standard deviation worsens performance. Also, the loss values (simply the MSE between the target and x_t) can change a lot depending on the type of normalization. However, I find the loss values (up to a multiplying factor) not particularly indicative of the quality of the produced samples.

So, all of this is to say that: i) I am having trouble understanding whether there should be a specific relationship between the input data values (x_0) and the noise added (beta); ii) predicting the noise is supposed to be equivalent to predicting x_0, so am I stunting my model with certain hyperparameters?; iii) I am now running some experiments with extreme schedules, e.g. linear with very low (or very large) beta, and cosine with different betas.
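As a small sanity check before running those experiments, I am looking at how many steps of the cosine schedule actually hit a given max_beta clip and how strongly the final reverse step amplifies (a rough sketch, reusing the schedule formula from the paper; the values scanned are just my choices):

```python
import math
import numpy as np

T = 1000

def cosine_betas(T, max_beta, s=0.008):
    f = lambda t: math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return np.array([min(1 - f(i + 1) / f(i), max_beta) for i in range(T)])

# number of clipped steps and the 1/sqrt(alpha) factor of the final step,
# for different max_beta clips
for max_beta in [0.02, 0.1, 0.5, 0.999]:
    b = cosine_betas(T, max_beta)
    print(max_beta, (b == max_beta).sum(), 1 / math.sqrt(1 - b[-1]))
```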

Hopefully, this could be helpful for someone and help me as well, Stefano

unrealwill commented 1 year ago

@stsavian It seems you are in the numerical debugging phase of your development. The thing is that one bug can hide another, and it's not until you have eliminated them all that you will get good convergence. If you have convergence problems with the linear schedule, you probably have an additional bug which needs to be resolved first.

For efficient bug hunting I usually like to build a sequence of increasingly complex code and datasets, starting from something as simple as possible that converges properly, and then morphing it into something more complex while maintaining convergence along the way. I find this faster than fiddling with an exponential number of setting combinations (but if you have infinite compute you can probably spin up a grid search to find good settings). The cosine schedule with max_beta=0.999 didn't make the cut into my code base; it smells fishy to me, and I'd advise using a different default.

stsavian commented 1 year ago

@unrealwill thanks for your advice! I will make good use of it!

jamesheald commented 1 year ago

@stsavian @unrealwill Can I ask what conclusion you came to / what you ended up doing here? I am having the same problem. When I predict the noise using the cosine noise schedule with max_beta = 0.999, the magnitude of samples from the reverse process scales with the number of diffusion steps (reaching orders of magnitude in the hundreds or thousands). I don't generate sensible samples when I train the model this way (my samples look like noise; if I predict x_0 instead of epsilon things look better). This is my first time implementing a DDPM, so I'm not sure what I'm doing wrong.

joaolcguerreiro commented 1 year ago

I'm curious about this. Does anyone have an answer already? Is there an error in the paper?

Also, should betas be clipped at both the upper and lower bounds? Should there be a beta_min, like 0? Or should betas simply be clip(betas, max_beta)?