xinmiaolin opened this issue 2 years ago
@xinmiaolin hi, it looks like your training run diverged early
the loss should go down to around 0.05 before the images come into view
how high is your learning rate?
The learning rate is 4e-3. Yes, the training loss does drop very precipitously from around 1 to 0.1 in the first epoch, but then it increases sharply to 0.8 again. After that, the training loss decreases steadily. By 0.05, do you mean the loss per image? The batch size I used is 128.
Thank you very much!
@xinmiaolin do you mean 3e-4? because 4e-3 is absurdly high!
@xinmiaolin would recommend 1e-4
@xinmiaolin yea, the loss should be the MSE, which is averaged across image samples
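to be concrete, the mean-reduced MSE doesn't scale with batch size, so 0.05 means the same thing at batch size 128 as at any other. a rough sketch (the tensor names are just placeholders, not the repo's actual variables):

```python
import torch
import torch.nn.functional as F

# stand-in tensors: predicted noise vs. target noise for a batch of 128 images
pred_noise   = torch.randn(128, 3, 64, 64)
target_noise = torch.randn(128, 3, 64, 64)

# reduction='mean' (the default) averages over the batch AND all pixels,
# so the value is comparable regardless of the batch size you train with
loss = F.mse_loss(pred_noise, target_noise, reduction='mean')
print(loss.item())  # ~2.0 for unrelated noise; a trained model should head toward ~0.05
```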
yup, when losses jump up like that, it means the training is unhealthy. some of these jumps are not recoverable
I will try a lower learning rate to see if this kind of jump appears again.
@xinmiaolin you'll definitely see something
DDPMs are so much easier to train than their predecessor, GANs
Thanks, this definitely gives me hope haha
@xinmiaolin did it work?
Hi, I don't think it is working...
Here is the run with lr=1e-4: although it is diverging, the loss can still decrease.
This is with lr=3e-6: the loss oscillates a lot and does not decrease at all; it even increases.
I am confused why the loss fluctuates so much when the lr is smaller. I am now training with lr=1e-5 and it looks good. I will keep updating!
@xinmiaolin how high is your batch size?
the batch size is 128
ohh that should be good enough
ok, keep at it with 1e-5, i'd be surprised if that didn't work
ok, I will come back and update. Thanks
@xinmiaolin i'll add a learning rate warmup for the decoder trainer some time this weekend; just need to figure out how to make it huggingface accelerate compatible
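in the meantime, a plain linear warmup would look roughly like this (a minimal sketch with a placeholder model and optimizer, not the trainer's actual API; warmup_steps is just an assumed value):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model     = torch.nn.Linear(8, 8)   # placeholder -- substitute your decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 1000  # assumed value, tune to taste

# ramp the lr linearly from ~0 up to the base value over warmup_steps, then hold it
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(5000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()  # stand-in for the diffusion loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # one scheduler step per optimizer step

# with huggingface accelerate, the scheduler can be passed through accelerator.prepare(...)
```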
Ok thanks!
This is the training with lr=1e-5; I also used CosineAnnealingLR with T_max=10000. The loss increases from around 0.3 to 0.6 and then does not decrease at all.
For lr=3e-5, the loss increases from around 0.2 to 0.7, then decreases slowly.
I have tried smaller learning rates without the lr scheduler, but the loss also fluctuates, for example with lr=3e-7.
I am not sure what is going on. Could there be some diffusion-model parameters that should be changed? I am using the default values for all of them.
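For reference, the scheduler setup described above looks roughly like this (sketched with a placeholder optimizer rather than the full training script):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(torch.nn.Linear(8, 8).parameters(), lr=1e-5)

# torch's argument is T_max: the lr follows a cosine curve from the base value
# (1e-5 here) down to eta_min over T_max scheduler steps
scheduler = CosineAnnealingLR(optimizer, T_max=10000, eta_min=0.0)

# in training, call scheduler.step() once per optimizer step;
# get_last_lr() shows the lr the scheduler is currently applying
print(scheduler.get_last_lr())
```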
@xinmiaolin hmm, could you possibly send me your full training script?
sure, thanks!
Did you check your optimizer? I think it may be because your parameters are not receiving gradients from the backward pass. I ran into this problem before: the loss decreased slowly, and I figured out there was something wrong in optimizer.py.
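A quick way to sanity-check that is something like the following (sketched with a placeholder model and optimizer; substitute your decoder and its optimizer):

```python
import torch

model     = torch.nn.Linear(8, 8)   # placeholder -- substitute your decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = model(torch.randn(4, 8)).pow(2).mean()   # stand-in for the diffusion loss
loss.backward()

# 1) every trainable parameter should have received a gradient after backward()
no_grad = [n for n, p in model.named_parameters() if p.requires_grad and p.grad is None]
print("params with no grad:", no_grad)

# 2) the weights should actually move after optimizer.step()
before = {n: p.detach().clone() for n, p in model.named_parameters()}
optimizer.step()
stuck = [n for n, p in model.named_parameters() if torch.equal(p, before[n])]
print("params that did not change:", stuck)
```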
Hi thank you for the suggestion. I will check on it.
@xinmiaolin Have you solved the problem, or have you managed to train the model yet? I have the same problem.
Facing the same problem, would love to know if you found out the issue! :)
cc: @xinmiaolin
@xinmiaolin Same for me, did you find out the root cause of that behaviour?
Did you figure it out?? I met the problem too.
Hi,
I am trying to train a decoder on the CUB-200 dataset. However, after nearly 400 epochs, the decoder still only generates noise, and I have been trying for a long time to figure out why. I would appreciate your suggestions!
So the training loss decreases pretty steadily, but the images still look like this:
I have only used 1 unet, and its configuration is: dim = 128, image_embed_dim = 768, dim_mults = (1, 2, 4, 8). For the decoder, the parameters are: image_size = 64, timesteps = 1000, image_cond_drop_prob = 0.1, text_cond_drop_prob = 0.5, learned_variance = False.
Thank you!
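For reference, wiring those values together looks roughly like this (a sketch that just mirrors the parameter names above; the exact Unet/Decoder signatures may differ between versions of dalle2_pytorch, and the forward call assumes precomputed CLIP image embeddings):

```python
import torch
from dalle2_pytorch import Unet, Decoder

unet = Unet(
    dim = 128,
    image_embed_dim = 768,
    dim_mults = (1, 2, 4, 8),
)

decoder = Decoder(
    unet = unet,
    image_size = 64,
    timesteps = 1000,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5,
    learned_variance = False,
)

# one training step: the decoder forward returns the diffusion (MSE) loss,
# here fed with random stand-in images and CLIP image embeddings
images      = torch.randn(4, 3, 64, 64)
image_embed = torch.randn(4, 768)
loss = decoder(images, image_embed = image_embed)
loss.backward()
```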