fire2323 opened 3 months ago

When using the experiments script (SDv1.5, stage 2) `laion6.25_sd_baseline_8node_guidance1.75_lr5e-7_seed10_dfake10_diffusion1000_gan1e-3_noode_resume_fixdata.sh`, gen_cls_loss (the g loss) keeps increasing while guidance_cls_loss (the d loss) decreases. Is this the case in your training? (PS: in stage 1, the FID from your paper was reproduced with your code.)

The training curves look like the following:

[training loss curves]

and the FID curve looks like the following:

[FID curve]

I expected the g loss to decrease as the d loss does. But the FID keeps decreasing until it is close to (a bit above) the level in your paper. I use a smaller batch size than yours; would this affect the g loss behavior?

And, I noticed that the weighted g loss is one order of magnitude smaller than the DMD loss: the g loss is around 0.001 while the DMD loss is around 0.01. Is this the case in your training, and is it expected?

Thanks!
My impression is that the COCO FID in the low-guidance regime is not super meaningful, so I didn't spend too much time analyzing it...
> I expected the g loss to decrease as the d loss does. But the FID keeps decreasing until it is close to (a bit above) the level in your paper. I use a smaller batch size than yours; would this affect the g loss behavior?
If the d loss decreases, that means the classifier can classify the generated samples well, so the g loss is expected to get larger, right? To make the g loss smaller, I remember I tried a larger gen_cls_loss_weight, but this led to training instability.
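To make that coupling concrete, here is a minimal numeric sketch using the standard non-saturating GAN losses (an illustrative form only; the exact classification loss used in DMD2 may differ):

```python
import math

# Assumed standard (non-saturating) GAN losses, just for illustration.
for d_fake in [0.5, 0.3, 0.1]:  # discriminator's probability that a fake is real
    d_loss = -math.log(1.0 - d_fake)  # discriminator loss on fake samples
    g_loss = -math.log(d_fake)        # non-saturating generator loss
    print(f"D(fake)={d_fake:.1f}  d_loss={d_loss:.3f}  g_loss={g_loss:.3f}")
# As the discriminator improves (D(fake) drops), d_loss falls while g_loss rises.
```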
> And, I noticed that the weighted g loss is one order of magnitude smaller than the DMD loss: the g loss is around 0.001 while the DMD loss is around 0.01. Is this the case in your training, and is it expected?
The values are not exactly comparable. Basically, we have a mean reduction here: https://github.com/tianweiy/DMD2/blob/0f8a481716539af7b2795740c9763a7d0d05b83b/main/sd_guidance.py#L241, so the gradient norm corresponding to the DMD loss is actually scaled down (by the number of elements in the tensor).
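A quick sketch of the effect (hypothetical tensor shapes; not the DMD2 code itself):

```python
# A mean reduction divides the gradient by the element count, so raw loss
# values from differently-reduced losses are not directly comparable.
import torch

pred = torch.randn(4, 4, 64, 64, requires_grad=True)  # e.g. a latent-shaped tensor
target = torch.randn_like(pred)

loss_mean = 0.5 * ((pred - target) ** 2).mean()
loss_mean.backward()
grad_mean_norm = pred.grad.norm().item()

pred.grad = None
loss_sum = 0.5 * ((pred - target) ** 2).sum()
loss_sum.backward()
grad_sum_norm = pred.grad.norm().item()

# grad_mean_norm == grad_sum_norm / pred.numel(): a 10x smaller loss value
# does not imply a 10x weaker training signal.
print(grad_mean_norm, grad_sum_norm / pred.numel())
```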
Thanks for sharing! Another thing I am curious about is CFG. Did you try to train a CFG UNet for the generator (and/or the fake diffusion model), as you did for the real diffusion model (i.e. a scale of 8 for SDXL)? I am wondering three things:

1. Will CFG training bring a further performance improvement or not?
2. Can a generator trained without CFG be used for inference with CFG?
3. If the generator (and/or the fake diffusion model) is trained with some CFG guidance scale (e.g. 8), can the final generator be used for inference with other guidance scales (e.g. 6 or 9)? Would it be necessary to train different models for different specific guidance scales, and why?
I tried a few ways to train the fake diffusion model with CFG too, but I didn't manage to get better results than without CFG.
I think probably not, because the current prediction target of the distilled model is no longer correlated with a score (it is a sharp image). Though I might be wrong; I haven't followed recent developments too closely.
I think it might be possible to do guidance conditioning (i.e., the generator gets an extra parameter indicating the current guidance scale). I briefly explored this in the past but didn't get it to work (a generator trained with varying guidance is quite a bit worse than one trained with a single guidance). It is probably a bit challenging, considering that all other few-step generators are mostly fixed to a single guidance scale. But it might be worth exploring.
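A rough sketch of what guidance conditioning could look like (purely hypothetical; none of these names exist in the DMD2 code):

```python
import torch
import torch.nn as nn

# Hypothetical guidance conditioning: embed the guidance scale and add it to
# the timestep embedding so one generator can be trained across many scales.
class GuidanceEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, guidance_scale: torch.Tensor) -> torch.Tensor:
        # guidance_scale: (batch,) floats, e.g. sampled uniformly from [2, 10]
        # during training; set to the single desired scale at inference.
        return self.mlp(guidance_scale.unsqueeze(-1))

# Usage idea: time_emb = time_emb + guidance_emb(guidance_scale), before the
# embedding enters the UNet blocks.
```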
Got it, thanks. How about the performance when training the generator with a single guidance? Does it work at inference?
> training the generator with a single guidance? Does it work at inference?
What does this mean? We are already training with a single guidance now?
Oh, sorry for the misunderstanding. From your description:

> a generator trained with varying guidance is quite a bit worse than one trained with a single guidance

I supposed that you had trained the generator with a single guidance, which leads to the question above: if that is the case, how does the performance compare to a generator trained without guidance? (To clarify: when I say "the generator", I mean self.feedforward_model in class SDUniModel (line 39 of the code); is that also what "generator" means in your reply?)

Thank you again for your great work! :-)
> If that is the case, how does the performance compare to a generator trained without guidance?

Could you elaborate on this? If we don't apply guidance to the real UNet (aka train the generator without guidance), the images are really, really bad.
There are 3 UNets in the paper:

- NET 1: the generator (feedforward_model), which is the output network used at inference;
- NET 2: the real UNet;
- NET 3: the fake UNet.
When I say "the generator" in my post above, I mean NET 1 (not NET 2), and I mean applying guidance on NET 1 (also not NET 2):

> If that is the case, how does the performance compare to a generator trained without guidance?

In the code, guidance is applied to NET 2 (the real UNet), but not to NET 1. I am asking how NET 1 would compare between a) being trained with guidance applied on NET 1 and b) being trained without guidance applied on NET 1, as in the current code. Below is the related code, which does not apply guidance to NET 1: generated_noise is used directly to produce x0, without applying guidance to generated_noise.
```python
generated_noise = self.feedforward_model(
    noisy_image,
    timesteps.long(),
    text_embedding,
    added_cond_kwargs=unet_added_conditions,
).sample

generated_image = get_x0_from_noise(
    noisy_image,
    generated_noise.double(),
    self.alphas_cumprod.double(),
    current_timesteps,
).float()
```
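For concreteness, applying guidance on NET 1 would instead look something like this (hypothetical sketch in the style of the snippet above; `uncond_embedding` and `guidance_scale` are assumed inputs and are not in the DMD2 code):

```python
# Hypothetical CFG applied to NET 1 (the generator) -- NOT what the code does.
noise_cond = self.feedforward_model(
    noisy_image, timesteps.long(), text_embedding,
    added_cond_kwargs=unet_added_conditions,
).sample
noise_uncond = self.feedforward_model(
    noisy_image, timesteps.long(), uncond_embedding,
    added_cond_kwargs=unet_added_conditions,
).sample
# Standard CFG combination, applied to the generator's own prediction.
generated_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```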
I see. I never tried this setting, and theoretically I don't know what applying guidance to NET 1 would mean mathematically for these one-/few-step samplers.