madebyollin / taesd

Tiny AutoEncoder for Stable Diffusion

Generated image in blur while training Autoencodertiny Decoder #19


phyllispeng123 commented 1 month ago

Hi, this is Phyllis. I really appreciate your marvelous work on SD-VAE compression and acceleration!! I am currently training an AutoencoderTiny decoder from scratch using the LDM training setup, but I find that the generated images are not very clear. The training is exactly the same as LDM training apart from your extra loss `distance(disc(real).mean(), disc(fake).mean())` in my decoder/generator objective (that extra loss indeed helped with stability and FID, many thanks!!). I train the decoder using SD1.5 encoder outputs as input for 190k steps with batch_size=4 and lr=1e-4, but the generated images are still not clear. Would you mind giving some hints (which losses you used, how many training steps, any finetuning stage?) on how to match your TAESD results?

This is the TAESD result: [image]

This is my result: [image]

madebyollin commented 1 month ago

Since the output is a mix of blurry and sharp portions, it looks like the scale of your adversarial loss / gradients is probably too low compared to the reconstruction (MSE/MAE) loss / gradients.

You could try just scaling up the discriminator loss (`10 * distance(disc(real).mean(), disc(fake).mean())` or something), or using the automatic scaling from https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/losses/contperceptual.py#L32.
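For concreteness, here is a minimal sketch of the manual-scaling option. The tensors, the placeholder discriminator, and the exact form of the `distance` term are stand-ins for whatever your LDM-style training step already computes:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins; in practice these come from your existing training step.
real = torch.rand(4, 3, 64, 64)                        # ground-truth images
fake = torch.rand(4, 3, 64, 64, requires_grad=True)    # decoder output
disc = torch.nn.Conv2d(3, 1, 3)                        # placeholder discriminator

rec_loss = F.mse_loss(fake, real)                      # existing reconstruction loss (MSE/MAE)
# Stand-in for the distance(disc(real).mean(), disc(fake).mean()) term;
# keep whatever form of `distance` your setup already uses.
adv_loss = (disc(real).mean() - disc(fake).mean()).abs()

# Scale the adversarial term up so its gradients are comparable in magnitude
# to the reconstruction gradients; 10.0 is just a starting guess to tune.
adv_weight = 10.0
g_loss = rec_loss + adv_weight * adv_loss
g_loss.backward()
```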

I also posted a simplified example training notebook that uses only adversarial loss here: https://github.com/madebyollin/seraena/blob/main/TAESDXL_Training_Example.ipynb (it also does some automatic gradient scaling); you could try taking the adversarial loss from that and combining it with your reconstruction losses.
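Roughly, both automatic-scaling approaches pick the adversarial weight per step by comparing gradient norms at the decoder's final layer, so neither term dominates. A hedged sketch of that idea (not a verbatim copy of either implementation; the layer name in the usage comment is hypothetical):

```python
import torch

def adaptive_adv_weight(rec_loss, adv_loss, last_layer_weight, max_weight=1e4):
    """Choose a weight for the adversarial loss so its gradient on the
    decoder's final layer has roughly the same norm as the reconstruction
    gradient (the idea behind LDM's calculate_adaptive_weight)."""
    rec_grads = torch.autograd.grad(rec_loss, last_layer_weight, retain_graph=True)[0]
    adv_grads = torch.autograd.grad(adv_loss, last_layer_weight, retain_graph=True)[0]
    weight = rec_grads.norm() / (adv_grads.norm() + 1e-4)
    return weight.clamp(0.0, max_weight).detach()

# Usage (hypothetical layer name):
# d_weight = adaptive_adv_weight(rec_loss, adv_loss, decoder.out_conv.weight)
# g_loss = rec_loss + d_weight * adv_loss
```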