huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

VAE training sample script #3726

Open zhuliyi0 opened 1 year ago

zhuliyi0 commented 1 year ago

I believe the current lack of easy access to VAE training is stopping diffusion models from disrupting even more industries.

I'm talking about consistent details on things that are less represented in the original training data. 64x64 latent resolution can only carry so much detail. Very often I get a good result from latent space (by checking the low-res intermediate image) before the final image is ruined by bad details. No prompting or finetuning or ControlNet could solve this issue. I tried, and I know lots of other people tried, and most of them are trying without realising that the problem cannot be solved unless the thing that produces the final details can be trained with their domain data.

Right now the VAE cannot be easily trained, at least not by someone like me who is not very good at math and Python, so there is definitely a demand here. May I hope for a sample script based on diffusers to start with? I tried messing with the ones in the compvis repo but to no avail. Thanks in advance!

patrickvonplaten commented 1 year ago

Currently I don't have the bandwidth to dive deeper into this, but I agree an easy training script for VAEs would make sense :-)

Let's see if the community has time for it!

aandyw commented 1 year ago

Definitely would love to dive deeper into this but would love some guidance if possible.

aandyw commented 1 year ago

Update: VAE training script runs successfully but I'll need to test on a full dataset and evaluate the results.

@zhuliyi0 Is there a dataset you would like me to try fine-tuning on? Preferably one hosted on Hugging Face?

zhuliyi0 commented 1 year ago

wow super cool! I was planning to train a VAE to re-create certain architectural styles with consistent details, so I found this dataset on HF:

https://huggingface.co/datasets/Xpitfire/cmp_facade

Not a big dataset though, so I'm not sure if it works for you. Also, there are images with extreme aspect ratios. Let me know if there are more specific requirements on the dataset and I will try to find/assemble a better one.

aandyw commented 1 year ago

@zhuliyi0 No worries and thanks for responding. I might be a little busy this week but I'll try it out with the new dataset and see if the VAE is improving in terms of learning the new data.

zhuliyi0 commented 1 year ago

I got the script to run, but it looks like my 12GB of VRAM is far from enough. I assume VRAM usage will go down once Adam8bit and other optimizations are in place?

aandyw commented 1 year ago

@zhuliyi0 Perhaps but I can't really confirm anything at the moment. I'm basing hardware requirements on the docs (https://huggingface.co/docs/diffusers/training/text2image):

Using gradient_checkpointing and mixed_precision, it should be possible to finetune the model on a single 24GB GPU. For higher batch_size’s and faster training, it’s better to use GPUs with more than 30GB of GPU memory.

But this is obviously for training the Stable Diffusion model so the requirements will be different for sure.
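
For what it's worth, here is a rough sketch of the kind of memory savings being discussed (mixed precision, gradient checkpointing, 8-bit Adam), assuming the script uses accelerate and a diffusers AutoencoderKL; the model id and hyperparameters are placeholders:

    import bitsandbytes as bnb
    from accelerate import Accelerator
    from diffusers import AutoencoderKL

    accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)

    vae = AutoencoderKL.from_pretrained(
        "stabilityai/stable-diffusion-2-1", subfolder="vae"
    )
    vae.enable_gradient_checkpointing()  # recompute activations in the backward pass to save VRAM

    # 8-bit Adam stores optimizer state in 8 bits, greatly reducing optimizer memory
    optimizer = bnb.optim.AdamW8bit(vae.parameters(), lr=1e-4)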

At this time, I'm trying to confirm that the AutoencoderKL is indeed being fine-tuned with reasonable performance before actually implementing further techniques like EMA weights, MSE focused loss reconstruction + EMA weights, etc. (details are here: https://huggingface.co/stabilityai/sd-vae-ft-mse-original).

If you would like to work on this PR together I would appreciate the help, since I may be a little MIA for the next 2 weeks at most.

zhuliyi0 commented 1 year ago

I am a total newbie at Python and ML. I am still trying to run the script on my local GPU. The OOM is gone now that I stick to the arguments you provided in the script, and VRAM usage and training speed are fine, but there is an error when saving the validation image: it basically says an image file inside a wandb temp folder cannot be found. I checked and there is no such folder. I don't know how to use wandb to debug this one.

Colab seems to be running without error, but the speed is a bit slow compared to my local GPU, probably normal for a T4. From the validation images, I see signs of improvement in the image details I was talking about; I will validate with inference after a reasonably sized training run has finished.

zhuliyi0 commented 1 year ago

I got training to run on my local GPU on Windows. The directory error was due to path naming conventions on Windows. Again, from the validation images I can see it was learning, and the loss was also going down.

I noticed there is a VRAM leak in the log_validation function when the number of test images is 5 or above. I also failed to use the trained VAE inside a1111 for inference; it gives the error "Missing key(s) in state_dict".

aandyw commented 1 year ago

Hey @zhuliyi0, thanks for taking the time to test things. The script is definitely not perfect yet, but I'll work on the things you mentioned. As for transferring the VAE over to a1111, I'm not quite sure about that; I haven't played around with a1111 so I would need some time.

My current focus will be to clean up the script and implement the memory saving techniques to improve training. Then I'll see how we can make the VAE transferrable to a1111.

zhuliyi0 commented 1 year ago

Totally understand that the script wouldn't be perfect at this point. I am glad to help whenever I can. I will try using the pipeline to test inference performance. @Pie31415

zhuliyi0 commented 1 year ago

here is a training test run:

https://wandb.ai//zhuliyi0/goa_5e5/reports/VAE-training-test--Vmlldzo0ODYzMzcx

I also did a quick inference test using a finetuned model that was trained on the same dataset, comparing results between the default and the trained VAE. I can confirm the VAE is adding details, making the image better.

Another issue: the output from the trained VAE looks white-washed. This happens on both sd15 and the finetuned model. I had to apply some brightness and contrast changes to the image. The validation images during training do not have this issue.

aandyw commented 1 year ago

here is a training test run:

https://wandb.ai//zhuliyi0/goa_5e5/reports/VAE-training-test--Vmlldzo0ODYzMzcx

Your wandb experiment seems to be private/locked.

I can confirm VAE is adding details, making the image better.

Are you referring to the default VAE or custom trained one? If it is a custom trained one can you provide a link to the weights? It'll be extremely beneficial to have some results to compare to when I'm fixing up experiments for the script.

Another issue: the output from trained VAE looks white-washed. This happens on both sd15 and the finetuned model. I had to do some brightness and contrast change to the image. The validation images during training do not have this issue.

Hmm yeah, it may be how we're training the VAE. I'll take a look over the weekend. Most likely the substantial changes will have to be done this weekend since I'm a little preoccupied before then.

Thanks a lot for your patience though. 🤗

zhuliyi0 commented 1 year ago

I made the project public. And the weight file:

https://drive.google.com/file/d/1gTQqWuVA7m7GYIStVbulYS-tN_CMY-PM/view?usp=sharing

Some inference images that show the white-wash issue, using the VAE at steps 4k - 40k, gradually getting worse:

https://drive.google.com/drive/folders/16ivRLiLgb7dDixfFbNIL7vf_wNe9BaRO?usp=sharing

ThibaultCastells commented 1 year ago

Hello! This project is really cool, thank you. I noticed a potential mistake in the code: the kl loss is applied on the output, but I think it should be applied on the latent space, if I understood correctly (I may be wrong, I am not an expert in VAE training). However, using it gives me bad results; I think it is because it changes the latent space organization too much (in the end I use it with a really small coefficient).

The lpips loss gives great results however (without it, the image tends to become too 'smooth'). I used this library. I hope this helps!

    # snippet from the training loop; assumes accelerator, vae, args, optimizer, lr_scheduler,
    # train_dataloader, weight_dtype, logger and first_epoch come from the surrounding script
    import lpips
    import torch.nn.functional as F

    lpips_loss_fn = lpips.LPIPS(net='alex').to(accelerator.device)

    for epoch in range(first_epoch, args.num_train_epochs):
        vae.train()
        train_loss = 0.0
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(vae):
                target = batch["pixel_values"].to(weight_dtype)

                # https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/autoencoder_kl.py
                posterior = vae.encode(target).latent_dist
                z = posterior.mode()
                pred = vae.decode(z).sample

                kl_loss = posterior.kl().mean()
                mse_loss = F.mse_loss(pred, target, reduction="mean")
                lpips_loss = lpips_loss_fn(pred, target).mean()

                logger.info(f'mse:{mse_loss.item()}, lpips:{lpips_loss.item()}, kl:{kl_loss.item()}')

                loss = mse_loss + args.lpips_scale * lpips_loss + args.kl_scale * kl_loss

                # Gather the losses across all processes for logging (if we use distributed training).
                avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
                train_loss += avg_loss.item() / args.gradient_accumulation_steps

                accelerator.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

aandyw commented 1 year ago

I noticed a potential mistake in the code: the kl loss is applied on the output, but I think it should be applied on the latent space [...]

Thanks for the feedback, that definitely might be the case. I'll take a look and make the necessary changes. Thanks again.

aandyw commented 1 year ago

@zhuliyi0 I updated the PR with @ThibaultCastells's code. Can you give your training another try and let us know the results? (e.g., is the white-washing issue improved?)

Also, I took a look at the VRAM issue you mentioned with test_images >= 5. I can't seem to reproduce it; can you give more details if you're still experiencing this issue?

@ThibaultCastells I've credited the recent commit to you and I plan to mention your contribution in the PR as well.

aandyw commented 1 year ago

@patrickvonplaten Do you mind giving the PR a look over when you're free?

ThibaultCastells commented 1 year ago

@Pie31415 thank you very much! I will let you know if I have other improvement suggestions

ThibaultCastells commented 1 year ago

By the way:

However using it gives me bad results, I think it is because it changes too much the latent space organization (in the end I use it with a really small coefficient)

With a scale coefficient around $10^{-7}$ and a long enough training run (using my own dataset), the image quality first got much worse and then came back to normal, so I think my assumption about 'latent space reorganization' was right. The kl loss went from >20,000 to ~100 by the time it converged.
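
To put that coefficient in perspective: with a kl loss around 20,000 and kl_scale = 1e-7, the KL term contributes roughly 20,000 × 1e-7 = 2e-3 to the total loss, the same order of magnitude as a typical MSE term in these runs, so it nudges the latent space without drowning out the reconstruction losses.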

zhuliyi0 commented 1 year ago

@Pie31415 I re-ran a training with the new script; the result was perceptibly no different. The white-wash issue still exists, the same as before. It seems like the training gradually makes the contrast lower and the brightness higher, but not by much.

@ThibaultCastells do you mean "learning rate" when you say "coefficient"?

ThibaultCastells commented 1 year ago

No, I meant the coefficient that multiplies the loss term (kl_scale): loss = mse_loss + args.lpips_scale * lpips_loss + args.kl_scale * kl_loss

Note that by default kl_scale and lpips_scale are 0, so if you didn't change them you won't see any difference (I suggest using lpips_scale = 0.1, as this is the value used to finetune the VAE of SD).

NoahRe1 commented 1 year ago

I noticed that there is no transforms.Normalize([0.5], [0.5]) applied to the images in the training script, and the output images seem to be correct. However, in other model training scripts, normalization is performed before using VAE. Is it an error in other scripts?
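
For context, the preprocessing in the other training scripts looks roughly like this (the resolution is a placeholder); Normalize([0.5], [0.5]) maps the [0, 1] tensors from ToTensor into the [-1, 1] range the VAE expects:

    from torchvision import transforms

    train_transforms = transforms.Compose(
        [
            transforms.Resize(512),              # smaller edge to 512, keep aspect ratio
            transforms.CenterCrop(512),          # then crop to a square
            transforms.ToTensor(),               # HWC uint8 -> CHW float in [0, 1]
            transforms.Normalize([0.5], [0.5]),  # [0, 1] -> [-1, 1]
        ]
    )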

aandyw commented 1 year ago

@ThibaultCastells Do you have any thoughts about why the VAE might be outputting white-washed reconstructions? I seem to have seen some Civitai models that had a similar issue. Not sure how it was resolved though.

aandyw commented 1 year ago

I noticed that there is no transforms.Normalize([0.5], [0.5]) applied to the images in the training script, and the output images seem to be correct. However, in other model training scripts, normalization is performed before using VAE. Is it an error in other scripts?

You're right. A blunder on my part. I guess it must have been removed when I was playing around with things and forgot to put it back. Thanks for the catch

ThibaultCastells commented 1 year ago

@Pie31415 I am not too surprised that this issue happens when using only the mse loss, because this is a very different training configuration than in the paper, so we don't know what to expect in this case. Therefore I would like to confirm whether @zhuliyi0 changed the default values of the loss scale coefficients when he checked the new code, and if so, what values were used.

Note that when they finetune the vae for SD they only finetune the decoder, that's probably why they do not use kl loss (they do not need it since the decoder does not affect the latent space).

Also, not related but is it normal that there is no .eval() when evaluating the model (and therefore another .train() after evaluation)? Is it handled by the accelerator.unwrap_model function?

aandyw commented 1 year ago

@ThibaultCastells I'm wondering if it's a better idea if we finetune only the decoder.

Reading through the model card at https://huggingface.co/stabilityai/sd-vae-ft-mse-original, it seems like the reasoning is to maintain compatibility with existing models, which could explain why @zhuliyi0 was having issues loading the VAE into a1111.

Also, not related but is it normal that there is no .eval() when evaluating the model (and therefore another .train() after evaluation)? Is it handled by the accelerator.unwrap_model function?

Not sure. I've adapted the code from the previous training scripts, but the difference was that the unet, vae, etc. would be unwrapped and fed into the SD pipeline, which I assume does something similar to model.eval() for inference. Here I'm not feeding into the SD pipeline, so I'm not sure if triggering vae_model.eval() would change anything.

aandyw commented 1 year ago

Updates:

todo

Currently running model... Details:

mixed_precision="no"
pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" 
dataset_name="Xpitfire/cmp_facade" 
train_batch_size=1
gradient_accumulation_steps=4
gradient_checkpointing=true
kl_scale=1e-6 (default)
lpips_scale=1e-3

Results

Run summary:

kl 112.45438
lpips 0.04537
lr 0.0001
mse 0.00167
step_loss 0.00183

zhuliyi0 commented 1 year ago

I tried the new script with the kl and lpips scales not zero, and the white-wash issue seems to be gone; however, the test images are blurred and over-saturated. I'm still tuning more parameters and will try the new default values too. I kept gradient accumulation steps at 1 due to limited VRAM, not sure how much of an impact that would have.

zhuliyi0 commented 1 year ago

About whether to only train the decoder: the validation images during training do not have any of the issues that the test images had, white-wash or blur. I wonder if this is because of training the encoder part, making the latents from the VAE encoder deviate from what the text encoder and UNet expect?

Also, I want to make sure: is there a requirement on the folder structure of the training data? I see comments under --train_data_dir saying things about folder structure, but I assume that's just leftover code from other training scripts that require text prompts?

It might be helpful to dedicate a validation folder with specific images, because I have noticed that some images start off much worse than others, so they should preferably be monitored more closely during training. I will confirm this observation.

ThibaultCastells commented 1 year ago

@zhuliyi0

I tried the new scripts with kl and lpips not zero, and white wash issue seems to be gone

Great to hear that!

however the test images are blurred and over-saturated. Still tuning more parameters. Will try the new default values too.

For how long did you train? As I mentioned, for me it took some time to improve.

I kept gradient accumulation steps at 1 for limited VRAM, not sure how much of an impact that would be.

In my case, I first tried to maximize the batch size (max was 2, with image size 512 😢), then I picked the gradient accumulation so as to have the equivalent of a batch size of 32, which is usually a good number in my experience. So I use 16 accumulation steps.

Also I want to make sure is there a requirement on the folder structure of training data? I see comments under --train_data_dir saying things about folder structure but assume that's just leftover code from other training scripts that require text prompt?

Yes, I think it's leftover code. Another leftover (which does not matter much) is the use of the word 'noise' for the VAE input, probably from the unet script:

for _, sample in enumerate(test_dataloader):
    noise = sample["pixel_values"].to(weight_dtype)
    recon_imgs = vae_model(noise).sample
    images.append(
        torch.cat([sample["pixel_values"].cpu(), recon_imgs.cpu()], axis=0)
    )

@Pie31415

I'm wondering if it's a better idea if we finetune only the decoder. https://huggingface.co/stabilityai/sd-vae-ft-mse-original Reading through the above model card it seems like the reasoning is to maintain compatibility with existing models which could explain why @zhuliyi0 was having issues loading the vae into a1111.

I personally like to have the possibility to train both, but it would depend on each person's motivation for training the VAE. Why not add the option with a parameter like --decoder_only (keeping in mind that when training the decoder only, you should remove the kl loss, since the decoder does not affect the latents)? I do not know what a1111 is, but I doubt the encoder re-training has anything to do with code errors when loading the VAE somewhere else, as we just modify the weight values 🤔
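
A minimal sketch of what such a toggle could look like (the --decoder_only flag name is the one proposed above; attribute names assume the diffusers AutoencoderKL):

    if args.decoder_only:
        # freeze everything on the encoder side so the latent space stays untouched
        vae.encoder.requires_grad_(False)
        vae.quant_conv.requires_grad_(False)
        params_to_optimize = list(vae.decoder.parameters()) + list(
            vae.post_quant_conv.parameters()
        )
        args.kl_scale = 0.0  # the KL term is meaningless when the encoder is frozen
    else:
        params_to_optimize = vae.parameters()

    # pass params_to_optimize to the optimizer instead of vae.parameters()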

Removed random crop from transforms

Oh, I removed it at the beginning because my images are already cropped, but I was wondering why it's here. My assumption was that Resize does not resize to a square but makes the smaller side match the given size while preserving the aspect ratio, which would explain the Crop. I just checked the doc and my assumption seems correct:

size: Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e, if height > width, then image will be rescaled to (size * height / width, size).

So I think it's better to keep this for people who do not pre-crop their images (?)

lpips_scale default set to 1e-3 (based on https://github.com/cccntu/fine-tune-models/blob/main/run_finetune_vae.py)

Is there a reason not to use the fine-tuning value used here? In my experiments, 0.1 worked well (I didn't try 1e-3, so I cannot say anything about it).

add FID score

I would recommend using this library.
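
In case it helps, a small sketch using torchmetrics (just one option, not necessarily the library meant above); real_images and recon_images are placeholder float tensors in [0, 1] with shape (N, 3, H, W):

    from torchmetrics.image.fid import FrechetInceptionDistance

    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(accelerator.device)
    fid.update(real_images, real=True)    # reference images
    fid.update(recon_images, real=False)  # VAE reconstructions
    print(f"FID: {fid.compute().item():.3f}")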

try out NLL loss instead of MSE since original LDM uses NLL (https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/losses/vqperceptual.py)

I already tried it but got really bad results (maybe due to hyper-parameters?). If you want to try by yourself, you can use the posterior.nll function directly, as it is already implemented there.

zhuliyi0 commented 1 year ago

@ThibaultCastells

For how long did you train? As I mentioned, for me it took some time to improve.

I ran multiple sessions for 4 to 24+ hours, and the kl loss dropped from 1e6+ to around 300. The train loss dropped normally too. The big difference between the validation images and the inference test images is what's puzzling to me.

ThibaultCastells commented 1 year ago

The big difference between validation images and inference testing images is what's puzzling to me.

Can you tell me more about that? I didn't get what the issue is.

aandyw commented 1 year ago

Also, I want to make sure: is there a requirement on the folder structure of the training data? I see comments under --train_data_dir saying things about folder structure, but I assume that's just leftover code from other training scripts that require text prompts?

The --train_data_dir option is for when someone has a custom dataset that isn't uploaded to the Hugging Face Hub. The script requires you to pass a dataset either with --dataset_name or --train_data_dir.
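
Roughly, the two options map onto datasets.load_dataset like this (the local path is a placeholder):

    from datasets import load_dataset

    # --dataset_name: a dataset hosted on the Hugging Face Hub
    hub_dataset = load_dataset("Xpitfire/cmp_facade", split="train")

    # --train_data_dir: a local folder of images
    local_dataset = load_dataset("imagefolder", data_dir="path/to/your/images", split="train")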

It might be helpful to dedicate a validation folder with specific images because I have noticed that some images start off much worse than others, so preferably monitored more closely during training. Will confirm this observation.

Yeah, I was thinking of adding a parameter later on for a dedicated validation image folder, but currently --test_images x takes x samples from the train dataset that won't be used during training and uses them only for reconstruction.

zhuliyi0 commented 12 months ago

@ThibaultCastells

The big difference between validation images and inference testing images is what's puzzling to me.

The validation images look OK, slightly blurred if you look really closely, with no sign of overfitting all the way to the end. The test images from the inference pipeline are much more blurred, clearly over-saturated, and more artifacts pop out as training progresses. The issue is more pronounced with a bigger learning rate and a larger number of steps.

Again, I am wondering about the encoder being trained: does that mean I need to redo the unet and text encoder finetuning with the new VAE, so they can work together without issues? Or should the VAE encoder training be turned off to keep compatibility with existing checkpoints?

ThibaultCastells commented 12 months ago

The validation images look OK, slightly blurred if you look really closely, with no sign of overfitting all the way to the end. The test images from the inference pipeline are much more blurred, clearly over-saturated, and more artifacts pop out as training progresses. The issue is more pronounced with a bigger learning rate and a larger number of steps.

I have no idea where this comes from, I didn't observe this behavior 🤔

Again I am wondering about the encoder being trained: does that mean I need to redo the unet and text encoder finetuning with the new vae, so they can work together without issue? Or if the vae encoder training should be turned off to keep compatibility with existing checkpoint?

Yes, since the latent space is being modified you will need to fine-tune the unet as well, but there is no need to fine-tune the text encoder. Again, I think it is important to keep 'decoder-only training' optional, because this is a VAE training script, not a decoder training script.

aandyw commented 12 months ago

It's exactly as @ThibaultCastells says: the latent space is modified when we train the VAE. In the image below, the encoder maps the pixel space into a latent space z. Conditioned on inputs (e.g. text prompts, masks, source images, etc.), the UNet learns to denoise from z_T, the noise sampled in the latent space, back to a point in the latent space z, which the simultaneously trained decoder then decodes back into pixel space (an image).

[image: latent diffusion architecture diagram (encoder, latent diffusion process, conditioning, decoder)]

The reconstructed images you see on wandb are produced by the encoder-decoder alone, so the rest of the SD pipeline isn't involved. If you want to use this for inference, then yes, you'll have to fine-tune the unet as well.
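
For reference, a minimal sketch of plugging a fine-tuned VAE into the full pipeline for inference (the paths, model id, and prompt are placeholders; as discussed, if the encoder was trained, the UNet would also need fine-tuning for good results):

    import torch
    from diffusers import AutoencoderKL, StableDiffusionPipeline

    vae = AutoencoderKL.from_pretrained("path/to/trained-vae", torch_dtype=torch.float16)
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
    ).to("cuda")
    image = pipe("a modern building facade, detailed").images[0]
    image.save("sample.png")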

aandyw commented 12 months ago

@ThibaultCastells I do think it is important to have the decoder fine-tuning be the main focus, since a lot of people seem to be interested in "VAE" fine-tuning primarily for the decoder improvements for SD integration, which I believe was the focus of @zhuliyi0's issue in the first place. I can definitely look into having configs for toggling encoder training.

I'm also not knowledgeable enough about how people finetune and publish models like counterfeit, dreamshaper, etc., so I'm not sure if their custom VAE training also involves an aspect of UNet training; that is, when they publish ckpt files for VAEs, were they finetuned along with the UNet? If that's the case, I think it would be as simple as unfreezing the UNet, which could also be added as an optional parameter, but one that should be required if the encoder and decoder are both being trained.

TLDR; we should think about including options for:

  1. vae encoder-decoder + unet training;
  2. decoder training (to maintain compatibility with existing SD models)

Curious about your thoughts on this.

aandyw commented 12 months ago

@ThibaultCastells

The big difference between validation images and inference testing images is what's puzzling to me.

The validation images look OK, slightly blurred if you look really closely, with no sign of overfitting all the way to the end. The test images from the inference pipeline are much more blurred, clearly over-saturated, and more artifacts pop out as training progresses. The issue is more pronounced with a bigger learning rate and a larger number of steps.

Again, I am wondering about the encoder being trained: does that mean I need to redo the unet and text encoder finetuning with the new VAE, so they can work together without issues? Or should the VAE encoder training be turned off to keep compatibility with existing checkpoints?

I've updated the repo with a few small changes today. Can you pull them and try it out?

ThibaultCastells commented 12 months ago

I do think it is important to have the decoder fine-tuning be the main focus since a lot of people seem to be interested in "VAE" fine-tuning primarily for the decoder improvements for SD integration which I believe was the focus of @zhuliyi0's issue in the first place. I can definitely look into having configs for toggling encoder training.

Okay it's a fair point. I think it's fine as long as there is an option to train both encoder and decoder somehow.

I'm also not knowledgeable enough about how people finetune and publish models like counterfeit, dreamshaper, etc. so I'm not sure if their custom VAE training also involves an aspect of UNet training, that is, when they publish ckpt files for VAEs were they finetuned along with the UNet. If that's the case I think it would just be as simple as unfreezing the UNet which I also think can be added in as an optional parameter but one that should be required if the encoder and decoder are both being trained.

This seems extremely unlikely to me, for two reasons:

  1. it makes the training complex for nothing, as training them independently is not an issue
  2. it would require a lot of GPU memory. I doubt the average user can train both at the same time on their GPU. I mean, I am already limited to a batch size of 2.

I think it is much better to train the vae first, and then fine-tune the unet with the unet fine-tuning script if needed.

I've updated the repo with a few small changes today. Can you pull them and try it out?

I didn't have the time to try it, but I read the commits and everything looks good to me 👍🏼

ThibaultCastells commented 12 months ago

The validation images look OK, slightly blurred if you look really closely, with no sign of overfitting all the way to the end. The test images from the inference pipeline are much more blurred, clearly over-saturated, and more artifacts pop out as training progresses. The issue is more pronounced with a bigger learning rate and a larger number of steps.

@zhuliyi0 Oh, by test images you mean with the unet? I didn't catch that! Then yes it completely makes sense if you didn't fine-tune the unet.

ThibaultCastells commented 12 months ago

I think you may be interested in the new SDXL paper, "Improving Latent Diffusion Models for High-Resolution Image Synthesis", as they re-train a VAE that is better than the one from the previous SD.

2.4 Improved Autoencoder Stable Diffusion is a LDM, operating in a pretrained, learned (and fixed) latent space of an autoencoder. While the bulk of the semantic composition is done by the LDM [38], we can improve local, high-frequency details in generated images by improving the autoencoder. To this end, we train the same autoencoder architecture used for the original Stable Diffusion at a larger batch-size (256 vs 9) and additionally track the weights with an exponential moving average. The resulting autoencoder outperforms the original model in all evaluated reconstruction metrics, see Tab. 3. We use this autoencoder for all of our experiments.

EMA and a large batch size seem important, but again there is the memory issue... I think it may be important to solve the mixed precision issue now, to reduce GPU RAM utilisation.
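
For the EMA part, a rough sketch with the EMAModel helper that diffusers provides (the decay value and loop placement are illustrative):

    from diffusers.training_utils import EMAModel

    ema_vae = EMAModel(vae.parameters(), decay=0.9999)

    for step, batch in enumerate(train_dataloader):
        # ... forward, backward, optimizer.step() as in the training loop above ...
        ema_vae.step(vae.parameters())  # update the shadow weights after each optimizer step

    # before validation / saving, copy the averaged weights into the model
    ema_vae.copy_to(vae.parameters())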

zhuliyi0 commented 12 months ago

Then it totally makes sense to have the option of training/freezing the encoder part. A fully trained VAE will give the trainer more freedom to experiment, while a partially trained VAE would be valuable for keeping compatibility with other finetuned models. I will give the new version a try, and redo the finetuning to see if they will work together.

ThibaultCastells commented 12 months ago

I won't be able to work on this project this weekend, but I think I may be able to solve the fp16 issue next week if it's due to what I think.

aandyw commented 12 months ago

I won't be able to work on this project this weekend, but I think I may be able to solve the fp16 issue next week if it's due to what I think.

Any ideas? Haven't touched it in a while but I might have some time next week to look into it.

ThibaultCastells commented 12 months ago

@Pie31415 I think it may be due to a dtype issue: at some point something may be in the wrong data type.

bghira commented 12 months ago

Whatever is implemented, I just wanted to propose that L2 regularization or weight decay be optional, so that the activation values don't grow too high as they did with the SDXL 0.9 and 1.0 VAEs.

We might also want to look into more modern optimizers like Dadapt that allow us to set the LR to 1.
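
A tiny sketch of what that optimizer swap could look like, assuming the dadaptation package (parameter values are placeholders):

    import dadaptation

    # D-Adaptation estimates the step size itself, so lr is left at 1.0
    optimizer = dadaptation.DAdaptAdam(vae.parameters(), lr=1.0, weight_decay=0.0)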

zhuliyi0 commented 12 months ago

The current version shows the mse, kl, and lpips losses in separate graphs. Should all three be typical L-shaped curves? Right now I am only seeing the kl loss in that shape, while the other two jump up and down a lot with no noticeable decrease in mean value. Too high a learning rate / coefficient?

zhuliyi0 commented 12 months ago

BTW I came across this dataset:

https://www.kaggle.com/datasets/tompaulat/modernarchitecture?resource=download

This one is not on HF, but it is much better in quality, and is the type of data that is less present in the base model training. The dataset I am using has similar content and quality, but only several hundred images.

zhuliyi0 commented 12 months ago

@Pie31415

Also, I took a look at the VRAM issue you mentioned with test_images >= 5. I can't seem to reproduce the issue can you give more details on this if you're still experiencing this issue?

I found that this line of code inside log_validation causes the issue:

reconstructions = vae_model(x).sample

If I skip this line, the issue is gone, and the number of test images does matter. The number 5 I mentioned earlier is probably just for my amount of VRAM. To me it looks like the VRAM used by the VAE forward pass is somehow not properly released before the next loop begins. I guess it would be hard to fix if it's not reproducible in your environment. I also noticed that if I add a validation right before the training loop begins, I can reproduce this issue with only 1 test image.

Update: this is solved by adding torch.no_grad() before calling validation.
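
For reference, a minimal sketch of that fix (signature simplified); wrapping the loop in torch.no_grad() means no autograd graph is kept, and switching to eval()/train() also addresses the earlier question about evaluation mode:

    import torch

    def log_validation(vae_model, test_dataloader, weight_dtype):
        vae_model.eval()
        with torch.no_grad():  # no graph is built, so activations are freed every iteration
            for batch in test_dataloader:
                x = batch["pixel_values"].to(weight_dtype)
                reconstructions = vae_model(x).sample
                # ... log / save the reconstructions here ...
        vae_model.train()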