huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Community] Training AutoencoderKL #894

Closed AI-Guru closed 1 year ago

AI-Guru commented 2 years ago

Hi!

I am working on latent diffusion for audio and music. It seems to me that Diffusers 🧨 is the place to be! There is a feature I would like to request: Training AutoencoderKL (Variational Autoencoder).

What I would love to do is train AutoencoderKL on square and non-square images, with either one or more channels. I checked the implementation, and it seems to me that, due to its fully convolutional nature, this would be perfectly possible.

A good start would be a script/notebook that shows how to train AutoencoderKL on a Hugging Face dataset. In the long run it could even become a Trainer.
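
A minimal sketch of what this could look like with the current API (the architecture values and the 128x432 "spectrogram" shape below are made-up examples, not a recommended configuration):

    import torch
    from diffusers import AutoencoderKL

    # Small single-channel AutoencoderKL, e.g. for mel spectrograms.
    vae = AutoencoderKL(
        in_channels=1,
        out_channels=1,
        down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"),
        up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"),
        block_out_channels=(64, 128, 256),
        latent_channels=4,
        layers_per_block=1,
    )

    # Non-square, single-channel input; the model is fully convolutional, so height and
    # width only need to be divisible by the total downsampling factor (here 4).
    x = torch.randn(2, 1, 128, 432)
    posterior = vae.encode(x).latent_dist
    z = posterior.sample()
    recon = vae.decode(z).sample
    print(z.shape, recon.shape)   # torch.Size([2, 4, 32, 108]) torch.Size([2, 1, 128, 432])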

Yogiraj587 commented 2 years ago

@AI-Guru Can you please assign this issue to me?

patrickvonplaten commented 2 years ago

Hey @AI-Guru,

Cool idea! I'm not sure how much time our training experts will have for this (cc'ing @patil-suraj and @anton-l) but this would indeed be a very useful addition.

Also opening this one up to the Community in case anybody is interested :-)

patil-suraj commented 2 years ago

Nice idea,

But note that training the AutoencoderKL is a bit complicated and outside the scope of diffusers. It's complicated because the VAE in SD is trained with a GAN objective and has multiple losses (e.g. LPIPS and a discriminator loss), which requires implementing extra modules. For training the VAE, I think the best resource is the taming-transformers repo; it is the repo used to train the VAE in SD and has all the components required for training implemented.
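
For reference, a heavily simplified sketch of that combined objective (pixel reconstruction + LPIPS + KL + an adversarial term from a patch discriminator). This is not the taming-transformers implementation: the discriminator below is a crude stand-in for its NLayerDiscriminator, and the loss weights, optimizers and the lpips dependency are assumptions made for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import lpips                                   # pip install lpips
    from diffusers import AutoencoderKL

    vae = AutoencoderKL().to("cuda")               # pass your own architecture args in practice
    perceptual = lpips.LPIPS(net="vgg").to("cuda").eval()

    # Crude PatchGAN-style discriminator standing in for the real one.
    disc = nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 1, 4, padding=1),
    ).to("cuda")

    opt_g = torch.optim.Adam(vae.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

    def generator_step(x, kl_weight=1e-6, lpips_weight=1.0, adv_weight=0.5):
        # x: images in [-1, 1], shape (B, 3, H, W)
        posterior = vae.encode(x).latent_dist
        recon = vae.decode(posterior.sample()).sample

        rec_loss = F.l1_loss(recon, x)             # pixel reconstruction
        p_loss = perceptual(recon, x).mean()       # LPIPS perceptual loss
        kl_loss = posterior.kl().mean()            # KL regularizer, kept very small
        g_loss = -disc(recon).mean()               # try to fool the discriminator

        loss = rec_loss + lpips_weight * p_loss + kl_weight * kl_loss + adv_weight * g_loss
        opt_g.zero_grad(); loss.backward(); opt_g.step()
        return loss.item()

    def discriminator_step(x):
        with torch.no_grad():
            recon = vae.decode(vae.encode(x).latent_dist.sample()).sample
        d_loss = F.relu(1.0 - disc(x)).mean() + F.relu(1.0 + disc(recon)).mean()  # hinge loss
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        return d_loss.item()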

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jbmaxwell commented 1 year ago

Did you get this running @AI-Guru? I'd be very interested if you did... and if so, any script you can share? :)

eeyrw commented 1 year ago

Nice idea,

But note that training the AutoencoderKL is a bit complicated and outside the scope of diffusers. It's complicated because the VAE in SD is trained with a GAN objective and has multiple losses (e.g. LPIPS and a discriminator loss), which requires implementing extra modules. For training the VAE, I think the best resource is the taming-transformers repo; it is the repo used to train the VAE in SD and has all the components required for training implemented.

According to https://huggingface.co/stabilityai/sd-vae-ft-mse, it seems that to fine-tune the VAE used in Stable Diffusion, only LPIPS and MSE losses are required and no discriminator loss is needed. Is that true?

patil-suraj commented 1 year ago

Yeah, for fine-tuning I think they didn't use the discriminator, but the VAE is pre-trained using all three of those losses, as you can see in taming-transformers.
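
A minimal sketch of that fine-tuning recipe (MSE plus LPIPS, no discriminator), using the checkpoint linked above as a starting point. The LPIPS weight, learning rate, and the choice to only update the decoder are illustrative assumptions, not the exact recipe behind sd-vae-ft-mse:

    import torch
    import torch.nn.functional as F
    import lpips                                   # pip install lpips
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")
    perceptual = lpips.LPIPS(net="vgg").to("cuda").eval()

    # Update only the decoder so the latent space seen by the UNet stays unchanged.
    optimizer = torch.optim.Adam(vae.decoder.parameters(), lr=1e-5)

    def finetune_step(images, lpips_weight=0.1):
        # images: float tensor in [-1, 1], shape (B, 3, H, W)
        with torch.no_grad():
            latents = vae.encode(images).latent_dist.sample()
        recon = vae.decode(latents).sample
        loss = F.mse_loss(recon, images) + lpips_weight * perceptual(recon, images).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()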

jbmaxwell commented 1 year ago

Just wondering: is there a way to load AutoencoderKL from a config.json, but without the pretrained weights? Or a way to initialize it from scratch? I've been trying to fine-tune it on my specific data, but it's struggling to improve beyond a certain level, so I'd like to give it a shot "from scratch"... I can't find any way to do that (there is no AutoConfig -> from_config() in Diffusers).

Any help appreciated.

eeyrw commented 1 year ago

I do the same thing with UNet like this:

    from diffusers import UNet2DConditionModel

    unet = UNet2DConditionModel.from_config({
        "_class_name": "UNet2DConditionModel",
        "_diffusers_version": "0.6.0",
        "act_fn": "silu",
        "attention_head_dim": 8,
        "block_out_channels": [320, 640, 1280, 1280],
        "center_input_sample": False,
        "cross_attention_dim": 768,
        "down_block_types": [
            "CrossAttnDownBlock2D",
            "CrossAttnDownBlock2D",
            "CrossAttnDownBlock2D",
            "DownBlock2D"
        ],
        "downsample_padding": 1,
        "flip_sin_to_cos": True,
        "freq_shift": 0,
        "in_channels": 4,
        "layers_per_block": 2,
        "mid_block_scale_factor": 1,
        "norm_eps": 1e-05,
        "norm_num_groups": 32,
        "out_channels": 4,
        "sample_size": 64,
        "up_block_types": [
            "UpBlock2D",
            "CrossAttnUpBlock2D",
            "CrossAttnUpBlock2D",
            "CrossAttnUpBlock2D"
        ]
    })
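
If the goal (as asked above) is an untrained AutoencoderKL with a known architecture, the same from_config pattern should work; a short sketch, reusing the config of an existing checkpoint (the repo id is only an example):

    from diffusers import AutoencoderKL

    # Grab the architecture config of an existing checkpoint, then build the model
    # with freshly initialized weights instead of the pretrained ones.
    config = AutoencoderKL.load_config("stabilityai/sd-vae-ft-mse")
    vae = AutoencoderKL.from_config(config)
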
jbmaxwell commented 1 year ago

Hmm... I thought I tried from_config on the AutoencoderKL, but maybe not... I'll give it a shot, thanks.

@eeyrw, yeah absolutely. Thanks for that... I think I tried using AutoConfig followed by from_config(), but when I didn't find an AutoConfig in Diffusers I just stupidly gave up on from_config() as well. (-_Q)

Thanks again.

zhuliyi0 commented 1 year ago

Hmm... I thought I tried from_config on the AutoencoderKL, but maybe not... I'll give it a shot, thanks.

@eeyrw, yeah absolutely. Thanks for that... I think I tried using AutoConfig followed by from_config(), but when I didn't find an AutoConfig in Diffusers I just stupidly gave up on from_config() as well. (-_Q)

Thanks again.

Did you get it working? I am very interested in training the VAE too. I believe this is one part that is underdeveloped and is stopping Stable Diffusion from disrupting even more industries.

I'm talking about consistent details on things that are less represented in the original training data. A 64x64 latent resolution can only carry so much detail. Very often I get a good result from latent space (judging by the low-res intermediate image) before the final image is ruined by bad details. No prompting or fine-tuning will solve this issue; I tried, and I know lots of other people have tried, and most of them are trying without realising that the problem cannot be solved unless the thing that produces the final details can be trained on their domain data. The VAE cannot be easily trained, at least not by someone like me who is not very good at math and Python, so there is definitely a demand here.

May I hope there will be a sample script based on diffusers to start with? I tried messing with the ones in the CompVis repo but to no avail. Thanks in advance!

zhuliyi0 commented 1 year ago

I created a feature request here https://github.com/huggingface/diffusers/issues/3726

betterze commented 8 months ago

Is anyone working on this? Could you share example code? Thanks

Dexter-Wang commented 7 months ago

Training AutoencoderKL is just the same as training any other model. After training for 5 epochs, the loss is 0.0034.

dataset_path = "huggan/smithsonian_butterflies_subset", set image_size = 128, train_batch_size = 8, load the dataset and convert it to a dataloader.

The following is the training code:

    import torch
    import torch.nn as nn
    from diffusers import AutoencoderKL

    AeNet = AutoencoderKL().to("cuda")
    optimizer = torch.optim.AdamW(AeNet.parameters(), lr=Args.learning_rate)
    loss_func = nn.MSELoss().to("cuda")
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 40], gamma=0.5)

    for epoch in range(num_epochs):
        loss = 0
        total_loss = 0
        for step, bhData in enumerate(dataloader):
            if step % 10 == 0:
                print("step:", step)

            imgsIn = bhData["image"].to("cuda")

            imgsOut = AeNet(imgsIn, return_dict=False)[0]
            loss = loss_func(imgsIn, imgsOut)
            loss.backward()

            # backward, optimize
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss

        total_loss /= len(dataloader)      # average loss for this epoch
        scheduler.step()                   # adjust the learning rate during training
        print("Epoch:%4d, loss:%8.4f" % (epoch, total_loss.item()))

youyinnn commented 6 months ago

Training AutoencoderKL is just the same as training any other model. After training for 5 epochs, the loss is 0.0034.

dataset_path = "huggan/smithsonian_butterflies_subset", set image_size = 128, train_batch_size = 8, load the dataset and convert it to a dataloader.

The following is the training code:

    import torch
    import torch.nn as nn
    from diffusers import AutoencoderKL

    AeNet = AutoencoderKL().to("cuda")
    optimizer = torch.optim.AdamW(AeNet.parameters(), lr=Args.learning_rate)
    loss_func = nn.MSELoss().to("cuda")
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 40], gamma=0.5)

    for epoch in range(num_epochs):
        loss = 0
        total_loss = 0
        for step, bhData in enumerate(dataloader):
            if step % 10 == 0:
                print("step:", step)

            imgsIn = bhData["image"].to("cuda")

            imgsOut = AeNet(imgsIn, return_dict=False)[0]
            loss = loss_func(imgsIn, imgsOut)
            loss.backward()

            # backward, optimize
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss

        total_loss /= len(dataloader)      # average loss for this epoch
        scheduler.step()                   # adjust the learning rate during training
        print("Epoch:%4d, loss:%8.4f" % (epoch, total_loss.item()))

Are you sure this works? A VAE has multiple losses to be considered.

Dexter-Wang commented 6 months ago

Yes, it works. In AutoencoderKL there is a `def forward()`; it looks like this:

    def forward(
        self,
        sample: torch.FloatTensor,
        sample_posterior: bool = False,
        return_dict: bool = True,
        generator: Optional[torch.Generator] = None,
    ) -> Union[DecoderOutput, torch.FloatTensor]:

With sample_posterior = False, AutoencoderKL behaves as a plain AE, and False is its default value. So, it is an AE.
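
As a hedged sketch (reusing AeNet, imgsIn and loss_func from the training loop above), the flag roughly selects between sampling from the posterior and taking its mode; to train the model as an actual VAE instead of a plain AE, one would sample explicitly and add the KL term. The 1e-6 weight is an illustrative assumption, not a recommended value.

    # sample_posterior=True  -> z = posterior.sample()  (stochastic, VAE behaviour)
    # sample_posterior=False -> z = posterior.mode()    (deterministic, plain-AE behaviour)
    posterior = AeNet.encode(imgsIn).latent_dist
    z = posterior.sample()
    imgsOut = AeNet.decode(z).sample
    loss = loss_func(imgsOut, imgsIn) + 1e-6 * posterior.kl().mean()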

Dexter-Wang commented 6 months ago

By the way, UNet seems to be a good model for an AE, since it converges very fast and the loss is very small. However, it cannot work as an AE: the decoder in UNet cannot work independently, since the decoder's residuals come from the encoder. Therefore, the UNet cannot work as an AE.

Anyway, if the decoder's residuals were not taken from the encoder, it could work as an AE, but convergence is slow and the loss is not small.

youyinnn commented 6 months ago

By the way, UNet seems to be a good model for an AE, since it converges very fast and the loss is very small. However, it cannot work as an AE: the decoder in UNet cannot work independently, since the decoder's residuals come from the encoder. Therefore, the UNet cannot work as an AE.

Anyway, if the decoder's residuals were not taken from the encoder, it could work as an AE, but convergence is slow and the loss is not small.

I thought the AutoencoderKL was designed as a VAE. But I see what you mean: you literally just train an AE with the reconstruction loss. However, how well this performs remains to be discussed. NVM, I found the VAE training script from diffusers.

Leminhbinh0209 commented 6 months ago

I wrote fine-tuning code for the VAE of SD in this repo. Any contribution is welcome.

Dexter-Wang commented 6 months ago

By the way, training an AutoencoderKL on 512x512 images needs about 25GB of GPU memory, so I reduced the resolution to 128x128; it is only meant to prove that it works.

"AutoencoderKL encoder + forward diffusion process + AutoencoderKL decoder" is one kind of VAE. Please refer to https://huggingface.co/docs/diffusers/tutorials/basic_training (Train a diffusion model).

humanely commented 6 months ago

Nice idea,

But note that training the AutoencoderKL is a bit complicated and outside the scope of diffusers. It's complicated because the VAE in SD is trained with a GAN objective and has multiple losses (e.g. LPIPS and a discriminator loss), which requires implementing extra modules. For training the VAE, I think the best resource is the taming-transformers repo; it is the repo used to train the VAE in SD and has all the components required for training implemented.

How do I load the CompVis VAE in HF?

I get the following error:


Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 432, in load_config
    config_dict = cls._dict_from_json_file(config_file)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 554, in _dict_from_json_file
    text = reader.read()
           ^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 567, in from_pretrained
    config, unused_kwargs, commit_hash = cls.load_config(
                                         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 436, in load_config
    raise EnvironmentError(f"It looks like the config file at '{config_file}' is not a valid JSON file.")
OSError: It looks like the config file at 'ae-model.ckpt' is not a valid JSON file.
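
For context: from_pretrained expects a diffusers-format model folder (a config.json plus weight files), so pointing it at a raw CompVis .ckpt makes it try to parse the checkpoint as JSON, which is the error above. Recent diffusers releases provide a from_single_file helper for original checkpoints; a sketch, noting that availability and exact behaviour depend on the installed version:

    from diffusers import AutoencoderKL

    # Load an original (CompVis-style) checkpoint directly, if the installed
    # diffusers version supports it; otherwise convert it to diffusers format first.
    vae = AutoencoderKL.from_single_file("ae-model.ckpt")
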
gitlabspy commented 5 months ago

I wrote fine-tuning code for the VAE of SD in this repo. Any contribution is welcome.

Nice. But I think the GAN loss is important, and it's not implemented.

pwwwyyy commented 2 months ago

By the way, UNet seems to be a good model for an AE, since it converges very fast and the loss is very small. However, it cannot work as an AE: the decoder in UNet cannot work independently, since the decoder's residuals come from the encoder. Therefore, the UNet cannot work as an AE. Anyway, if the decoder's residuals were not taken from the encoder, it could work as an AE, but convergence is slow and the loss is not small.

I thought the AutoencoderKL was designed as a VAE. But I see what you mean: you literally just train an AE with the reconstruction loss. However, how well this performs remains to be discussed. NVM, I found the VAE training script from diffusers.

Can you share the VAE training script from diffusers? I cannot find it, thank you.