huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Kandinsky 2.1 #2985

Closed - patrickvonplaten closed this 1 year ago

patrickvonplaten commented 1 year ago

Model/Pipeline/Scheduler description

Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion while introducing some new ideas.

As its text and image encoder it uses CLIP, together with a diffusion image prior that maps between the latent spaces of the CLIP modalities. This approach improves the model's visual quality and opens up new possibilities for blending images and for text-guided image manipulation.

For the diffusion mapping between latent spaces we use a transformer with num_layers=20, num_heads=32, and hidden_size=2048.
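These dimensions line up with the defaults of the PriorTransformer already in diffusers (32 heads of dim 64 gives the 2048 hidden size). A minimal instantiation sketch; embedding_dim=768 is an assumption for illustration (the width of the CLIP embeddings being mapped), not a confirmed Kandinsky setting:

```python
from diffusers import PriorTransformer

# Sketch of a diffusion prior with the dimensions quoted above.
# 32 heads x 64 dims per head = 2048 hidden size.
prior = PriorTransformer(
    num_layers=20,
    num_attention_heads=32,
    attention_head_dim=64,
    embedding_dim=768,  # assumed CLIP embedding width, for illustration only
)
```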

Open source status

Provide useful links for the implementation

GitHub: https://github.com/ai-forever/Kandinsky-2

@cene555 @razzant

patrickvonplaten commented 1 year ago

The model handles multilingual prompts much better than previous models. It would be a super nice addition in case anybody wants to give it a try!

piEsposito commented 1 year ago

Can we start by just building it as a community pipeline? Or are you aiming to fully integrate it right from the start?

sayakpaul commented 1 year ago

Given the model's novelties and performance, we were hoping for a core integration.

piEsposito commented 1 year ago

> Given the model's novelties and performance, we were hoping for a core integration.

I can try. I'll probably ask for some guidance on the design, as I've never done that kind of contribution to diffusers, but I think it will be fun.

But hey, no time like the present, right?

patrickvonplaten commented 1 year ago

We can surely help (also hopefully in a more reactive manner than the textual inversion PR :sweat_smile: )

ayushtues commented 1 year ago

Hey @sayakpaul @patrickvonplaten @piEsposito, I've been learning about diffusion models and implementing them from scratch here, and I'd like to help integrate this model into diffusers to learn more about these models, if possible!

patrickvonplaten commented 1 year ago

Hey @ayushtues!

That's very cool, feel free to open a PR to add it. These docs should be helpful:

user425846 commented 1 year ago

Is there any update on integrating Kandinsky?

ayushtues commented 1 year ago

@user425846 I have yet to start working on this, but will begin soon.

ayushtues commented 1 year ago

Adding a Notion link here to track progress.

patrickvonplaten commented 1 year ago

Very cool Notion doc! Feel free to open a draft PR as soon as you have a part working; more than happy to guide you in the right direction :-)

ayushtues commented 1 year ago

Thanks @patrickvonplaten! I'll first try to build the Kandinsky model architecture out of existing diffusers/HF building blocks, with minimal changes, and load the pretrained weights. I'll tackle the scheduler and the whole end-to-end pipeline at a later stage.

As far as I can see, the model has the following parts:

  1. text_encoder - a plain XLMRoberta model; can be used as-is from HF
  2. prior - uses a PriorTransformer like DALL-E 2; a PriorTransformer is already present in diffusers, so I'll try to use it
  3. clip - a CLIP model, already implemented in transformers as CLIPTextModel; Kandinsky uses the OpenAI implementation (https://github.com/openai/CLIP), so I'll need to check whether the weights of the version they used can be loaded into the HF model directly
  4. model - a UNet-based latent diffusion model, already implemented in diffusers
  5. image_encoder/decoder - MOVQ, which is a VQ-VAE with a different normalization; I might need to change some of the existing VQ-VAE code in diffusers for this
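For reference, parts 1 and 3 already map directly onto existing transformers classes. A quick loading sketch; the checkpoint names below are common public weights used as stand-ins, not the exact Kandinsky checkpoints:

```python
from transformers import CLIPTextModelWithProjection, XLMRobertaModel

# Parts 1 and 3 of the list above: both encoders already exist in transformers.
# Checkpoint names are public stand-ins, not the exact Kandinsky weights.
text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")  # 1. text_encoder
clip_text = CLIPTextModelWithProjection.from_pretrained(            # 3. clip (text side)
    "openai/clip-vit-large-patch14"
)
```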

I had a question, being new to the diffusers philosophy: how do I add new models that don't directly use the existing building blocks in models/? Take the image decoder, which requires some changes to the vanilla VQ-VAE: should I add a new model file in models/, or copy the VQ-VAE code into the pipeline code and make the additional changes there? Should the bulk of the architecture changes live in the pipeline code itself, or in a single new file in models/? My guess is that adding it to the pipeline makes more sense, keeping models/ as general as possible, but I would love to hear opinions on this.

sayakpaul commented 1 year ago

Very well laid points, @ayushtues!

> Take the image decoder, which requires some changes to the vanilla VQ-VAE: should I add a new model file in models/, or copy the VQ-VAE code into the pipeline code and make the additional changes there?

If the changes are not major, we just parameterize the model utilities to accommodate them. See https://github.com/huggingface/diffusers/pull/2407.
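To make "parameterize" concrete, here's an illustrative pattern (not the actual diffusers change): the layer that differs between variants is selected through a constructor argument, so one model class covers both.

```python
import torch.nn as nn

# Illustrative pattern only: select the normalization variant via an argument
# instead of copying the whole VQ-VAE into the pipeline.
def make_norm(norm_type: str, channels: int) -> nn.Module:
    if norm_type == "group":
        # vanilla VQ-VAE normalization
        return nn.GroupNorm(num_groups=32, num_channels=channels)
    if norm_type == "spatial":
        # MOVQ's spatially conditioned norm would slot in here (sketched later
        # in this thread); the name "spatial" is an assumption for this sketch
        raise NotImplementedError
    raise ValueError(f"unknown norm_type: {norm_type!r}")
```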

But for major changes, what you said here makes a lot of sense:

> Should the bulk of the architecture changes live in the pipeline code itself, or in a single new file in models/? My guess is that adding it to the pipeline makes more sense, keeping models/ as general as possible, but I would love to hear opinions on this.

yiyixuxu commented 1 year ago

@ayushtues

In terms of model components, I think only the MOVQ decoder is new and needs to be added - the remaining components can be used either out of the box or with small modifications.

So maybe the best strategy is to get the decoder working first?
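For concreteness, the "different normalization" that sets MOVQ apart is a spatially conditioned group norm: the scale and shift are predicted from the quantized latents rather than learned as fixed affine parameters. A minimal sketch, with layer and argument names assumed rather than taken from a final diffusers API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialNorm(nn.Module):
    """Sketch of MOVQ's conditional norm (names are assumptions, not a final API)."""

    def __init__(self, f_channels: int, zq_channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups=32, num_channels=f_channels, eps=1e-6)
        self.conv_y = nn.Conv2d(zq_channels, f_channels, kernel_size=1)  # scale
        self.conv_b = nn.Conv2d(zq_channels, f_channels, kernel_size=1)  # shift

    def forward(self, f: torch.Tensor, zq: torch.Tensor) -> torch.Tensor:
        # match the quantized latents to the feature map's spatial size, then
        # modulate the normalized features with the predicted scale and shift
        zq = F.interpolate(zq, size=f.shape[-2:], mode="nearest")
        return self.norm(f) * self.conv_y(zq) + self.conv_b(zq)
```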

infamix commented 1 year ago

Converting the prior should be relatively easy; I did most of the job a few days back. The only problem is that I couldn't figure out how some of the layers correspond to one another. The UNet should also be easy.
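For anyone picking this up, checkpoint conversion usually comes down to remapping state-dict keys; the hard part is exactly the layer correspondence mentioned above. A hedged sketch of the pattern (the key names here are made up for illustration and have to be read off the two real state dicts):

```python
import torch

# Hypothetical old-name -> new-name substrings; the real pairs must be
# determined by inspecting both state dicts side by side.
RENAME = {
    "transformer.resblocks.": "transformer_blocks.",
    "ln_final.": "norm_out.",
}

def convert_state_dict(old_sd: dict) -> dict:
    new_sd = {}
    for key, value in old_sd.items():
        for src, dst in RENAME.items():
            key = key.replace(src, dst)
        new_sd[key] = value
    return new_sd
```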

yiyixuxu commented 1 year ago

I started a PR here: https://github.com/huggingface/diffusers/pull/3308

@ayushtues If you're still interested, I would love to work with you on that

ayushtues commented 1 year ago

@yiyixuxu yes sure, would love to work with you on this!

yiyixuxu commented 1 year ago

@ayushtues awesome! Do you want to start on the decoder? It's a little less intertwined with the rest of the model, so it can be a separate PR.

We already have an implementation in MUSE with the same code style as diffusers, but it needs to be added to diffusers using as many existing components as possible: https://github.com/huggingface/open-muse/blob/main/muse/modeling_movq.py

ayushtues commented 1 year ago

Sure thing @yiyixuxu! I'll branch off from your PR and start working on it there.