The model handles multilingual prompts much better than previous models. It would be a super nice addition in case anybody wants to give it a try!
Can we start by just building it as a community pipeline? Or are you aiming for full integration right from the start?
Given the novelties of the model and also performance, we were hoping for a core integration.
I can try. I'll probably ask for some guidance on the design, as I've never done that kind of contribution to diffusers before, but I think it will be fun.
But hey, no time like the present, right?
We can surely help (also hopefully in a more responsive manner than the textual inversion PR :sweat_smile: )
Hey @sayakpaul @patrickvonplaten @piEsposito , I've been learning about diffusion models and implementing them from scratch here, and would like to help integrate this model into diffusers to learn more about these models, if possible!
Hey @ayushtues!
That's very cool, feel free to open a PR to add it. These docs should be helpful:
Is there any update on integrating Kandinsky?
@user425846 yet to start working on this, will begin soon
Very cool notion doc! Feel free to open a draft PR as soon as you have a part working, more than happy to guide you into the correct direction :-)
Thanks @patrickvonplaten! I'll first try to use existing diffusers/HF building blocks to build the Kandinsky model architecture with minimal changes and load the pretrained weights. I'll focus on the scheduler and the whole end-to-end pipeline at a later stage.
As far as I can see, the model has the following parts:
- `text_encoder` - a simple XLMRoberta model, can be used as such from HF
- `prior` - uses a PriorTransformer like DALL-E 2; I can see a PriorTransformer already present in diffusers, will try to use it
- `clip` - a CLIP model, already implemented in transformers as CLIPTextModel; Kandinsky uses the OpenAI implementation (https://github.com/openai/CLIP), so will need to see if the weights of the version they used can be loaded into the HF model directly
- `model` - a UNet-based latent diffusion model, already implemented in diffusers
- `image_encoder/decoder` - MOVQ, which is a VQ-VAE with a different normalization; might need to change some of the existing VQ-VAE code in diffusers for this (a component sketch follows this list)

I had a question, being new to the diffusers philosophy: how do I add new models which don't directly use the existing building blocks present in models/? For example, the image decoder requires some changes to the vanilla VQ-VAE. Should I add a new model file in models/, or just copy the VQ-VAE code into the pipeline code and make the additional changes there?
Should the bulk of the architecture changes be in the pipeline code itself, or in a single new file in /models/? My guess is adding it to the pipeline makes more sense, keeping /models/ as general as possible, but I would love to hear opinions on this.
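To make the mapping concrete, here is a minimal sketch of how those parts could be pulled from existing classes. The checkpoint names and config values are illustrative assumptions, not the actual Kandinsky 2.1 configs:

```python
from transformers import CLIPTextModel, XLMRobertaModel
from diffusers import PriorTransformer, UNet2DConditionModel

# Checkpoint names below are illustrative assumptions, not the actual
# Kandinsky 2.1 weights (those would need a conversion script first).
text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large")
clip_text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# diffusers' PriorTransformer defaults already follow a DALL-E 2-style
# prior layout; the exact Kandinsky config still needs verification.
prior = PriorTransformer()

# Placeholder UNet config; the real Kandinsky UNet also conditions on
# the CLIP image embedding, so its config will differ.
unet = UNet2DConditionModel(sample_size=64, in_channels=4, out_channels=4)
```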
Very well laid-out points, @ayushtues!
> For example, the image decoder requires some changes to the vanilla VQ-VAE. Should I add a new model file in models/, or just copy the VQ-VAE code into the pipeline code and make the additional changes there?
If the changes are not major, we just parameterize the model utilities to accommodate them. See https://github.com/huggingface/diffusers/pull/2407.
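As a generic sketch of that parameterization pattern (made-up names like `make_norm` and `DecoderBlock`, not the actual diffusers code):

```python
import torch.nn as nn

def make_norm(channels: int, norm_type: str = "group") -> nn.Module:
    # Parameterize the normalization so one block definition can serve
    # several model variants instead of forking the whole model file.
    if norm_type == "group":
        return nn.GroupNorm(num_groups=32, num_channels=channels, eps=1e-6)
    if norm_type == "layer":
        # LayerNorm-like behaviour via a single group over all channels.
        return nn.GroupNorm(num_groups=1, num_channels=channels, eps=1e-6)
    raise ValueError(f"unknown norm_type: {norm_type}")

class DecoderBlock(nn.Module):
    def __init__(self, channels: int, norm_type: str = "group"):
        super().__init__()
        self.norm = make_norm(channels, norm_type)
        self.act = nn.SiLU()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv(self.act(self.norm(x)))
```

MOVQ's conditional norm also takes the quantized latent as a second input, though, which changes the block's forward signature; that pushes it toward the major-change case below.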
But for major changes, what you said here makes a lot of sense:
> Should the bulk of the architecture changes be in the pipeline code itself, or in a single new file in /models/? My guess is adding it to the pipeline makes more sense, keeping /models/ as general as possible, but I would love to hear opinions on this.
@ayushtues
In terms of model components, I think only the MOVQ decoder is new and needs to be added; the remaining components can be used either out of the box or with small modifications.
So maybe the best strategy is to get the decoder working first?
Converting the prior should be relatively easy; I did most of the work a few days back. The only problem is that I couldn't figure out how some of the layers correspond to one another. The UNet should also be easy.
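One low-tech way to pin down those correspondences is to bucket both state dicts by parameter shape; keys that are alone in their bucket on each side can only match each other. A sketch (the checkpoint path is a placeholder, and the raw file may need unwrapping into a flat state dict first):

```python
from collections import defaultdict

import torch
from diffusers import PriorTransformer

# Placeholder path for the original Kandinsky prior checkpoint.
src = torch.load("kandinsky_prior.ckpt", map_location="cpu")
dst = PriorTransformer().state_dict()

# Bucket both state dicts by parameter shape.
by_shape = defaultdict(lambda: {"src": [], "dst": []})
for name, tensor in src.items():
    by_shape[tuple(tensor.shape)]["src"].append(name)
for name, tensor in dst.items():
    by_shape[tuple(tensor.shape)]["dst"].append(name)

# Unique shapes are unambiguous matches; the rest (e.g. the many
# identically shaped attention blocks) usually follow block order.
for shape, names in sorted(by_shape.items()):
    print(shape, names["src"], "->", names["dst"])
```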
I started a PR here https://github.com/huggingface/diffusers/pull/3308
@ayushtues If you're still interested, I would love to work with you on that
@yiyixuxu yes sure, would love to work with you on this!
@ayushtues awesome! Do you want to start on the decoder? It's a little less intertwined with the rest of the model, so it can be a separate PR.
We already have an implementation in MUSE with the same code style as diffusers, but we need to add it to diffusers using as many existing components as possible: https://github.com/huggingface/open-muse/blob/main/muse/modeling_movq.py
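For anyone following along, the distinguishing piece of MOVQ relative to a vanilla VQ-VAE decoder is the spatially conditioned normalization. Roughly, as a paraphrase of the linked modeling_movq.py (a sketch, not a verified drop-in):

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatialNorm(nn.Module):
    """MOVQ-style norm: group-normalize the decoder feature map f, then
    modulate it with a scale and shift predicted from the quantized
    latent zq, resized to f's spatial resolution."""

    def __init__(self, f_channels: int, zq_channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups=32, num_channels=f_channels, eps=1e-6)
        self.conv_y = nn.Conv2d(zq_channels, f_channels, kernel_size=1)
        self.conv_b = nn.Conv2d(zq_channels, f_channels, kernel_size=1)

    def forward(self, f, zq):
        zq = F.interpolate(zq, size=f.shape[-2:], mode="nearest")
        return self.norm(f) * self.conv_y(zq) + self.conv_b(zq)
```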
Sure thing @yiyixuxu! I'll branch out from your PR and start working on it there
Model/Pipeline/Scheduler description
Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion, while introducing some new ideas.
As its text and image encoder it uses the CLIP model, with a diffusion image prior (mapping) between the latent spaces of the CLIP modalities. This approach improves the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.
For the diffusion mapping of latent spaces we use a transformer with num_layers=20, num_heads=32 and hidden_size=2048.
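For reference, those numbers line up with the arguments of diffusers' PriorTransformer (hidden size 2048 = 32 heads × 64 dims per head); a sketch, where the embedding dimensions are assumptions based on CLIP ViT-L/14 and should be verified against the released checkpoint:

```python
from diffusers import PriorTransformer

# 32 heads x 64 dims per head = the 2048 hidden size mentioned above.
# embedding_dim / num_embeddings are assumptions based on CLIP ViT-L/14
# (77 text tokens, 768-dim embeddings); verify against the checkpoint.
prior = PriorTransformer(
    num_attention_heads=32,
    attention_head_dim=64,
    num_layers=20,
    embedding_dim=768,
    num_embeddings=77,
)
```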
Open source status
Provide useful links for the implementation
GitHub: https://github.com/ai-forever/Kandinsky-2
@cene555 @razzant