huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Model support: OmniGen #2593

Open Czxck001 opened 2 weeks ago

Czxck001 commented 2 weeks ago

OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for a diffusion task. It appears to have next-level multi-modal capability: it can take images as inputs, refer to those images in the prompt text, and compose them into the generated image in a flexible way.

It's a capable model, but the architecture is surprisingly simple. Besides the usual patchifying as in DiT and the use of an SDXL VAE, it looks like it only adds a few layers on top of a standard Phi-3 model and changes the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls for Phi-3, DiT and the VAE.
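To make the patchify step concrete, here is a minimal sketch in plain Rust (std only, no candle types). It is not taken from the OmniGen code: the function name `patchify`, the channel-major `(c, h, w)` layout, and the `(dy, dx, channel)` ordering inside each token are assumptions made for illustration. A real implementation would operate on candle tensors and follow each token with a learned projection.

```rust
/// Illustrative sketch only (hypothetical helper, not OmniGen's or candle's API).
/// Splits a channel-major latent of shape (c, h, w) into non-overlapping
/// p x p patches and returns (h / p) * (w / p) tokens, each of length p * p * c.
fn patchify(latent: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<Vec<f32>> {
    assert_eq!(latent.len(), c * h * w, "latent length must equal c * h * w");
    assert!(p > 0 && h % p == 0 && w % p == 0, "h and w must be divisible by p");

    // Number of patches along the height and width.
    let (grid_h, grid_w) = (h / p, w / p);
    let mut tokens = Vec::with_capacity(grid_h * grid_w);

    // Walk the patch grid row by row, so tokens come out in raster order.
    for py in 0..grid_h {
        for px in 0..grid_w {
            let mut token = Vec::with_capacity(p * p * c);
            // Assumed ordering inside a token: row, then column, then channel.
            for dy in 0..p {
                for dx in 0..p {
                    for ch in 0..c {
                        let (y, x) = (py * p + dy, px * p + dx);
                        token.push(latent[ch * h * w + y * w + x]);
                    }
                }
            }
            tokens.push(token);
        }
    }
    tokens
}
```

For example, a 4-channel 4x4 latent with `p = 2` gives 4 tokens of 16 values each. The ordering of values inside a token is only a convention here; what matters is that it matches whatever the patch projection weights expect.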

@LaurentMazare I can give it a try. Let me know if you are already working on it :D.

LaurentMazare commented 2 weeks ago

Sounds like a nice model to support, feel free to take a stab at it.