OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for the diffusion task. It appears to have next-level multi-modal capability: it can take images as inputs, refer to them in the prompt text, and compose them into the generated image in a flexible way.
It's a capable model, but the architecture is surprisingly simple. Besides the regular patchify-ing as in DiT and the use of an SDXL VAE, it looks like it only adds a few layers on top of a standard Phi-3 model and changes the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls of Phi-3, DiT and the VAE.
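To make the "borrow from existing impls" point concrete, here's a rough candle-style sketch of the two non-Phi-3 pieces: a DiT-style patch embedding over the VAE latents, and the modified attention mask. The struct/function names are mine, and the mask layout (causal text prefix, bidirectional image/timestep suffix) is my reading of the paper, so treat this as a sketch rather than the actual wiring:

```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::{conv2d, Conv2d, Conv2dConfig, Module, VarBuilder};

/// DiT-style patch embedding: a strided conv turns (b, c, h, w)
/// VAE latents into a (b, n_patches, hidden) token sequence.
struct PatchEmbed {
    proj: Conv2d,
}

impl PatchEmbed {
    fn new(in_ch: usize, hidden: usize, patch: usize, vb: VarBuilder) -> Result<Self> {
        let cfg = Conv2dConfig { stride: patch, ..Default::default() };
        let proj = conv2d(in_ch, hidden, patch, cfg, vb.pp("proj"))?;
        Ok(Self { proj })
    }

    fn forward(&self, latents: &Tensor) -> Result<Tensor> {
        // (b, c, h, w) -> (b, hidden, h/p, w/p) -> (b, n_patches, hidden)
        self.proj.forward(latents)?.flatten_from(2)?.transpose(1, 2)
    }
}

/// The one change to the Phi-3 attention pattern (as I understand it):
/// text tokens stay causal, while the trailing block of image/timestep
/// tokens attends bidirectionally within itself. Returns a (seq, seq)
/// additive mask (0.0 = attend, -inf = masked).
fn omnigen_mask(text_len: usize, img_len: usize, dev: &Device) -> Result<Tensor> {
    let n = text_len + img_len;
    let mut m = vec![f32::NEG_INFINITY; n * n];
    for i in 0..n {
        for j in 0..n {
            let causal = j <= i;
            let both_image = i >= text_len && j >= text_len;
            if causal || both_image {
                m[i * n + j] = 0.0;
            }
        }
    }
    Tensor::from_vec(m, (n, n), dev)
}
```

Everything else (the Phi-3 backbone, the VAE encode/decode) would come from the existing candle implementations, with the backbone adapted to accept pre-computed embeddings plus this custom mask.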
@LaurentMazare I can give it a try. Let me know if you are already working on it :D.