OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for the diffusion task. It appears to have next-level multi-modal capability: it can take images as inputs, refer to them in the prompt text, and compose them into the generated image in a flexible way.
It's a capable model, but the architecture is surprisingly simple. Besides the regular patchify-ing as in DiT and the use of an SDXL VAE, it looks like it only adds a few layers on top of a standard Phi-3 model and changes the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls of Phi-3, DiT and the VAE.
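To make the "borrow from existing impls" point concrete, here's a rough candle-style sketch of the two non-Phi-3 pieces: a DiT-style patch embedding over the VAE latents, and the modified attention mask. The struct/function names are mine, and the mask layout (causal text prefix, bidirectional image/timestep suffix) is my reading of the paper, so treat this as a sketch rather than the actual wiring:

```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::{conv2d, Conv2d, Conv2dConfig, Module, VarBuilder};

/// DiT-style patch embedding: a strided conv turns (b, c, h, w)
/// VAE latents into a (b, n_patches, hidden) token sequence.
struct PatchEmbed {
    proj: Conv2d,
}

impl PatchEmbed {
    fn new(in_ch: usize, hidden: usize, patch: usize, vb: VarBuilder) -> Result<Self> {
        let cfg = Conv2dConfig { stride: patch, ..Default::default() };
        let proj = conv2d(in_ch, hidden, patch, cfg, vb.pp("proj"))?;
        Ok(Self { proj })
    }

    fn forward(&self, latents: &Tensor) -> Result<Tensor> {
        // (b, c, h, w) -> (b, hidden, h/p, w/p) -> (b, n_patches, hidden)
        self.proj.forward(latents)?.flatten_from(2)?.transpose(1, 2)
    }
}

/// The one change to the Phi-3 attention pattern (as I understand it):
/// text tokens stay causal, while the trailing block of image/timestep
/// tokens attends bidirectionally within itself. Returns a (seq, seq)
/// additive mask (0.0 = attend, -inf = masked).
fn omnigen_mask(text_len: usize, img_len: usize, dev: &Device) -> Result<Tensor> {
    let n = text_len + img_len;
    let mut m = vec![f32::NEG_INFINITY; n * n];
    for i in 0..n {
        for j in 0..n {
            let causal = j <= i;
            let both_image = i >= text_len && j >= text_len;
            if causal || both_image {
                m[i * n + j] = 0.0;
            }
        }
    }
    Tensor::from_vec(m, (n, n), dev)
}
```

Everything else (the Phi-3 backbone, the VAE encode/decode) would come from the existing candle implementations, with the backbone adapted to accept pre-computed embeddings plus this custom mask.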
@LaurentMazare I can give it a try. Let me know if you are already working on it :D.