
PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation

[Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)¹\*, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)²\*, Kun Wang³, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)¹,⁴, Hao Tian³, Xingyu Zeng³, Rui Zhao³, [Jifeng Dai](https://jifengdai.org/)⁴,⁵, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)¹ :envelope:, [Xihui Liu](https://xh-liu.github.io/)² :envelope:

¹CUHK MMLab, ²HKU MMLab, ³SenseTime, ⁴Shanghai AI Laboratory, ⁵Tsinghua University

\*Equal contribution, :envelope: Corresponding authors

:fire: We will release the code and models soon!

:new: Update

- 2024.10.18: The PUMA preprint is released on arXiv :fire:.
- 2024.10.17: The PUMA homepage is now available :fire:.

:hourglass: TODO

- [x] Update links to project page :link:
- [ ] Release visual encoder and decoder checkpoints :computer:
- [ ] Release MLLM backbone checkpoint :floppy_disk:

Abstract

PUMA introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels in a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for visual tasks.

Read the full paper here.

Framework

PUMA uses multi-granular visual representations as the unified inputs and outputs of the MLLM, allowing a single model to handle a variety of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.

Multi-granular Semantic Visual Decoding

PUMA's visual decoding process spans five granular image representations (f₀ to f₄) and corresponding decoders (D₀ to D₄), which are trained using SDXL. This allows PUMA to achieve precise image reconstruction and semantic-guided generation, supporting both control and diversity in image generation tasks.
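The README does not state how the five representations are derived from an image. As a rough illustration only, the sketch below assumes that each coarser level is obtained by average-pooling the previous one, so that f₀ is the most detailed and f₄ the most abstract. The 24×24 starting grid, the pooling factors, and the names `avg_pool` and `build_pyramid` are hypothetical choices made for this example and are not taken from PUMA's code.

```python
from typing import List, Sequence

# A single-channel H x W feature map, used here purely for illustration.
Grid = List[List[float]]


def avg_pool(grid: Grid, k: int) -> Grid:
    """Non-overlapping k x k average pooling; H and W must be divisible by k."""
    h, w = len(grid), len(grid[0])
    if h % k or w % k:
        raise ValueError("grid size must be divisible by the pooling factor")
    return [
        [
            sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]


def build_pyramid(f0: Grid, factors: Sequence[int] = (2, 2, 2, 3)) -> List[Grid]:
    """Return [f0, f1, f2, f3, f4]; each level is pooled from the previous one.

    The pooling factors are an assumption for this sketch, not PUMA's settings.
    """
    levels = [f0]
    for k in factors:
        levels.append(avg_pool(levels[-1], k))
    return levels


if __name__ == "__main__":
    side = 24  # hypothetical finest grid size
    f0 = [[float(i * side + j) for j in range(side)] for i in range(side)]
    for n, f in enumerate(build_pyramid(f0)):
        print(f"f{n}: {len(f)}x{len(f[0])} grid, {len(f) * len(f[0])} tokens")
```

Under these assumptions, a decoder paired with a fine level such as f₀ would receive enough detail for close reconstruction, while one paired with a coarse level such as f₄ would receive only a global summary, leaving more room for diverse outputs.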

Diverse Text-to-image Generation

Image Editing

Image Conditional Generation

Citation

If you find PUMA useful in your research, please consider citing us:

@article{fang2024puma,
  title     ={PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author    ={Rongyao Fang and Chengqi Duan and Kun Wang and Hao Li and Hao Tian and Xingyu Zeng and Rui Zhao and Jifeng Dai and Hongsheng Li and Xihui Liu},
  journal   ={arxiv},
  year      ={2024}
}

License

This project is released under the Apache 2.0 license.

Contact

If you have any questions, please feel free to contact rongyaofang@gmail.com.

Rongyao Fang anticipates graduating in 2025 and is open to both academic and industrial research positions. If you are interested, please get in touch at the address above.