soumik12345 closed this issue 11 months ago
Aware of rich-text-to-image diffusion and how impactful the work is! Thanks for bringing it up.
Since the author themselves implemented it in a diffusers-friendly fashion here (https://github.com/songweige/rich-text-to-image), we would rather redirect users to this repository to honor their contributions.
Thoughts @patrickvonplaten?
Cc: @apolinario
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is your feature request related to a problem? Please describe.
Even though plain text is currently the most dominant interface for text-to-image synthesis, its limited customization options hinder users from accurately describing desired outputs. Plain text makes it hard to specify continuous quantities, such as a precise RGB color value or the importance of each word. Furthermore, detailed text prompts for complex scenes are tedious for humans to write and challenging for text encoders to interpret.
Describe the solution you'd like
A diffusion pipeline implementing the region-based diffusion process proposed in the paper Expressive Text-to-Image Generation with Rich Text, enabling accurate and complex image generation by accepting prompts authored in a rich-text editor that supports attributes such as font style, size, color, and footnotes.
Describe alternatives you've considered
While rich-text input has many advantages over plain text, it is difficult to use without a GUI. To address this, the region-based diffusion pipeline proposed here could also support additional input formats such as HTML- or Markdown-formatted prompts, which could be automatically parsed into the JSON format proposed by the authors of the paper (see the sketch below).
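As a rough illustration of the alternative interface, here is a minimal sketch of parsing an HTML-formatted prompt into a Quill-Delta-style JSON structure similar to the one used in the authors' rich-text-to-image demo. The exact JSON schema expected by the pipeline, and the convention of carrying a footnote in a `title` attribute, are assumptions for illustration only.

```python
import json
from html.parser import HTMLParser


class RichPromptHTMLParser(HTMLParser):
    """Collects text runs and their rich-text attributes (e.g. color) from a
    simple HTML prompt such as:
    'a Gothic <span style="color:#b26b00">church</span> in the sunset'."""

    def __init__(self):
        super().__init__()
        self.ops = []
        self._attrs_stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        run_attrs = {}
        # Pull supported rich-text attributes out of the inline style.
        for decl in attrs.get("style", "").split(";"):
            if ":" in decl:
                key, value = (part.strip() for part in decl.split(":", 1))
                if key == "color":
                    run_attrs["color"] = value
                elif key == "font-family":
                    run_attrs["font"] = value
        # Hypothetical convention: a footnote is carried in a `title` attribute.
        if "title" in attrs:
            run_attrs["link"] = attrs["title"]
        self._attrs_stack.append(run_attrs)

    def handle_endtag(self, tag):
        if self._attrs_stack:
            self._attrs_stack.pop()

    def handle_data(self, data):
        if not data.strip():
            return
        op = {"insert": data}
        # Apply the attributes of the innermost enclosing tag, if any.
        if self._attrs_stack and self._attrs_stack[-1]:
            op["attributes"] = self._attrs_stack[-1]
        self.ops.append(op)


def html_prompt_to_rich_text_json(html_prompt: str) -> str:
    parser = RichPromptHTMLParser()
    parser.feed(html_prompt)
    return json.dumps({"ops": parser.ops})


if __name__ == "__main__":
    prompt = 'a Gothic <span style="color:#b26b00">church</span> in the sunset'
    print(html_prompt_to_rich_text_json(prompt))
    # {"ops": [{"insert": "a Gothic "},
    #          {"insert": "church", "attributes": {"color": "#b26b00"}},
    #          {"insert": " in the sunset"}]}
```

A Markdown front end could be handled the same way by converting Markdown to HTML first and reusing the parser above.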
The Proposed API
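A rough, non-binding sketch of what the user-facing API could look like, assuming the pipeline ships as a community pipeline; the pipeline identifier `rich_text_stable_diffusion` and the call signature are placeholders, not an existing diffusers API.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical community-pipeline id; the actual name would be decided
# during implementation.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="rich_text_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")

# Rich-text prompt serialized as Quill-Delta-style JSON, as in the
# authors' demo.
rich_text_prompt = (
    '{"ops": ['
    '{"insert": "a Gothic "}, '
    '{"attributes": {"color": "#b26b00"}, "insert": "church"}, '
    '{"insert": " in the sunset with a beautiful landscape in the background.\\n"}'
    ']}'
)

image = pipe(
    rich_text_prompt,
    guidance_scale=8.5,       # illustrative values only
    num_inference_steps=50,
).images[0]
image.save("rich_text_church.png")
```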
Additional context