huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Feature Request]: Pipeline for Expressive Text-to-Image Generation with Rich Text #5404

Closed soumik12345 closed 11 months ago

soumik12345 commented 1 year ago

Is your feature request related to a problem? Please describe.

Even though plain text is currently the most dominant interface for text-to-image synthesis, its limited customization options hinder users from accurately describing desired outputs. Plain text makes it hard to specify continuous quantities, such as the precise RGB color value or the importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret.

Describe the solution you'd like

A diffusion pipeline for the region-based diffusion process proposed in the paper Expressive Text-to-Image Generation with Rich Text, which enables the generation of accurate and complex images by accepting prompts from a rich-text editor supporting formats such as font style, size, color, and footnotes.

  1. The plain text prompt is first input to the diffusion model to collect the self-attention and cross-attention maps. Attention maps are averaged across different heads, layers, and time steps.
  2. The self-attention maps are then used to create a segmentation via spectral clustering, and the cross-attention maps label each segment.
  3. In the original implementation, rich-text prompts obtained from the editor are stored in JSON format, providing attributes for each token span.
  4. According to the attributes of each token, the corresponding controls are applied as a denoising prompt or guidance on the regions indicated by the token maps. The structure and background are preserved from the plain-text generation by injecting features or blending the noised samples.
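Steps 1–2 above could be sketched roughly as follows. This is a minimal illustrative approximation using plain NumPy on toy shapes (averaging the collected maps, then a two-way spectral partition from the Fiedler vector), not the paper's actual implementation:

```python
import numpy as np

def segment_from_self_attention(self_attn_maps):
    """Split spatial tokens into two segments from averaged self-attention.

    self_attn_maps: list of (N, N) arrays collected across heads/layers/steps.
    """
    # Step 1: average the collected maps into a single (N, N) affinity matrix.
    A = np.mean(np.stack(self_attn_maps), axis=0)
    A = 0.5 * (A + A.T)  # symmetrize so it is a valid graph affinity
    # Normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Step 2: the sign of the Fiedler vector (eigenvector of the
    # second-smallest eigenvalue) yields a two-way spectral partition.
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)
```

The real pipeline would cluster into more than two segments and then use the cross-attention maps to assign a token label to each segment, but the affinity-averaging and spectral-partitioning idea is the same.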

Describe alternatives you've considered

While rich-text input has a lot of advantages over plain text, it is quite difficult to use without a GUI. To address this, the region-based diffusion pipeline proposed in this issue could also support additional interface modes, such as HTML- or Markdown-formatted prompts that are automatically parsed into the JSON format proposed by the paper's authors.

The Proposed API

from diffusers import RegionDiffusionPipeline
from diffusers.utils import encode_html_to_json, encode_markdown_to_json

# Load the proposed region-based diffusion pipeline.
pipe = RegionDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

# An HTML-formatted rich-text prompt: the span's style carries the token's color attribute.
prompt = "a <span style=\"color:red\">church</span> with beautiful landscape"
encoded_prompt = encode_html_to_json(prompt)

images = pipe(encoded_prompt, **kwargs).images
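The proposed encode_html_to_json helper could be sketched with only the standard library's html.parser. The function name and the JSON schema below are assumptions for illustration, not the paper's actual token-span format:

```python
import json
from html.parser import HTMLParser

class RichPromptParser(HTMLParser):
    """Collects (text, style) spans from an HTML-formatted prompt."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._style_stack = []

    def handle_starttag(self, tag, attrs):
        # Track the style attribute of each opened <span>.
        if tag == "span":
            self._style_stack.append(dict(attrs).get("style", ""))

    def handle_endtag(self, tag):
        if tag == "span" and self._style_stack:
            self._style_stack.pop()

    def handle_data(self, data):
        # Record each run of text with the innermost active style.
        text = data.strip()
        if text:
            style = self._style_stack[-1] if self._style_stack else ""
            self.spans.append({"text": text, "style": style})

def encode_html_to_json(prompt):
    """Parse a styled HTML prompt into a JSON string of token spans."""
    parser = RichPromptParser()
    parser.feed(prompt)
    return json.dumps({"spans": parser.spans})
```

A Markdown variant (encode_markdown_to_json) could follow the same shape, mapping emphasis markers to span attributes before serializing.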

Additional context

sayakpaul commented 1 year ago

Aware of rich-text diffusion and how impactful the work is! Thanks for bringing it up.

Since the authors themselves implemented it in a diffusers-friendly fashion here (https://github.com/songweige/rich-text-to-image), we would rather redirect users to their repository to honor their contributions.

Thoughts @patrickvonplaten?

Cc: @apolinario

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.