v0xie / sd-webui-incantations

Enhance Stable Diffusion image quality, prompt following, and more through multiple implementations of novel algorithms for Automatic1111 WebUI.
GNU General Public License v3.0
extension stable-diffusion-webui stable-diffusion-webui-plugin

sd-webui-incantations


What is this?

This extension for AUTOMATIC1111/stable-diffusion-webui implements algorithms from state-of-the-art research to achieve higher-quality images with more accurate prompt adherence.

All methods are training-free and rely only on modifying the text embeddings or attention maps.

Installation

To install the sd-webui-incantations extension, follow these steps:

  1. Ensure you have Automatic1111 stable-diffusion-webui version 1.9.3 or later installed.

  2. Open the "Extensions" tab and navigate to the "Install from URL" section.

  3. Paste the repository URL into the "URL for extension's git repository" field:

    https://github.com/v0xie/sd-webui-incantations.git

  4. Press the Install button, then wait a few seconds for the extension to finish installing.

  5. Restart the Web UI: Completely restart your Stable Diffusion Web UI to load the new extension.

Compatibility Notice

News

Extension Features


Semantic CFG (S-CFG)

https://arxiv.org/abs/2404.05384
Dynamically rescale CFG guidance per semantic region to a uniform level to improve image / text alignment.
Very computationally expensive: A batch size of 4 with 1024x1024 will max out a 24GB 4090.
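The idea of bringing guidance in each semantic region to a uniform level can be illustrated with a small sketch. This is not the paper's algorithm or the extension's code: it assumes the regions are already known (`region_ids`), uses the global mean guidance magnitude as the "uniform level", and treats `min_rate` / `max_rate` as clamps on the per-region rescaling factor. All of these are assumptions made for illustration.

```python
def rescale_per_region(delta, region_ids, min_rate, max_rate):
    """Rescale guidance so each region has a similar mean magnitude.

    delta      -- guidance values (cond minus uncond), one per latent position
    region_ids -- hypothetical integer region label for each position
    min_rate, max_rate -- clamps on the per-region rescaling factor
    """
    sums, counts = {}, {}
    for d, r in zip(delta, region_ids):
        sums[r] = sums.get(r, 0.0) + abs(d)
        counts[r] = counts.get(r, 0) + 1

    # Assumption: the "uniform level" is the global mean magnitude.
    target = sum(abs(d) for d in delta) / len(delta)

    rates = {}
    for r in sums:
        region_mean = sums[r] / counts[r]
        rate = target / region_mean if region_mean > 0 else 1.0
        rates[r] = min(max(rate, min_rate), max_rate)

    return [d * rates[r] for d, r in zip(delta, region_ids)]
```

With two regions whose guidance magnitudes are 1.0 and 4.0, both are pulled to the shared mean of 2.5 unless the clamps prevent it.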

Controls

Results

Prompt: "A cute puppy on the moon", Min Rate: 0.5, Max Rate: 10.0

Also check out the paper authors' official project repository:


Perturbed Attention Guidance

https://arxiv.org/abs/2403.17377
An alternative/complementary method to CFG (Classifier-Free Guidance) that increases sampling quality.
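How PAG can sit alongside CFG is easiest to see as a formula over noise predictions. The sketch below assumes the commonly cited formulation, in which the PAG term pushes the result away from a prediction made with perturbed self-attention. The function name and the flat-list representation are illustrative and not taken from the extension's code.

```python
def combine_guidance(eps_uncond, eps_cond, eps_perturbed, cfg_scale, pag_scale):
    """Combine CFG and PAG terms element-wise.

    eps_uncond    -- prediction without the prompt
    eps_cond      -- prediction with the prompt
    eps_perturbed -- prediction with the prompt and perturbed self-attention
    """
    return [
        u + cfg_scale * (c - u) + pag_scale * (c - p)
        for u, c, p in zip(eps_uncond, eps_cond, eps_perturbed)
    ]
```

Setting `pag_scale` to 0 reduces this to plain CFG, and setting `cfg_scale` to 1 leaves only the PAG term on top of the conditional prediction.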

Update: 20-05-2024

Implemented a new feature called "Saliency-Adaptive Noise Fusion" derived from "High-fidelity Person-centric Subject-to-Image Synthesis".

This feature combines the guidance from PAG and CFG in an adaptive way that improves image quality especially at higher guidance scales.

Check out the paper authors' project repository here: https://github.com/CodeGoat24/Face-diffuser

Controls

Results

Prompt: "a puppy and a kitten on the moon"

Also check out the paper authors' official project page:



CFG Interval / CFG Scheduler

https://arxiv.org/abs/2404.07724 and https://arxiv.org/abs/2404.13040

Constrains the usage of CFG to within a specified noise interval. Allows usage of high CFG levels (>15) without drastic alteration of composition.
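Restricting CFG to a noise interval amounts to choosing the guidance scale per sampling step. The following is a minimal sketch under the assumption that the interval is given as a pair of noise levels (`sigma_lo`, `sigma_hi`) and that "no CFG" means falling back to the conditional prediction. The names are hypothetical and do not come from the extension.

```python
def cfg_scale_for_step(sigma, cfg_scale, sigma_lo, sigma_hi):
    """Return the guidance scale to use at noise level `sigma`."""
    if sigma_lo <= sigma <= sigma_hi:
        return cfg_scale
    # A scale of 1.0 makes the guided result equal the conditional prediction.
    return 1.0


def guided_noise(eps_uncond, eps_cond, scale):
    """Standard CFG combination, element-wise over flat lists."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because guidance is skipped at the noisiest steps, where the overall composition is decided, a high scale inside the interval has less opportunity to alter the layout.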

Adds controllable CFG schedules. For Clamp-Linear, use (c=2.0) for SD1.5 and (c=4.0) for SDXL. For PCS, use (s=1.0) for SD1.5 and (s=0.1) for SDXL.

To use CFG Scheduler, PAG Active must be set to True. PAG scale can be set to 0.
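The recommended values above can be collected in a small lookup. This is only a convenience sketch: the dictionary keys, the function name, and the `pag_active` / `pag_scale` setting names are hypothetical and do not correspond to the extension's actual option names.

```python
# Recommended schedule parameters as stated in this README.
RECOMMENDED = {
    ("clamp-linear", "sd1.5"): ("c", 2.0),
    ("clamp-linear", "sdxl"): ("c", 4.0),
    ("pcs", "sd1.5"): ("s", 1.0),
    ("pcs", "sdxl"): ("s", 0.1),
}


def scheduler_settings(schedule, model):
    """Return illustrative settings for a schedule and model family."""
    key = (schedule.lower(), model.lower())
    if key not in RECOMMENDED:
        raise ValueError(f"no recommendation for {schedule!r} on {model!r}")
    name, value = RECOMMENDED[key]
    # PAG must be active for the scheduler to run; its scale may be 0.
    return {"pag_active": True, "pag_scale": 0.0, name: value}
```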

Controls

Results

CFG Interval

Prompt: "A pointillist painting of a raccoon looking at the sea."

CFG Schedule

Prompt: "An epic lithograph of a handsome salaryman carefully pouring coffee from a cup into an overflowing carafe, 4K, directed by Wong Kar Wai"



Multi-Concept T2I-Zero / Attention Regulation

Update: 29-04-2024

The algorithms previously implemented for T2I-Zero were incorrect. They should be working much more stably now. See the previous result in the 'images' folder for an informal comparison between old and new.

Implements Corrections by Similarities and Cross-Token Non-Maximum Suppression from https://arxiv.org/abs/2310.07419

Also implements some methods from "Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models" https://arxiv.org/abs/2403.06381

Corrections by Similarities

Reduces the influence of tokens on far-away or conceptually unrelated tokens.

Cross-Token Non-Maximum Suppression

Attempts to reduce the mixing of features of unrelated concepts.
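The suppression idea can be sketched on toy attention maps. This is a simplified illustration that assumes hard suppression: at each spatial position, only the token with the highest attention keeps its value. The paper and the extension may suppress more softly or apply it only to selected tokens, and the function name here is hypothetical.

```python
def cross_token_nms(attn_maps):
    """Keep, per position, only the strongest token's attention.

    attn_maps -- list of per-token maps, each a flat list of values
                 with one entry per spatial position
    """
    n_tokens = len(attn_maps)
    n_positions = len(attn_maps[0])
    out = [[0.0] * n_positions for _ in range(n_tokens)]
    for p in range(n_positions):
        # Index of the token with the highest attention at position p.
        winner = max(range(n_tokens), key=lambda t: attn_maps[t][p])
        out[winner][p] = attn_maps[winner][p]
    return out
```

For two tokens with maps `[0.9, 0.2]` and `[0.1, 0.7]`, the first token keeps position 0 and the second keeps position 1, so their features no longer overlap.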

Controls:

Known Issues:

Can error out with image dimensions that are not a multiple of 64.
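One way to avoid this issue is to snap the requested width and height to a multiple of 64 before generating. The helper below is a hypothetical utility, not part of the extension, and it rounds up rather than to the nearest multiple.

```python
def round_up_to_multiple(value, multiple=64):
    """Round a positive dimension up to the next multiple of `multiple`."""
    if value <= 0:
        raise ValueError("dimension must be positive")
    return ((value + multiple - 1) // multiple) * multiple
```

For example, a requested width of 1000 becomes 1024, while 832 is already a multiple of 64 and is left unchanged.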

Results:

Prompt: "A photo of a lion and a grizzly bear and a tiger in the woods"
SD XL

Also check out the paper authors' official project pages:



Seek for Incantations

An incomplete implementation of a "prompt-upsampling" method from https://arxiv.org/abs/2401.06345
Generates an image following the prompt, then uses CLIP text/image similarity to add on to the prompt and generate a new image.

Controls:

For example, suppose the prompt is "a blue dog", the delimiter is "BREAK", and the word replacement is "-". If the similarity of the word "blue" to the generated image is below gamma, the new prompt will be "a blue dog BREAK a - dog".
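The prompt construction in this example can be sketched as follows. This is a minimal illustration assuming the per-word similarity scores are already available as a dictionary. The function name is hypothetical, and returning the prompt unchanged when no word falls below gamma is an assumption rather than documented behavior.

```python
def build_upsampled_prompt(prompt, similarities, gamma,
                           delimiter="BREAK", replacement="-"):
    """Append a copy of the prompt with low-similarity words replaced.

    similarities -- hypothetical mapping from word to its CLIP
                    text/image similarity for the generated image
    """
    words = prompt.split()
    # Words without a score are left unchanged.
    masked = [
        replacement if similarities.get(w, float("inf")) < gamma else w
        for w in words
    ]
    if masked == words:
        return prompt  # assumption: nothing to add
    return f"{prompt} {delimiter} {' '.join(masked)}"


new_prompt = build_upsampled_prompt(
    "a blue dog", {"a": 0.9, "blue": 0.1, "dog": 0.8}, gamma=0.5
)
```

Here `new_prompt` is "a blue dog BREAK a - dog", matching the example above.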

A WIP implementation of the "prompt optimization" methods is available in branch "s4a-dev2".

Results:

SD XL



Issues / Pull Requests are welcome!


Tutorial

Improve Stable Diffusion Prompt Following & Image Quality Significantly With Incantations Extension



Also check out:



Credits
