Add a family of AND_ALIGN_D_S keywords

wbclark commented 7 months ago

Hey @ljleb, I'm looking for feedback on a novel method that I developed for blended guidance.

The idea is to compare guidance conds in two different-sized windows around each latent pixel: a detail window of size DxD and a structure window of size SxS. For each latent pixel, we compute a pair of alignment maps between the two tensors, called the detail_alignment and structure_alignment respectively.

The detail_alignment map considers all 2x2 sub-regions within each DxD region, computes the cosine similarity between the parent and child tensors in each 2x2 sub-region, and then averages these values over all such sub-regions in the DxD region. The structure_alignment map is computed the same way, using regions of size SxS instead.
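A minimal sketch of the windowed averaging described above, written with plain Python lists so it runs without dependencies. This is not the code from the PR: the function names, the choice to flatten all channels of a 2x2 patch into one vector, and the border handling (shifting the window inward so it always fits) are assumptions made for illustration.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity of two flat vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def patch_similarity(parent, child):
    """Cosine similarity of every 2x2 patch, with all channels flattened together.

    Latents are nested lists indexed as latent[channel][y][x].
    Returns an (H-1) x (W-1) grid indexed by the patch's top-left pixel.
    """
    h, w = len(parent[0]), len(parent[0][0])

    def patch(latent, y, x):
        return [ch[y + dy][x + dx] for ch in latent for dy in (0, 1) for dx in (0, 1)]

    return [[cosine(patch(parent, y, x), patch(child, y, x)) for x in range(w - 1)]
            for y in range(h - 1)]

def alignment_map(parent, child, size):
    """Average the 2x2 patch similarities inside a size x size window per pixel."""
    h, w = len(parent[0]), len(parent[0][0])
    if not (2 <= size <= min(h, w)):
        raise ValueError("window size must be between 2 and the latent size")
    sims = patch_similarity(parent, child)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Assumption: shift the window inward at the borders so it always fits.
            y0 = min(max(y - size // 2, 0), h - size)
            x0 = min(max(x - size // 2, 0), w - size)
            # A size x size window contains (size - 1)^2 patches of 2x2 pixels.
            vals = [sims[py][px]
                    for py in range(y0, y0 + size - 1)
                    for px in range(x0, x0 + size - 1)]
            out[y][x] = sum(vals) / len(vals)
    return out
```

Under these assumptions, `alignment_map(parent, child, D)` would play the role of detail_alignment and `alignment_map(parent, child, S)` the role of structure_alignment.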

The cosine similarity of two random vectors in R^n tends to 0 as n grows (intuitively, random vectors become increasingly orthogonal as the dimension of the space increases). Averaging cosine similarity over 2x2 sub-regions, instead of computing it directly over regions of size DxD and SxS, therefore acts as a normalization that keeps the values computed at the two resolutions comparable.
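The dimensionality effect is easy to check numerically. The sketch below (function names are illustrative, not from the PR) draws pairs of Gaussian random vectors and reports the mean absolute cosine similarity, which shrinks roughly like 1/sqrt(n) as the dimension n grows.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two flat vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_abs_cosine(dim, trials=200, seed=0):
    """Mean |cosine similarity| between pairs of random Gaussian vectors in R^dim."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / trials

if __name__ == "__main__":
    for dim in (4, 16, 64, 256):
        print(dim, round(mean_abs_cosine(dim), 3))
```

This is why a cosine similarity taken over a whole 3x3 window and one taken over a whole 7x7 window would not be on the same scale, while averages of fixed-size 2x2 similarities are.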

The detail alignment tends to be higher when the details are similar between the two latents, and negative when the details in the child latent contrast with those in the parent latent, indicating that the second prompt contains novel details that can be blended. The structure alignment is likewise positive where the structure is similar, and negative where the child latent would significantly diverge from the compositional structure of the parent at resolution SxS.

An alignment weight is computed by starting with the structure alignment and subtracting the detail alignment, giving a single alignment map which is positive when the child latent guidance can enhance the details of the parent latent guidance, without disrupting its structure. The negative values are clamped, and each latent pixel is blended according to its resulting alignment weight.
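As a rough sketch of this step: the weight is the structure alignment minus the detail alignment, with negative values clamped (assumed here to mean clamped to zero). The exact blending formula is not spelled out in this thread, so the linear interpolation toward the child in `blend` is an assumption for illustration, and both function names are hypothetical.

```python
def alignment_weight(structure, detail):
    """Per-pixel weight: structure alignment minus detail alignment, clamped at 0."""
    return [[max(s - d, 0.0) for s, d in zip(srow, drow)]
            for srow, drow in zip(structure, detail)]

def blend(parent, child, weight):
    """Assumed blend: move each parent value toward the child by the pixel's weight.

    parent and child are indexed as latent[channel][y][x]; weight is indexed [y][x]
    and shared across channels.
    """
    return [[[p + wt * (c - p) for p, c, wt in zip(prow, crow, wrow)]
             for prow, crow, wrow in zip(pch, cch, weight)]
            for pch, cch in zip(parent, child)]

# Pixel 0: similar structure, contrasting detail -> the child is blended in.
# Pixel 1: detail alignment exceeds structure alignment -> weight is clamped to 0.
weight = alignment_weight([[0.75, 0.25]], [[-0.25, 0.5]])
blended = blend([[[1.0, 1.0]]], [[[3.0, 3.0]]], weight)
```

Note that the weight can exceed 1 (up to 2 when the structure alignment is 1 and the detail alignment is -1), so under this assumed formula the result can overshoot the child value.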

I currently have this implemented for a range of values for D and S, from 2 to 33 latent pixels each, for experimentation. Decreasing the value of D will typically make it easier for the child prompt to influence the details of the resulting image, while increasing the value of S will work to relax the preservation of higher level compositional structure. For most prompts, structure prompt AND_ALIGN_3_7 detail prompt feels like a good starting point, but I recommend trying different combinations with a range of different prompts to get a feel for how they behave.

I've tested this method rather extensively and found it to be very useful, and I'm excited to share it. Please let me know if you have any questions, comments, suggestions, feedback, etc.

ljleb commented 7 months ago

This looks really cool! I'll run some generations later today to try to get a feel for it.

From a usability perspective, we might want to add 2 sliders in the prompt formatter (1 for D, 1 for S) that become visible when selecting "Alignment blend". Using a AND_ALIGN b without D and S could fall back on reasonable defaults, perhaps 3 and 7 as you suggested.

Instead of generating all possible keywords, we could use a regex in the parser and make the conciliation strategy an actual object instead of an enum (so that it can contain extra data like D and S). In Rust, enums can have associated values, but in Python it may not be practical to keep it an enum after all.
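A minimal sketch of that idea, assuming a hypothetical `AlignStrategy` dataclass and `parse_align_keyword` helper (neither name comes from the extension's code). A single regex accepts both the bare keyword and the `_D_S` form, and the bare keyword falls back on the 3 and 7 defaults mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AlignStrategy:
    """Hypothetical strategy object carrying the two window sizes."""
    detail_size: int = 3
    structure_size: int = 7

# Matches "AND_ALIGN" or "AND_ALIGN_<D>_<S>"; D and S must appear together.
_ALIGN_RE = re.compile(r"AND_ALIGN(?:_(\d+)_(\d+))?")

def parse_align_keyword(token: str) -> Optional[AlignStrategy]:
    """Return the strategy for an alignment keyword, or None for any other token."""
    match = _ALIGN_RE.fullmatch(token)
    if match is None:
        return None
    if match.group(1) is None:
        return AlignStrategy()  # bare keyword: use the defaults
    return AlignStrategy(int(match.group(1)), int(match.group(2)))
```

For example, `parse_align_keyword("AND_ALIGN_4_8")` yields `AlignStrategy(detail_size=4, structure_size=8)`, while `parse_align_keyword("AND_ALIGN")` yields the defaults. Range checks on D and S are left out of this sketch.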

Can't wait to try this!

wbclark commented 7 months ago

> From a usability perspective, we might want to add 2 sliders in the prompt formatter (1 for D, 1 for S) that become visible when selecting "Alignment blend". Using a AND_ALIGN b without D and S could fallback on reasonable defaults, perhaps 3 and 7 as you suggested.

Indeed. For me, a big unresolved question is how to make this a tool that people can use easily and intuitively without a deep understanding of diffusion models, latent space, linear algebra, or differential equations, while ideally still allowing people to tweak parameters if they want to dial in a specific vision of what they are trying to express.

And since it's your project, I thought it would be better to present the general method first and then start a conversation about what the best UX is.

One idea that occurs to me is that we could ship a few presets like AND_ALIGN_FINE, AND_ALIGN_MEDIUM, AND_ALIGN_COARSE, and a general AND_ALIGN_D_S with sliders like you suggested for D and S. I'm new to Gradio though, and might need some suggestions on the best way to implement sliders for a single keyword only.

> Can't wait to try this!

Cool, I'm glad. :)

Some of the more interesting tests that I did beyond just varying D and S, that helped develop some intuition for how it works:

I. When the two prompts are very similar and have only subtle differences
II. When the two prompts are very different
III. Disabling AuxCondDeltaVisitor for the first ~10% of steps (very powerful for giving the first prompt a head start to define the composition, so that structure alignment is already higher when the 2nd prompt is enabled, although this same principle also works well with other conciliation strategies; see https://github.com/ljleb/sd-webui-neutral-prompt/issues/25)

Some tests I'd like to explore further:

IV. Is there any use for D > S? It's not exactly the same as interchanging the two prompts (due to the clamping of negative alignment weight, I think)
V. Testing without the clamping of negative alignment weight
VI. Computing a binary alignment mask instead, like AND_SALT (early results were very promising here, but I haven't tried it since I made the change to use average similarity over 2x2 sub-neighborhoods)
VII. Nesting like [ a AND_ALIGN_11_19 b ] AND_ALIGN_3_11 c

wbclark commented 7 months ago

To facilitate comparison of the techniques without repeatedly checking out a different commit and restarting, I pushed an additional commit that adds AND_MASK_ALIGN_D_S, which uses a binary alignment_mask instead of the soft alignment_weight. (I didn't add UI support for it, but the prompt keyword is parsed.)
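For comparison, here is a sketch of how a binary mask could differ from the soft weight. The threshold of 0 and the function name are assumptions; the commit may derive its mask differently.

```python
def alignment_mask(structure, detail, threshold=0.0):
    """Binary variant: 1.0 where structure minus detail exceeds the threshold, else 0.0."""
    return [[1.0 if s - d > threshold else 0.0 for s, d in zip(srow, drow)]
            for srow, drow in zip(structure, detail)]

# The soft weight for these pixels would be 1.0, 0.25 and 0.0 (after clamping);
# the mask keeps only "blend" or "don't blend".
mask = alignment_mask([[0.75, 0.5, 0.25]], [[-0.25, 0.25, 0.5]])
```

With a mask, every selected pixel receives the child guidance at full strength, whereas the soft weight scales the contribution by how much the structure alignment exceeds the detail alignment.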

This commit is purely for experimentation. AND_ALIGN_D_S is unchanged. I will try to carve out some time for a thorough comparison later.

ljleb commented 7 months ago

Apologies for the delay here. I generated 2 large grids locally to try to understand the method a bit better, but I haven't had time to dig further yet.

> This commit is purely for experimentation. AND_ALIGN_D_S is unchanged. I will try to carve out some time for a thorough comparison later.

Sure. We should look for the one that gives the best results and only keep this one in my opinion. Let me know if this is what you intended to do.

Is it okay with you if I update the UX code and refactor the parser directly in your PR?

wbclark commented 7 months ago

> Sure. We should look for the one that gives the best results and only keep this one in my opinion. Let me know if this is what you intended to do.

Yes, we're on the same page there. After some further testing today, I think I like AND_ALIGN_D_S better as it allows better blending. The one disadvantage is that sometimes you have to increase the weight of the prompt quite a bit to get the intended effect to be strong enough.

> Is it okay with you if I update the UX code and refactor the parser directly in your PR?

Yeah, it's totally fine with me. Reading the code in this extension is what inspired the idea in the first place and I'm happy to work together on it.

ljleb / sd-webui-neutral-prompt

Add a family of AND_ALIGN_D_S keywords #63