ljleb / sd-webui-neutral-prompt

Collision-free AND keywords for a1111 webui!
MIT License

Support stable-diffusion-webui-forge #69

Open · wbclark opened 5 months ago

wbclark commented 5 months ago

Hijacks the forge_sample function and calls it separately for each individual text conditioning. The outputs are then recombined into a format that the extension already knows how to work with.

This doesn't use forge's patcher or take advantage of all of forge's performance optimizations brought over from ComfyUI via ldm_patched, but it does guarantee backwards compatibility for base A1111 users, and allows Forge users to use the extension without maintaining a separate install of A1111.
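
Roughly, the hijack looks like the sketch below (illustrative names and a simplified signature; the real forge_sample lives in forge's code and takes more arguments):

```python
import torch

import modules_forge.forge_sampler as forge_sampler  # module path may differ

# keep a reference to the unpatched function so it can still do the real work
original_forge_sample = forge_sampler.forge_sample

def hijacked_forge_sample(model, x, conds, uncond, *args, **kwargs):
    # call the original sampler once per individual text conditioning
    per_cond_outputs = [
        original_forge_sample(model, x, [cond], uncond, *args, **kwargs)
        for cond in conds
    ]
    # recombine the per-cond denoised latents into the [num_conds, B, C, H, W]
    # layout that the extension's existing combination code understands
    return torch.stack(per_cond_outputs, dim=0)

forge_sampler.forge_sample = hijacked_forge_sample
```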

For a discussion of some of the challenges of manipulating denoised latents more directly using built-in functionality of Forge, see: https://github.com/hako-mikan/sd-webui-regional-prompter/issues/299

ljleb commented 5 months ago

Is this PR still a work in progress? I am playing a bit with the code and it looks relatively complete.

There is something I am not sure I completely grasp, and I would appreciate it if you could help clarify this part. There seems to be a lot going on in the code: is there something preventing us from calling the original forge function with each auxiliary latent stacked on the batch dimension, keeping the settings as they are as input to forge_sample?
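
For concreteness, what I mean is something like this (toy tensors, not forge's actual data structures):

```python
import torch

# two auxiliary text conditionings, each shaped [batch, tokens, embed_dim]
cond_a = torch.randn(1, 77, 768)
cond_b = torch.randn(1, 77, 768)

# stack them on the batch dimension so a single forge_sample call
# denoises both in one forward pass
stacked = torch.cat([cond_a, cond_b], dim=0)   # shape [2, 77, 768]

# the sampler would return one denoised latent per row, which can then
# be split back out per conditioning
denoised = torch.randn(2, 4, 144, 112)         # placeholder for the sampler output
latent_a, latent_b = denoised.chunk(2, dim=0)  # [1, 4, 144, 112] each
```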

Good work overall!

ljleb commented 5 months ago

I think I should also clarify something. You may already know this, but let me explain this just in case.

The reason the original webui implementation of combine_denoised is called for the AND prompts at the base of the prompt tree when the extension is enabled is that other extensions also patch combine_denoised in A1111 (see for example the dynamic thresholding extension). If we don't use the original function as much as possible, all previous patches are simply overridden, which can break other extensions' assumptions. In the case of forge, if anything changes in the original implementation (e.g. a new paper implementation related to denoising), we can be sure people will eventually report errors when using the extension.

If there is a way to implement this without monkey patching, I think we should consider it. I don't know exactly what builtin extension mechanisms are in place in forge; I need to investigate the code, or maybe we should ask lllyasviel to add something for this use case / contribute a PR to forge ourselves.
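
To illustrate the pattern of preserving earlier patches (the module path below may differ between webui versions):

```python
from modules import sd_samplers_cfg_denoiser  # a1111 module hosting CFGDenoiser

# capture combine_denoised as it is right now -- other extensions such as
# dynamic thresholding may already have wrapped it
original_combine_denoised = sd_samplers_cfg_denoiser.CFGDenoiser.combine_denoised

def patched_combine_denoised(self, x_out, conds_list, uncond, cond_scale):
    # handle only this extension's keywords here; for the plain AND prompts
    # at the base of the tree, fall through to the captured original so
    # earlier patches keep running
    return original_combine_denoised(self, x_out, conds_list, uncond, cond_scale)

sd_samplers_cfg_denoiser.CFGDenoiser.combine_denoised = patched_combine_denoised
```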

wbclark commented 5 months ago

> Is this PR still a work in progress? I am playing a bit with the code and it looks relatively complete.

Yes -- there are a lot of things I want to improve/fix, but I don't mind merging a hacky but working version and cleaning it up in a second pass.

I'm about to be called away to a family engagement shortly. I'll be able to respond more fully to the comments later tonight and push some fixes I've been working on.

wbclark commented 5 months ago

> The reason the original webui implementation of combine_denoised is called for the AND prompts at the base of the prompt tree when the extension is enabled is that other extensions also patch combine_denoised in A1111 (see for example the dynamic thresholding extension). If we don't use the original function as much as possible, all previous patches are simply overridden, which can break other extensions' assumptions. In the case of forge, if anything changes in the original implementation (e.g. a new paper implementation related to denoising), we can be sure people will eventually report errors when using the extension.
>
> If there is a way to implement this without monkey patching, I think we should consider it. I don't know exactly what builtin extension mechanisms are in place in forge; I need to investigate the code, or maybe we should ask lllyasviel to add something for this use case / contribute a PR to forge ourselves.

The issue, I think, is what is returned here: https://github.com/lllyasviel/stable-diffusion-webui-forge/blob/main/ldm_patched/modules/samplers.py#L289

Generating at 896x1152 with a batch size of 1, with the prompt my first prompt AND my second prompt and no monkey patch, cond_pred will have shape torch.Size([1, 4, 144, 112])

If I use the same settings with the prompt

my first prompt
AND my second prompt
AND my third prompt
AND my fourth prompt
AND my fifth prompt
AND my sixth prompt
AND my seventh prompt
AND my eighth prompt
AND my ninth prompt

I still get cond_pred with shape torch.Size([1, 4, 144, 112]) -- because ldm_patched already combined the auxiliary conds and returned just the single cond latent

Increasing batch size to 8, I instead get torch.Size([8, 4, 144, 112])
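
These shapes follow from the VAE's 8x spatial downscaling, independent of how many AND prompts were merged upstream:

```python
width, height, batch_size = 896, 1152, 1

latent_channels = 4     # SD latent space has 4 channels
latent_h = height // 8  # 1152 // 8 == 144
latent_w = width // 8   # 896 // 8 == 112

# cond_pred comes back as [batch, channels, height/8, width/8] no matter
# how many AND prompts ldm_patched already merged together
assert (batch_size, latent_channels, latent_h, latent_w) == (1, 4, 144, 112)
```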

I think you knew this already, because you suggested:

> There is something I am not sure I completely grasp, and I would appreciate it if you could help clarify this part. There seems to be a lot going on in the code: is there something preventing us from calling the original forge function with each auxiliary latent stacked on the batch dimension, keeping the settings as they are as input to forge_sample?

I think that may be cleaner for playing nice with other extensions; the part I'm trying to figure out is how to perform the surgery on denoiser_params.text_cond and cond_composition to make that work.

ljleb commented 5 months ago

> I think that may be cleaner for playing nice with other extensions; the part I'm trying to figure out is how to perform the surgery on denoiser_params.text_cond and cond_composition to make that work.

Yeah, it seems tricky to make work at first glance. Let me see if I can get it working on my side and I'll come back here with what I find. For personal reasons I don't want to falsely promise that I will find something by tonight, but I really intend to look into this today. I can also focus on something else if you want to tackle this part.

wbclark commented 5 months ago

> Yeah, it seems tricky to make work at first glance. Let me see if I can get it working on my side and I'll come back here with what I find. For personal reasons I don't want to falsely promise that I will find something by tonight, but I really intend to look into this today. I can also focus on something else if you want to tackle this part.

so ldm_patched.modules.samplers.calc_cond_uncond_batch returns a pair of denoised latents (cond_pred, uncond_pred), and the surrounding sampling function then applies classifier-free guidance (a custom implementation can be supplied via model_options) to compute a single cfg_result. if model_options includes any sampler_post_cfg_function entries, those can further alter cfg_result before it is returned.

this extension wants to work with a tree structure of latent vectors and compute the cfg_result by traversing that tree, but ldm_patched can only compute cfg_result from two vectors at a time

I am thinking maybe we should work at a lower level, at ldm_patched.modules.samplers.sampling_function or ldm_patched.modules.samplers.calc_cond_uncond_batch, to call the original calc_cond_uncond_batch multiple times and collect the results.
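
roughly, the flow I'm describing is this (paraphrased from ldm_patched; the exact dict passed to the hooks differs):

```python
def sampling_function_sketch(model, x, timestep, uncond, cond, cond_scale, model_options):
    # one paired evaluation: ldm_patched's calc_cond_uncond_batch (assumed in
    # scope here) returns both denoised predictions at once
    cond_pred, uncond_pred = calc_cond_uncond_batch(
        model, cond, uncond, x, timestep, model_options)

    # classifier-free guidance; a custom "sampler_cfg_function" supplied via
    # model_options can replace this formula
    cfg_result = uncond_pred + (cond_pred - uncond_pred) * cond_scale

    # registered post-CFG hooks may further alter the result
    for fn in model_options.get("sampler_post_cfg_function", []):
        cfg_result = fn({"denoised": cfg_result, "cond": cond, "uncond": uncond,
                         "model": model, "sigma": timestep, "input": x})
    return cfg_result
```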

for example, let's say the batch size is 2, and B=1 has 5 prompts while B=2 has only 2 prompts. we'd call calc_cond_uncond_batch 7 times, and collect the results like

[
  [
    (batch1_nocond_pred, batch1_uncond_pred),
    (batch1_cond1_pred, batch1_cond2_pred),
    (batch1_cond3_pred, batch1_cond4_pred),
    (batch1_cond5_pred, _)
  ],
  [
    (batch2_nocond_pred, batch2_uncond_pred),
    (batch2_cond1_pred, batch2_cond2_pred),
    (batch2_cond3_pred, batch2_cond4_pred)
  ]
]

(the noconds aren't necessary for current functionality, but they'd be almost free to implement with this design, and would enable cool features like https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_perpneg.py)

we'd then traverse the cond tree for each batch to compute the modified (cond_pred, uncond_pred) in a form that other extensions can still work with

an additional benefit is that this would roughly halve the number of calls to calc_cond_uncond_batch vs. the current implementation in this PR, which makes one call for every single cond (including the uncond) and effectively discards half of the results
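
a sketch of the pairing I have in mind (hypothetical helper; real conds are richer dicts than shown here):

```python
def collect_paired_preds(calc_cond_uncond_batch, model, x, timestep,
                         model_options, nocond, uncond, conds):
    """Evaluate [nocond, uncond, cond1, ..., condN] two at a time, so every
    call to calc_cond_uncond_batch yields two useful predictions instead of
    one useful prediction and one discarded one."""
    queue = [nocond, uncond] + list(conds)
    results = []
    for i in range(0, len(queue), 2):
        pair = queue[i:i + 2]
        if len(pair) == 1:
            pair.append(None)  # odd count: pad; the second output goes unused
        first_pred, second_pred = calc_cond_uncond_batch(
            model, pair[0], pair[1], x, timestep, model_options)
        results.append((first_pred, second_pred))
    return results
```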

I think it still has to be a monkey patch rather than a change to forge itself, because forge doesn't have any concept of manipulating trees of conds or latents

ljleb commented 5 months ago

> for example, let's say the batch size is 2, and B=1 has 5 prompts while B=2 has only 2 prompts. we'd call calc_cond_uncond_batch 7 times, and collect the results like

I looked at the code and the web API docs for the txt2img route, and now I'm skeptical of my previous statement about N being different for each index in B, because I only see a single prompt in the suggested JSON body. I recall that, at least within an extension in A1111, it is possible to set a different prompt for each batch index in the p object.

So after seeing this, I made a test extension just now that overwrites the second prompt in the batch with 3 positive AND conds while leaving the first prompt with no AND, and it seems that in forge, forge_sampler.cond_from_a1111_to_patched_ldm_weighted() discards any extraneous condition (zip(*weights) clips the extra conditions); it keeps only the smallest number of conditions. So my assumption above that N can be different for each index in B doesn't hold in forge. However, the neutral prompt extension will expect conds that will be missing if we discard them in larger trees, which will definitely create issues (it will raise an exception at generation time), so maybe we can use padding for this? I'm not sure honestly.
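
The clipping falls out of Python's zip, which stops at the shortest iterable:

```python
# per-batch-index cond weights from my test: index 0 has no AND (one cond),
# index 1 has the 3 positive AND conds
weights = [
    [1.0],            # batch index 0
    [1.0, 1.0, 1.0],  # batch index 1
]

# zip(*weights) transposes, but stops at the shortest inner list, silently
# dropping the two extra conds of batch index 1
print(list(zip(*weights)))  # [(1.0, 1.0)]
```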

> for example, let's say the batch size is 2, and B=1 has 5 prompts while B=2 has only 2 prompts. we'd call calc_cond_uncond_batch 7 times, and collect the results like

Yeah maybe that'd be a better approach to padding... I believe we have to change the behavior of forge in this way, so as to allow a different number of conds per batch index, to prevent crashes in the extension.

> (the noconds aren't necessary for current functionality, but they'd be almost free to implement with this design, and would enable cool features like https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_perpneg.py)

Perpneg is implemented with the AND_PERP keyword already. The tree structure is supposed to enable these alternative denoising methods. I'm not sure I fully understand the "nocond" concept, so please ignore this paragraph if it doesn't add any useful info for you.

> I think it still has to be a monkey patch rather than a change to forge itself, because forge doesn't have any concept of manipulating trees of conds or latents

I think that's a valid point; it's much simpler to implement it like this and I totally support this approach. Another possibility to look at could be to give forge a pre-processing list of modifiers and a post-processing list of modifiers applied in reverse order of registration. That way, we can change the shape of what needs to be processed before and after denoising while respecting the order of other patches and whatever expectations they have about shapes. But that's a bit involved... I think it could be tackled at a different time, if we ever do it.

Of course, the simplest solution would be to have a tree of prompts natively in the base repo, but I don't think the forge maintainers would welcome this kind of contribution, as it is a concept very specific to neutral prompt.

ljleb commented 5 months ago

> Yeah maybe that'd be a better approach to padding... I believe we have to change the behavior of forge in this way, so as to allow a different number of conds per batch index, to prevent crashes in the extension.

Another approach might be to just catch the error and print it when conds are missing in the tree, and then continue generation normally. It isn't a very frequent use case to begin with. IIUC it would simplify the implementation to just let forge strip the extra conds?

wbclark commented 5 months ago

> I think that's a valid point; it's much simpler to implement it like this and I totally support this approach. Another possibility to look at could be to give forge a pre-processing list of modifiers and a post-processing list of modifiers applied in reverse order of registration. That way, we can change the shape of what needs to be processed before and after denoising while respecting the order of other patches and whatever expectations they have about shapes. But that's a bit involved... I think it could be tackled at a different time, if we ever do it.
>
> Of course, the simplest solution would be to have a tree of prompts natively in the base repo, but I don't think the forge maintainers would welcome this kind of contribution, as it is a concept very specific to neutral prompt.

Maybe forge would accept a contribution of a calc_multi_latent_batch method. That would be useful to other extensions as well, and there seems to be a recognition of the value it would add already: https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/242#issuecomment-2008147318

I think a good path forward is:

  1. monkey patch ldm_patched.modules.samplers.sampling_function to move this PR forward and get this extension working on forge
  2. submit a draft PR to forge for multi latent sampling, get feedback from forge and regional prompter maintainers
  3. reduce reliance on the monkey patch here once forge natively supports multi latent sampling

wbclark commented 5 months ago

I pushed a draft of the type of implementation I have in mind

It is surely bug ridden and likely doesn't work at all yet, but I have to step away for the evening

ljleb commented 5 months ago

>   1. monkey patch ldm_patched.modules.samplers.sampling_function to move this PR forward and get this extension working on forge
>   2. submit a draft PR to forge for multi latent sampling, get feedback from forge and regional prompter maintainers
>   3. reduce reliance on the monkey patch here once forge natively supports multi latent sampling

Seems like a sensible approach; let's go with that.

I pushed changes that make the code work at a basic level. With multiple AND keywords, it seems to give different results when the extension is enabled vs when it is disabled; this needs to be investigated (it is probably my fault). I am also not totally sure that patches and other extensible properties of conds (like cond['area']) are properly working, although it is possible that the current code respects comfy-like extensions properly. I intend to look into it further tomorrow.

wbclark commented 5 months ago

> With multiple AND keywords, it seems to give different results when the extension is enabled vs when it is disabled; this needs to be investigated (it is probably my fault).

Last night, I tested prompt forms like

[
  first prompt
  AND second prompt
] AND [
  third prompt
  AND fourth prompt
]

with weights of :0 assigned to different combinations of leaf prompts and bracketed groups, and in all cases the generated result was the same as if the zeroed-out prompts had been removed completely

this leads me to believe that the issue is not with the way latents are combined, but rather with the way we're trying to sample latents in the first place -- maybe ldm_patched doesn't like treating conds and unconds as interchangeable :/

if I have some time on my lunch break, I'll try calling the original function once per cond to see if it makes a difference

ljleb commented 5 months ago

> this leads me to believe that the issue is not with the way latents are combined, but rather with the way we're trying to sample latents in the first place -- maybe ldm_patched doesn't like treating conds and unconds as interchangeable :/

I made some progress on this: it seems that in the original samplers.sampling_function(), CFG is multiplied by the sum of cond["strength"] (see how edit_strength is computed). I am not sure why this decision was made, but it makes even the builtin AND behave differently than it does in A1111 already.

While it might not be the only problem in the code, taking this into consideration fixes some of the issues I was having yesterday.
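
If I read the code right, the behavior amounts to something like this (a paraphrase, not forge's exact code):

```python
def cfg_with_edit_strength(cond_pred, uncond_pred, cond_scale, cond):
    # forge sums the per-cond strengths and folds that sum into the guidance
    # step, which is what makes the builtin AND diverge from A1111
    edit_strength = sum(c.get('strength', 1.0) for c in cond)
    return uncond_pred + (cond_pred - uncond_pred) * cond_scale * edit_strength
```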

Do you mind if I push my attempt at fixing this to your branch? I don't want to conflict with any local changes you have.

wbclark commented 5 months ago

> I made some progress on this: it seems that in the original samplers.sampling_function(), CFG is multiplied by the sum of cond["strength"] (see how edit_strength is computed).

lol, that would explain why results were so much better for me when I implemented weight normalization >_<

> I am not sure why this decision was made, but it makes even the builtin AND behave differently than it does in A1111 already.

wow you're right, and it's forge-specific / not inherited from comfy -- https://github.com/lllyasviel/stable-diffusion-webui-forge/commit/998327c744d83a9327289c5639afb603cc2d3464

might be worth raising an issue there to inquire

> Do you mind if I push my attempt at fixing this to your branch? I don't want to conflict with any local changes you have.

go for it, I can test on a different branch or rebase as needed

ljleb commented 5 months ago

Damn, this being forge-specific is really surprising. I pushed the changes; let me know if it helps in any way with your testing (if you find time, of course).

wbclark commented 5 months ago

I got the same results when I sampled once per cond as when I sampled in pairs, treating every other cond as an uncond, so our efficient sampling idea works :tada:

I also got consistent results when distributing weights of bracketed groups over individual prompt weights, so your fix seems to work? But see below for a question about why.

wbclark commented 5 months ago

@ljleb I fixed an issue that occurred when using this extension together with forge's builtin self-attention guidance extension; it happened because that extension calls calc_cond_uncond_batch separately after computing CFG, at https://github.com/lllyasviel/stable-diffusion-webui-forge/blob/main/ldm_patched/contrib/external_sag.py#L155

ljleb commented 5 months ago

Awesome! Nice find. This is starting to look solid.

For personal reasons, I have not been able to bring myself to finish testing/reviewing unmerged work on this repo. As you have spent a lot of time and effort here, I'm starting to feel really bad about this situation. There is unfortunately not much I can do about it in the short term, but I'm actively seeking a solution.

As you seem to have a couple of ideas for the tree of prompts and other syntax extensions, I was initially considering, as an exception, eventually handing ownership over to you if I can't somehow resolve my personal situation (assuming you would be willing to take on this burden).

However, I am not usually in favor of handing ownership to anyone, especially now in light of the recent xz situation. If the repo becomes stale for long enough (like a couple of months) and your fork has useful additions in it, w-e-w might accept a PR to change the URL in the extensions listing.

Before it gets to that point, I intend to test this code and move forward with it when I get the opportunity, but I can't promise anything. Sorry for the trouble this causes.

ljleb commented 5 months ago

Upon testing this further, I find that a basic prompt with a single AND does not reproduce the results I get with the extension disabled. I learned why edit_strength was used: it is because forge merges the prompts on a normalized scale in calc_cond_uncond_batch. So, to restore the behavior people get in a1111, it multiplies the entire batch by the sum of the scales in sampling_function.
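
In other words, the normalization and the later multiply are meant to cancel out. A toy check of the identity, with scalars standing in for latents:

```python
weights = [1.0, 0.75, 0.25]
conds = [3.0, 5.0, 7.0]  # stand-in scalars for per-prompt latents

total = sum(weights)
normalized = [w / total for w in weights]  # the merge inside calc_cond_uncond_batch

merged = sum(w * c for w, c in zip(normalized, conds))
restored = merged * total                  # the multiply in sampling_function

# restoring the scale reproduces the plain a1111-style weighted sum
assert abs(restored - sum(w * c for w, c in zip(weights, conds))) < 1e-9
```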

wbclark commented 5 months ago

> Upon testing this further, I find that a basic prompt with a single AND does not reproduce the results I get with the extension disabled. I learned why edit_strength was used: it is because forge merges the prompts on a normalized scale in calc_cond_uncond_batch. So, to restore the behavior people get in a1111, it multiplies the entire batch by the sum of the scales in sampling_function.

Thank you, I'll block out some time to look into this tonight.

ljleb commented 5 months ago

> Thank you, I'll block out some time to look into this tonight.

I have started working on this and fixed it, but the results are still a little different for some other reason (I suspect it is caused by the top-level AND prompts being merged in delta space, which is not exactly how they are merged normally; it is mathematically equivalent, but it might be introducing imprecision into the denoising process).
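
To spell out the suspicion: merging in delta space is algebraically identical, but the floating point evaluation order differs. A toy comparison (not the extension's exact math):

```python
import torch

torch.manual_seed(0)
uncond = torch.randn(4, 144, 112)
conds = [torch.randn(4, 144, 112) for _ in range(3)]
weights = [1.0, 0.6, 0.4]
scale = 7.0

# direct merge: weighted sum of conds first, then one guidance step
direct = uncond + scale * (
    sum(w * c for w, c in zip(weights, conds)) - sum(weights) * uncond)

# delta-space merge: accumulate per-cond deltas against the uncond
delta = sum(w * (c - uncond) for w, c in zip(weights, conds))
via_deltas = uncond + scale * delta

# identical on paper, equal only up to float32 rounding in practice
print(torch.allclose(direct, via_deltas, atol=1e-5))  # True
print((direct - via_deltas).abs().max())              # tiny nonzero residue
```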

I will take some time to finish this tomorrow if I can't do it tonight.