klimaleksus / stable-diffusion-webui-embedding-merge

Extension for AUTOMATIC1111/stable-diffusion-webui for creating and merging Textual Inversion embeddings at runtime from string literals.
The Unlicense

Better SDXL support? Individual control over two CLIPs #17

Open aleksusklim opened 7 months ago

aleksusklim commented 7 months ago

How could the merge expression syntax be enhanced to allow independent manipulation of the L (CLIP, as in SD1) and G (OpenCLIP) text encoders of SDXL?

Currently <'cat'*2+'1girl'> will:

  1. Multiply L and G of "cat" by 2.0, independently.
  2. Pad shortest token ("cat") with zero vectors to max length (of "1girl" which is 2)
  3. Sum L of padded "cat" with L of "1girl" and put to L; accordingly, do the same with G.
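
The three steps above can be sketched as follows. This is only an illustration, not the extension's actual code: plain Python lists stand in for tensors, the `scale`, `pad` and `add` helpers are hypothetical names, and toy widths (2 for L, 3 for G) replace the real hidden sizes.

```python
# Hypothetical stand-in: an embedding is a dict with "L" and "G" parts,
# each a list of token vectors (plain lists of floats instead of tensors).

def scale(emb, factor):
    """Multiply every component of both parts by the same factor."""
    return {part: [[x * factor for x in vec] for vec in vecs]
            for part, vecs in emb.items()}

def pad(vecs, length):
    """Append zero vectors until the part holds `length` token vectors."""
    width = len(vecs[0])
    return vecs + [[0.0] * width for _ in range(length - len(vecs))]

def add(a, b):
    """Pad the shorter operand with zeros, then sum L with L and G with G."""
    out = {}
    for part in ("L", "G"):
        n = max(len(a[part]), len(b[part]))
        pa, pb = pad(a[part], n), pad(b[part], n)
        out[part] = [[x + y for x, y in zip(va, vb)]
                     for va, vb in zip(pa, pb)]
    return out

# Toy data: "cat" is one token, "1girl" is two tokens.
cat = {"L": [[1.0, 2.0]], "G": [[1.0, 1.0, 1.0]]}
girl = {"L": [[0.5, 0.5], [1.0, 0.0]],
        "G": [[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]]}

merged = add(scale(cat, 2.0), girl)  # mirrors <'cat'*2+'1girl'>
print(merged["L"])
print(merged["G"])
```

Note that L and G never interact here: every operation runs on each part separately, which is why the current syntax offers no way to treat them differently.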

What do we want:

What we cannot have:

A few ideas:

  1. Two different merge expressions, controlling L (first part) and G (second part) separately:

<'use clip'*1.4 | 'this is OpenCLIP'*0.5>

  2. Zero-fill L/G operator:

<'this is OpenCLIP'*0.5:G + 'use clip'*1.4:L> ('X':L will zero-fill G-part of 'X'; read as "use L" )

Also see https://github.com/klimaleksus/stable-diffusion-webui-embedding-merge/commit/a89dde64440f24a872231259304e148d2258da8b#commitcomment-140709559

aleksusklim commented 7 months ago

Also, how about a checkbox in EM tab named Save converted SD1/SD2/SDXL versions in separate files, that will:

  1. In SD1 mode create /embeddings/embedding_merge/SDXL/%name%.safetensors with zero G but 768 of L
  2. In SD2 mode create /embeddings/embedding_merge/SDXL/%name%.safetensors with zero L but 1280 of G
  3. In SDXL mode create /embeddings/embedding_merge/SD1/%name%.safetensors with 768 of L and /embeddings/embedding_merge/SD2/%name%.safetensors with 1280 of G
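
A minimal sketch of the zero-fill conversion idea, covering only the SD1 and SDXL directions. The function names and the dict layout are assumptions made for illustration; the widths are the ones quoted in the list above, and plain lists stand in for tensors.

```python
L_WIDTH, G_WIDTH = 768, 1280  # vector widths quoted in the list above

def sd1_to_sdxl(l_vectors):
    """SD1 -> SDXL: keep the 768-wide L vectors and zero-fill G."""
    return {"L": [list(v) for v in l_vectors],
            "G": [[0.0] * G_WIDTH for _ in l_vectors]}

def sdxl_to_sd1(emb):
    """SDXL -> SD1: keep only the L part and drop G."""
    return [list(v) for v in emb["L"]]

sd1 = [[0.1] * L_WIDTH, [0.2] * L_WIDTH]  # a toy two-token SD1 embedding
xl = sd1_to_sdxl(sd1)
print(len(xl["L"][0]), len(xl["G"][0]))
```

The G part gets one zero vector per L vector, so both parts keep the same token count.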

This way you would be able to convert between different embeddings by checking this flag and loading a correct base model to switch mode!

aleksusklim commented 7 months ago

Can anybody explain to me why I get an embedding hidden size of 1024 for SD2 but 1280 for SDXL part G? For SD1 and SDXL part L it is 768, as expected.

silveroxides commented 6 months ago

EDIT: I realized that | could also be used instead of +, so that instead of merging two sentences that each target both CLIPs, it merges two sentences that target one CLIP each. That way it will not interfere with the single-quoted part the way the example below might.

What if adding | inside the single quotes made everything before it target CLIP L and everything after it target CLIP G, and also moved the math operations to the left side of the single quotes for CLIP L and to the right side for CLIP G? It would probably be good to add a character to the syntax indicating that any character after it that is followed by a numerical value is a math operation, while ' > and + without a following numerical value keep their default functions. In the following example, # is the indicator that there are math operations. The sentence a blue dog acts only on CLIP L and is multiplied by 1.25, while a red dog acts only on CLIP G and is divided by 0.8. They are then merged with a green dog, since the + after the last math operation is not followed by a number; that sentence targets both CLIPs and is multiplied by 0.5. Does this seem reasonable, @aleksusklim?

<#*1.25'a blue dog|a red dog'#/0.8+'a green dog'#*0.5>

Separated prompts for two different text encoders seem unnecessary. Separated prompts for the base model and refiner may work, but the effects are random, and we refrain from implementing this.

Also, this statement about separately prompting the CLIPs, which the Fooocus maintainer wrote, can be dismissed. I have proof that under the right circumstances, separately prompting the CLIP models can provide significant improvement. I have done extensive experiments on this.

aleksusklim commented 6 months ago

<#*1.25'a blue dog|a red dog'#/0.8+'a green dog'#*0.5>

I don't understand this. Firstly, any runtime merge expression ought to start with a single quote, otherwise it won't get parsed (and will mess with other extensions if I tried to interpret it), so the only valid start is <' or <''.

Secondly, you seem to include a control character | inside single quotes. This is wrong, because currently there are no prohibited symbols inside quotes (actually, even the single quote itself can be freely used: to do this you'll have to double it, for example cat's should be <'cat''s'>; I don't see this documented anywhere in the docs, but it was possible from the very beginning!)
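
The quote-doubling rule can be illustrated with a small reader for single-quoted literals. This is a sketch of the rule as described above, not the extension's real parser; `read_literal` is a hypothetical helper.

```python
def read_literal(text, start):
    """Read a single-quoted literal that begins at text[start].

    A doubled quote ('') inside the literal stands for one quote character.
    Returns (decoded_string, index_just_after_the_closing_quote).
    """
    if text[start] != "'":
        raise ValueError("literal must start with a single quote")
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                out.append("'")        # doubled quote: one literal quote
                i += 2
                continue
            return "".join(out), i + 1  # lone quote: end of the literal
        out.append(ch)
        i += 1
    raise ValueError("unterminated literal")

print(read_literal("<'cat''s'>", 1))
```

Because any character other than a lone quote is copied verbatim, a control character such as | inside the quotes would be read as ordinary text.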

| also could be used instead of + so instead of merging two sentences targeting both clip it merges two sentences that target one clip each

Show some examples, and note that I cannot delay multiplication for anything but the directly preceding term, so we cannot have "multiplication from left" like X*'S', but only 'S'*X

I have done extensive experiments on this.

Where, with what software? (Comfy, Diffusers?)

silveroxides commented 6 months ago

I realized the | issue inside single quotes hence the edit. That is why in edit I hinted towards another method.

<'a blue dog'#*1.25|'a red dog'#/0.8+'a green dog'*0.5>

'a blue dog'#*1.25 would represent the CLIP L part.
| would indicate that the single-quoted string to its left is L and the one to its right is G.
'a red dog'#/0.8 would represent the CLIP G part.
+ would function as normal: in this case the L and G parts on the left, which are built from different tokens but sit at the same location in the prompt, are merged with the term on the right, which uses the same tokens for both CLIPs.
'a green dog'*0.5 is created with both CLIPs.

The # would indicate a single-CLIP operation: unless there is a preceding |, that CLIP is L; if a | precedes the string, it is CLIP G and will be merged as such.

If you want only one CLIP, with the other padded with zeros, you would leave that single-quoted string empty and follow it with just a #, then | if the empty part is CLIP L, or, if it is CLIP G, one of the following: +' if more merges follow, or > if nothing else does. Note that the padding should match the token count of the part that is not padded.

<''#|'a red dog'#/0.8>
<'a blue dog'#*1.25|''#>

aleksusklim commented 6 months ago

Confusing.

Couldn't you just |'string to indicate it as L and #'string to indicate it as G, at that rate?

silveroxides commented 6 months ago

Confusing.

Couldn't you just |'string to indicate it as L and #'string to indicate it as G, at that rate?

You are right. I do tend to overcomplicate some things. As long as |'string used alone also does torch.zeros on G, and #'string used alone also does torch.zeros on L, it should be fine, I suppose.

aleksusklim commented 6 months ago

Can you give several examples of how you would use this, especially since you said you already have experience with two separate prompts?

silveroxides commented 6 months ago

Well, the influence over the image is not equal between the two CLIP models, but this can be overcome by multiplying the magnitude of the embedding for CLIP L only; and since CLIP L is the same as the SD 1.5 CLIP, it still has all the OpenAI training. I have already used this, but as a workaround: I create an embedding with an SD 1.5 model and then convert it to work with SDXL by zero-padding G. If you check the Abs parameter when parsing, you can see that the G value is consistently higher than L. Even these out and prompt coherence goes up as well.
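
One possible reading of "even these out", sketched under two loud assumptions: that the Abs value is the mean absolute value of a part's components, and that balancing means scaling L until its mean matches G. Both `mean_abs` and `balance_factor` are hypothetical helpers; the thread does not specify the exact procedure.

```python
def mean_abs(vecs):
    """Mean absolute value over all components (assumed meaning of Abs)."""
    values = [abs(x) for vec in vecs for x in vec]
    return sum(values) / len(values)

def balance_factor(emb):
    """Factor to multiply L by so that its mean |value| matches that of G."""
    return mean_abs(emb["G"]) / mean_abs(emb["L"])

# Toy embedding in which G is noticeably "louder" than L.
emb = {"L": [[0.5, -0.5]], "G": [[1.0, -2.0, 3.0]]}
factor = balance_factor(emb)
print(factor)
```

With real embeddings the factor would be measured per embedding rather than assumed to be constant.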

aleksusklim commented 6 months ago

So you actually need a separate multiplication? Like *L1.7 and *G0.8 instead of just *1.7 and *0.8 ?

This way, to get pure L you would just write 'string'*G0. Would that be enough?

silveroxides commented 6 months ago

Yes that sounds great. It makes sense too since if you are going to target only one clip you would want to use multiplication in order to compensate a bit. At least from my own experience.

Also here are three embeddings that were converted from SD 1.5 to SDXL with the padding technique if you want to check them out for effectiveness, parameters and such: xlconverted.zip

aleksusklim commented 6 months ago

By chance, maybe you know why G part is not compatible with SD2 ? I thought there is OpenCLIP in both SD2 and SDXL.

silveroxides commented 6 months ago

Because the OpenCLIP model used by SD 2.0-2.1 is not G. I believe it is H and the hidden dim size of G is 1280 while H is 1024. Below is screenshot of each text encoders configuration file

Screenshot_20240529_095623.jpg

Screenshot_20240529_095747.jpg
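
The hidden sizes mentioned in this thread can be used to guess which text encoder an embedding part belongs to. A minimal sketch: the table only restates the numbers given above (the H label for SD2 is the belief stated in the comment, not something verified here), and `encoder_for` is a hypothetical helper.

```python
# Hidden sizes as reported in this thread.
ENCODER_BY_WIDTH = {
    768: "CLIP L (SD1, SDXL part L)",
    1024: "OpenCLIP H (SD2)",
    1280: "OpenCLIP G (SDXL part G)",
}

def encoder_for(vecs):
    """Guess the text encoder from the width of the token vectors."""
    widths = {len(v) for v in vecs}
    if len(widths) != 1:
        raise ValueError("all token vectors must have the same width")
    return ENCODER_BY_WIDTH.get(widths.pop(), "unknown")

print(encoder_for([[0.0] * 1024]))
print(encoder_for([[0.0] * 1280]))
```

Since 1024 and 1280 differ, an SD2 embedding cannot simply be reused as the G part of an SDXL one.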

aleksusklim commented 6 months ago

I've pushed two changes:

  1. Now multiplication and division support an L or G suffix: 'a cat'/2G, 'test'*1.5L. Only literal uppercase "L" and "G" are allowed, directly after the number. To keep only L vectors you should do *0G
  2. Now there is a checkbox I described earlier, https://github.com/klimaleksus/stable-diffusion-webui-embedding-merge/issues/17#issuecomment-2050196428 but without SD2 part. So, each saved embedding by default is automatically converted to SD1 or SDXL if possible, and saved with the same name to a subfolder as safetensors.
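
The suffix rule from the first item can be sketched like this. It is an illustration of the described behaviour, not the code that was pushed: the regular expression, the `apply_factor` name and the dict layout are all assumptions, and plain lists stand in for tensors.

```python
import re

# Assumed shape of one operation: operator, number, then an optional
# uppercase L or G placed directly after the number.
OPERATION = re.compile(r"([*/])(\d+(?:\.\d+)?)([LG]?)")

def apply_factor(emb, op_text):
    """Apply e.g. '*1.5L', '/2G' or '*2' to an {"L": ..., "G": ...} embedding."""
    m = OPERATION.fullmatch(op_text)
    if m is None:
        raise ValueError(f"cannot parse operation: {op_text!r}")
    op, number, target = m.groups()
    factor = float(number)
    if op == "/":
        factor = 1.0 / factor
    targets = (target,) if target else ("L", "G")  # no suffix: both parts
    return {part: [[x * factor if part in targets else x for x in vec]
                   for vec in vecs]
            for part, vecs in emb.items()}

emb = {"L": [[1.0, 2.0]], "G": [[4.0]]}
print(apply_factor(emb, "*0G"))    # keeps L, zeroes G
print(apply_factor(emb, "*1.5L"))  # scales only L
```

A lowercase suffix such as *2g does not match the pattern and is rejected, in line with the "only literal uppercase" rule.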

The documentation is not updated yet. Can you test everything and make sure it is working as you might expect, and that nothing got broken?

silveroxides commented 5 months ago

Everything seemed to be working well but at one point, whatever I put in negative prompt became positive instead for some reason. Gonna investigate it some more. Been doing all kinds of crazy stuff though so it does seem to be working overall

silveroxides commented 5 months ago

So yeah, things are working as they should. One suggestion, though: in addition to placing the converted safetensors embedding in a subfolder when saving, add a suffix to its name. Without that, an SDXL embedding sharing its name with an SD 1.5 embedding will not show up in extensions such as tag autocomplete; it shows only as the SD 1.5 version. I have gotten used to naming mine with suffixes '_xl' and '15', but something like 'vXL' and 'v1' would be clearer.

aleksusklim commented 5 months ago

Why use a suffix if you naturally cannot have both the SD1 and SDXL versions loaded in WebUI at the same time?

silveroxides commented 5 months ago

Because when an SDXL model is loaded, the extension a1111-sd-webui-tagcomplete is unable to differentiate between the two, since it is only used for aliasing and quick access to embeddings, LoRAs and such through the prompt. So if two embeddings have the same name, it displays it as an SD 1.5 embedding. In the image I have an SDXL model loaded, and I am using the extension in the prompt while displaying the actual available SDXL embedding; you can see that the one with the exact same name is displayed as a v1 Embedding even though there obviously is an XL one available. That is because the extension is not meant to check the loaded model or anything like that. It just performs aliasing and prompt shortcuts for embeddings and extra networks. You will have to excuse the name, but it is the only one left that I had not suffixed. Hope this explains it. Otherwise I suggest you check out the extension I mentioned so you get first-hand experience.

aleksusklim commented 5 months ago

the extension a1111-sd-webui-tagcomplete is unable to differentiate between the two

And so what? The embedding is there, and it will be used in generation.

It is just performing aliasing and prompt shortcuts for embeddings and extra networks.

That extension should not list embeddings that are not compatible with the current model, because this is a lie that they are usable: WebUI would not throw any errors but instead will take the name literally as text, without substitution.

Showing the wrong type of the embedding because of duplicated name is not a bigger lie!

I have gotten used to naming mine with suffixes '_xl' and '15', but something like 'vXL' and 'v1' would be more clear.

Why rename them, if it is more convenient for prompts to keep general embedding names, which lets you swap models without changing the prompt?

For example, if your SDXL embedding of a furry dog boy is catgirl1 and you have its L part stored as catgirl1 too, then your prompt would work regardless of what the current model is, SDXL or SD1.

silveroxides commented 5 months ago

Yeah I will just head over to that extensions repo and ask them to change their entire way of fetching embedding/extra networks names.

I have 2600 embeddings. If I gave the XL and v1 variants the same name, it would currently just display as v1 in that menu, and I would have no way of knowing whether it has a version for each architecture or is one I have yet to convert. So no, there is no convenience in naming them the same in that context. I would, however, understand the convenience for casual users who do not use EM to construct highly complex embeddings through multiple intermediary steps like I do.

aleksusklim commented 5 months ago

Yeah I will just head over to that extensions repo and ask them to change their entire way of fetching embedding/extra networks names.

You may backlink here when you do; meanwhile I will be updating my docs for the new syntax…

silveroxides commented 5 months ago

The tagcomplete issue has been resolved.

By the way, my PR has been merged to the webui dev branch. It is now possible to unlock the clip skip option for CLIP L when using SDXL, which can bring some benefits, especially if combined with prompt editing timelines and this extension. Link to the pull request if you want to take a look.