klimaleksus / stable-diffusion-webui-embedding-merge

Extension for AUTOMATIC1111/stable-diffusion-webui for creating and merging Textual Inversion embeddings at runtime from string literals.

embedding weights work differently? #2

Open placeboyue opened 1 year ago

placeboyue commented 1 year ago

I tried creating a new embedding from one I use with a 0.6 weight all the time. The results from the newly generated embedding (using the syntax shared here, and trying many different numbers from 0.9 down to 0.2 or so) differ a lot from using the original embedding at 0.6 weight, as clearly seen by comparing generations with the same seed.

Is there a structural difference between the operation done by the normal weights and the one applied by EM? Is there a way to save the original 0.6 weight just as it is?

An example is below. While the style is still there, the resulting image is completely different, and depending on the weight used, it can work nearly the same as using the original at weight 1.

(attached comparison images: 00102-4041420028, 00101-4041420028)

placeboyue commented 1 year ago

BTW, don't get me wrong: many times the results from the new embeddings created with this are better than the original 0.6-weighted ones. They retain the composition of the full-weight image while un-breaking the overtrained glitches, whereas the 0.6 weight I use loses a lot of the original style. I just want to understand exactly how this works and how to use it. This is almost certainly not a bug or something to be fixed.

(attached X/Y grid image: xyz_grid-0024-3774065690)

aleksusklim commented 1 year ago

By "weight" do you mean multiplication syntax <'nardostyle'*0.6> ?

If so, then yes, it is working as intended, and it has nothing to do with the attention set by round parentheses.

Recently I inspected the WebUI sources to better understand how CLIP works. Here is my updated understanding of the complete process, though I may be wrong in some details:

1) Your positive and negative prompts are parsed and tokenized. Each string token (which can be a word, a part of a word, or even a part of a Unicode character) is replaced with a defined positive number. All internal special syntax is processed before this tokenization.
2) When embeddings are found (by tokenizing their names and then searching for that sequence of token indices in your prompt), they are left as-is, just virtually taking space in the array according to each embedding's vector count.
3) The array of token indices is written into several 75-token parts, each prefixed with a special start token, ended with a final token, and padded to 77 with a pad token. So if your prompt fits in 75 tokens, it consumes one full tensor of 77 elements (start + text + end). (A rough tokenizer sketch follows this list.)
4) If more than one part was created, the parts are processed independently. WebUI splits your text at the last comma that makes the current part fit into 75 tokens (or hard-splits if no convenient comma was found); this splitting works just like an implicit BREAK inserted into your prompt, so by putting BREAKs yourself you gain precise control over it.
5) Each token index in each prompt part is replaced with its defined vector (which is also sometimes called an embedding). Those vectors have length 768 for SD1 and length 1024 for SD2.
6) Each TI embedding is replaced with its actual vectors too. So now we have several prompt chunks of 77 vectors each, where each vector holds 768 or 1024 floating-point numbers.
7) Each prompt chunk is sent to CLIP (77x768) or OpenCLIP (77x1024 for SD2), to its "transformer" function. This is the actual neural network that takes your vectors and transforms them. The output of the final layer of the CLIP model is again 77x768, so you get a tensor of the same shape, but now filled with different numbers.
8) Originally, the CLIP model was designed to project text vectors and image vectors into the same latent space. This is done by taking only one 1x768 vector from the last layer – the one at the position of the end-of-text token (so it "knows" everything before it too) – and the output was just a single vector with 768 numbers. But Stable Diffusion uses the "frozen CLIP" approach, taking raw neuron values instead of the projected low-dimensional result.
9) Here Clip Skip comes into play: WebUI grabs the 77x768 output either from the last layer of the CLIP model or from any specified previous layer. The thing is, the CLIP transformer has all layers of the same shape, but OpenCLIP (for SD2) does not, so Clip Skip is ineffective with it. Anyway, we now have prompt chunks of the same 77x768 (or 77x1024) shape as before, but transformed.
10) Each of the 77 vectors is multiplied by its "attention" weight – this is what you set with parentheses in the prompt! The default weight is 1.0, and it is carried here by index from the original token array. So by using attention weights in parentheses, you are multiplying the transformed vectors at the positions of those tokens.
11) After attention multiplication, the prompt chunk is normalized by multiplying it by the ratio of means. The comment in the code states: "restoring original mean is likely not correct, but it seems to work well to prevent artifacts that happen". This reweights attention along the prompt chunk, so if you put a large weight on one word, it will slightly lower the weight of all other words too, to get a more balanced output. (See the sketch after the next paragraph.)
12) Finally, all of those chunks are merged together and sent to the cross-attention layers of the U-Net.
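To make steps 1-3 concrete, here is a rough sketch (not WebUI's actual code) using the Hugging Face tokenizer for the CLIP model that SD1 uses; the prompt string is just a made-up example:

```python
from transformers import CLIPTokenizer

# Tokenizer for SD1's text encoder (openai/clip-vit-large-patch14).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of a cat wearing a tiny wizard hat"
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(ids, len(ids))   # one positive integer per token; a chunk holds up to 75 of them

# Wrap one 75-token part into a 77-element chunk: start + text + end, padded to 77.
# (Here the end-of-text id is reused for padding; which id serves as padding
# differs between SD1 and SD2.)
chunk = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
chunk += [tokenizer.eos_token_id] * (77 - len(chunk))
print(len(chunk))      # 77
```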
Actually, the whole process is more complicated, because the prompt-scheduling syntax requires supplying the U-Net with different transformed prompts on different steps, but WebUI never calls CLIP after the diffusion process has started – so it has to pre-compute and cache the correctly crafted prompts ahead of generation.
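Going back to steps 10 and 11: below is a minimal sketch of that weighting and mean restoration, assuming z is one transformed chunk (77x768 for SD1) and multipliers holds the per-token attention weights produced by the parentheses syntax. It mirrors the idea of WebUI's hijacked CLIP code, not its exact implementation:

```python
import torch

def weight_and_renormalize(z: torch.Tensor, multipliers: torch.Tensor) -> torch.Tensor:
    original_mean = z.mean()
    z = z * multipliers.unsqueeze(-1)       # step 10: scale each token's transformed vector
    new_mean = z.mean()
    return z * (original_mean / new_mean)   # step 11: restore the original mean of the chunk

z = torch.randn(77, 768)                    # pretend output of the CLIP transformer
multipliers = torch.ones(77)                # default attention is 1.0 everywhere
multipliers[5] = 1.4                        # e.g. (word:1.4) for the token at position 5
weighted = weight_and_renormalize(z, multipliers)
```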

Here are the conclusions from the above:
1) Parts separated by BREAK are sent to CLIP sequentially and cannot "see" each other.
2) Parentheses attention is applied after CLIP has already seen everything, even if something was zeroed by (word:0).
3) Attention can be adjusted for each part of the BREAK statement independently.
4) EmbeddingMerge works in the TI embedding phase, before CLIP even starts! (A toy contrast is sketched after this list.)
5) Clip Skip has no effect on EmbeddingMerge (even though I reset/restore it during my calls – it turns out that was unneeded).
6) EmbeddingMerge directly affects how CLIP will transform the prompt, but since it aggressively uses normalization, slightly changing the multiplier often has no real effect.
7) Merging with EM is not the same as merging with BREAK, since with EM there will be only one CLIP transformation instead of two separate ones.
8) You cannot have an attention weight inside a merge expression, but it's fine to use merge expressions inside attention parentheses.
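A toy contrast of conclusions 4, 6 and 7, with a made-up nonlinear stand-in for the CLIP transformer (the real one is a full neural network, so the numbers here are meaningless – only the order of operations matters):

```python
import torch

torch.manual_seed(0)
W = torch.randn(768, 768)
clip_transform = lambda x: torch.tanh(x @ W)   # hypothetical stand-in for CLIP's transformer

vectors = torch.randn(77, 768)                 # raw token/TI vectors of one prompt chunk

# EmbeddingMerge's <'emb'*0.6>: scale the raw vectors BEFORE the transformer
# (for brevity the whole chunk is scaled here, not just the embedding's rows).
pre_scaled = clip_transform(vectors * 0.6)

# Attention weight (emb:0.6): transform first, THEN scale the output
# (followed in WebUI by the mean restoration from step 11).
post_scaled = clip_transform(vectors) * 0.6

print(torch.allclose(pre_scaled, post_scaled))  # False: the transformer is nonlinear
```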

Many things in machine learning are either "just work / magic", or "turns out".
For example, turns out we can read the second-to-last layer from CLIP instead of the last layer – and it will just work.
Turns out we can multiply the output of the transformer to apply attention to the words at the same token positions (even though everything was already transformed to mean something different).
Turns out post-multiplying by zero does not eliminate the subject or style, but pre-multiplying the embedding vectors erases the subject without hurting the composition, and it just works…

Personally, I prefer to put favorite artist names into a null-attention block (X:0), to get their feel for anatomy and composition without introducing an oil-painting style or particular clothes/environment. I heavily use BREAK when inpainting to fix the anatomy of a character – leaving the original prompt as-is but adding something like BREAK ((detailed hand with accurate fingers)) – so the overall style is inherited from what it was, while the attention goes to the specific body part.

I often use the EM tab to inspect my prompt, to make sure it's read correctly and to know the exact token count and meaning. Embedding merge syntax can also be used to simplify X/Y Plot (Prompt S/R) with complex terms, like <'ruu_(tksymkw)'> – this is a particular nickname/tag of an artist on Danbooru, and good luck prompting him properly, especially since _( is literally one token…

I have a feeling that dividing overtrained TI embeddings by their mean (or better, choosing the multiplier so that the Std value looks more like a normal one) might help them generalize, even during the training itself. There might be a lot of experiments ahead – that's why I made this extension!
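A hypothetical helper for that idea (the function name and the target Std value are my own guesses, not something from the extension; measure a typical Std on your model's token embeddings rather than trusting the constant here):

```python
import torch

def rescale_embedding(vectors: torch.Tensor, target_std: float = 0.014) -> torch.Tensor:
    # Scale an overtrained TI embedding so its Std matches a "normal-looking" value;
    # dividing by the mean instead would be: vectors / vectors.mean().
    return vectors * (target_std / vectors.std())

# Typical WebUI-trained .pt embeddings store their vectors under ["string_to_param"]["*"]:
# data = torch.load("overtrained.pt", map_location="cpu")
# data["string_to_param"]["*"] = rescale_embedding(data["string_to_param"]["*"])
```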

Other than this, people aren't using EmbeddingMerge much. There are very few cases where you need to merge concepts instead of just describing them with words. It does not bind properties: for example, you may try to join "long hair" and "blonde hair" into "<'long'+'blonde'> hair", but it may or may not be successful, while "long blonde hair" or "long_hair, blonde_hair" could give better results on different models.

BTW, I realized that the syntax <'comment':0> (this is not multiplication; it is "set vector count to zero") can be used to comment out a part of the prompt! Some people were seeking an extension specifically for this.

Also, I'm upset about the lack of extension queue management inside WebUI: for example, there is no way for the user (nor for the author of an extension) to choose which extension should be applied first at each phase (prompt parsing, batch generation, U-Net hooking, etc.). This is why I had to ditch my original {'text'*0.5} syntax (changing it to <>) – just because it was conflicting with the much more popular sd-dynamic-prompts: https://github.com/adieyal/sd-dynamic-prompts/issues/212#issuecomment-1405465808

aleksusklim commented 1 year ago

Wow, something interesting: https://github.com/hnmr293/sd-webui-cutoff/issues/5