klimaleksus / stable-diffusion-webui-embedding-merge

Extension for AUTOMATIC1111/stable-diffusion-webui for creating and merging Textual Inversion embeddings at runtime from string literals.
The Unlicense

[Question] Difference words vs embeddings #9

Open miasik opened 9 months ago

miasik commented 9 months ago

I have these phrases in my negative prompt field: low quality, low resolution, so the word "low" appears there twice.

Is there a difference if I convert them to inline embeddings like these: <'low' + 'quality'>, <'low' + 'resolution'>?

aleksusklim commented 9 months ago

Yes, there is a big difference.

Each prompt (line of text) is first converted to tokens (an array of integers), and those tokens are converted to embeddings (an array in which each element is itself a vector of floats). This step is straightforward, and it is where Embedding Merge works: it adds or multiplies those embedding vectors, not words or their tokens.
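
To make that pipeline concrete, here is a minimal sketch in plain Python. The vocabulary, token ids and 3-dimensional vectors are invented for illustration (real CLIP embeddings are far larger), and `tokenize`, `embed` and `merge_add` are hypothetical helpers, not the extension's actual API.

```python
# Toy illustration only: the vocabulary, ids and vectors are invented.
VOCAB = {"low": 0, "quality": 1, "resolution": 2}
EMBEDDING_TABLE = [
    [1.0, 0.0, 0.0],  # "low"
    [0.0, 1.0, 0.0],  # "quality"
    [0.0, 0.0, 1.0],  # "resolution"
]

def tokenize(text):
    """Text -> list of integer token ids (one id per known word)."""
    return [VOCAB[word] for word in text.split()]

def embed(tokens):
    """Token ids -> list of embedding vectors (one vector per token)."""
    return [list(EMBEDDING_TABLE[t]) for t in tokens]

def merge_add(a, b):
    """Element-wise sum of two equally long lists of vectors."""
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]

# The plain phrase keeps one vector per word ...
plain = embed(tokenize("low quality"))
# ... while the merged form collapses both words into a single vector.
merged = merge_add(embed(tokenize("low")), embed(tokenize("quality")))

print(len(plain), len(merged))
print(merged)
```

The point of the sketch is only that the plain phrase and the merged form hand CLIP differently shaped input: two separate vectors versus one summed vector.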

But! Those embeddings (representing all the words of the prompt) aren't fed to Stable Diffusion directly. Instead, a CLIP or OpenCLIP transformer network is used to recalculate this two-dimensional array.

It is CLIP that "understands" what the text means. The numbers change drastically, no longer representing simple words but their meanings.

The transformed array represents the high-level prompt, ready to be sent to the U-Net of Stable Diffusion. And this is where two other control methods are applied: prompt weighting and prompt merging.

When you write (green) hair, you are not just increasing the weight of the "word": you are changing the weight of the vectors that were output by CLIP. They also contain positional information and semantic relations, and could have been influenced by Clip Skip.

When your prompt is longer than 75 tokens, or if you put BREAK explicitly, your prompt is split, and its parts are transformed by CLIP independently of each other. You then have several valid "prompts" (and parts of each can be weighted independently).

Before they are sent to Stable Diffusion, those parts are summed element-wise, so each vector becomes a sum of the corresponding vectors, each of which has already been transformed by CLIP.

So here you are not merging words, but their meanings. green hair BREAK blue eyes becomes something that simultaneously means both "green hair" and "blue eyes". (This doesn't prevent SD from generating blue hair with green eyes, because wrong property binding is an inherent problem, both in the U-Net and in CLIP itself!)

Embedding Merge works at a much lower level, merging things at the level of "words", before CLIP. This means that the merged parts change their properties and no longer represent what they were.

<'green hair'+'blue eyes'> is the same as <'blue hair'+'green eyes'> or <'green'+'blue'><'eyes'+'hair'>, and at the end we will see what CLIP thinks it is. So probably the first word is a color, and the second word is a part of the face.
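
The equivalence of those three expressions follows from the addition being done position by position. A minimal sketch, assuming every word is exactly one token and using invented 2-dimensional vectors (`embed` and `merge_add` are hypothetical helpers, not the extension's API):

```python
# Toy 2-D vectors, invented for illustration; one token per word is assumed.
VEC = {
    "green": [1.0, 0.0],
    "blue": [0.0, 1.0],
    "hair": [2.0, 2.0],
    "eyes": [3.0, -1.0],
}

def embed(text):
    """Text -> list of vectors, one per word."""
    return [list(VEC[word]) for word in text.split()]

def merge_add(a, b):
    """Add two equally long lists of vectors position by position."""
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]

first = merge_add(embed("green hair"), embed("blue eyes"))
second = merge_add(embed("blue hair"), embed("green eyes"))
# Two one-token merges placed side by side form a two-vector sequence.
third = merge_add(embed("green"), embed("blue")) + merge_add(embed("eyes"), embed("hair"))

print(first == second == third)
```

In all three cases position 0 holds green+blue and position 1 holds hair+eyes, so CLIP receives identical input and cannot know which color was meant for which object.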

On the other hand, <'green'+'hair'> is something different, meaning both a color and an object. Unfortunately, this doesn't help CLIP or SD in any way to separate or localize objects together with their properties.

The importance of CLIP is huge: it transforms groups of words together, and their meaning may change. In your example, low quality is a concept of bad generation, while low and quality on their own mean different things. If you put low in the negative prompt, what would it actually negate? Will it make buildings taller? It's even worse with quality: don't you want the concept of "quality" to be positive, not negative?

So what I see is <'low'+'resolution'> being something that means both "low" and "resolution" simultaneously, but not "low resolution". On the other hand, <'bad'+'low'><'quality'+'resolution'> might work more or less as expected (just be sure to check the token lengths of your vectors to account for alignment).

Still, CLIP tends to understand even messed-up concepts, so <'eyes'+'blue'> might work too, and my extension serves more of a research purpose than a practical one.

miasik commented 9 months ago

Perfect! I couldn't have imagined I would get such a thorough answer! Token length is my second question, related to my favorite trick of mixing faces. It works like a charm with the usual [name1 | name2], but mixing tokens is unclear to me. For example, Laura Vandervoort and Katheryn Winnick have 4 vs 5 parts: [image] Should I do something more than just <'Laura Vandervoort' + 'Katheryn Winnick'> to make the mix work correctly?

aleksusklim commented 9 months ago

By default, the shorter string is padded with zero vectors. The side effect is that the "amount" of information in those positions is low.

In your example, adding a 4-token text to a 5-token text gives you 5 tokens: the first 4 are merged (and thus have double length unless you put =/2 at the end), while the last token is taken unmodified from the second text (and would be halved in length if you did put =/2 at the end).
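
The padding behavior described above can be sketched as follows. This is a toy model under stated assumptions: the 2-dimensional vectors are invented, every token uses the same vector so the doubling is easy to see, and `pad_with_zeros`, `merge_add` and `scale` are hypothetical helpers rather than the extension's code.

```python
def pad_with_zeros(vectors, length):
    """Append zero vectors until the sequence has `length` entries."""
    dim = len(vectors[0])
    return vectors + [[0.0] * dim for _ in range(length - len(vectors))]

def merge_add(a, b):
    """Pad the shorter sequence with zeros, then add position by position."""
    length = max(len(a), len(b))
    a, b = pad_with_zeros(a, length), pad_with_zeros(b, length)
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]

def scale(vectors, factor):
    """Multiply every component by `factor` (the role =/2 plays above)."""
    return [[x * factor for x in v] for v in vectors]

name_a = [[1.0, 1.0] for _ in range(4)]  # stands in for a 4-token name
name_b = [[1.0, 1.0] for _ in range(5)]  # stands in for a 5-token name

summed = merge_add(name_a, name_b)
halved = scale(summed, 0.5)

print(summed)
print(halved)
```

In `summed`, the first four positions are doubled and the fifth is untouched; in `halved`, the first four return to normal magnitude while the fifth drops to half.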

The good news is that, firstly, absolute vector length (in the Cartesian sense) is not too important; SD tolerates 0.5<=X<=3 just fine: half a dog or thrice a dog is still a dog. Secondly, zero vectors do not mess up the understanding of the general concept, and adding them causes even fewer artifacts than putting extra commas here and there.

I've heard that BREAK gives very good person identity merging, like: your main prompt BREAK person1 BREAK person2 BREAK masterpiece etc. Sometimes you would need to account for alignment too, if you repeat the same prompt in those parts but with a changed subject.

miasik commented 9 months ago

So much information, it's hard to take it all in at once. You have an example: 'kuvshinov' + 'kuvshinov':-1 + 'kuvshinov':-2 + 'kuvshinov':-3 =: 1 As I understand it, in the example you make a single vector from a complex last name. If such an example exists, it must make some sense. What is the point of it? Should I convert all my complex names into single-vector ones like this? 'Vandervoort' + 'Vandervoort':-1 + 'Vandervoort':-2 =: 1

aleksusklim commented 9 months ago

> As I understand it, in the example you make a single vector from a complex last name.

And it gave nothing! Concepts are destroyed by taking their intermediate tokens. (The example just showed how to do it, not that it would be useful.)

> Should I convert all my complex names into single-vector ones like this?

You might have more luck with <'first name'+'last name'>, but probably not with that either. What do you want to achieve? BREAK is better both at merging and at shortening prompts, so you can describe a character and the scene separately, for example.

One practical application of EM is simply making chimeras out of simple objects (as I showed in the linked Discussion); it can be fun.

But even then, my preliminary tests with SDXL show that, for example, <'cat'*X+'girl'*Y> generates either a cat (X≈1, Y≈0.5), or a girl (X<0.5), or a girl with a cat (X==Y), but not a catgirl! I got kids with feline ears in very specific ranges like X=0.87, and that was really unstable and seed-dependent.
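
Arithmetically, <'cat'*X+'girl'*Y> is just a weighted sum of two vectors. A minimal sketch with invented 2-dimensional vectors (the `weighted_sum` helper and the specific weight pairs are illustrative assumptions, not values taken from the tests above):

```python
# Toy 2-D vectors standing in for single-token embeddings (values invented).
CAT = [1.0, 0.0]
GIRL = [0.0, 1.0]

def weighted_sum(a, x, b, y):
    """Compute a*x + b*y component by component."""
    return [x * p + y * q for p, q in zip(a, b)]

# Illustrative weight pairs: cat-heavy, girl-heavy, balanced.
for x, y in [(1.0, 0.5), (0.4, 1.0), (1.0, 1.0)]:
    print(x, y, weighted_sum(CAT, x, GIRL, y))
```

The resulting vector moves smoothly between the two inputs as X and Y change, but what CLIP and the U-Net make of an in-between vector is not smooth at all, which matches the unstable results described above.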

miasik commented 9 months ago

I can't use typical mixing with [ | | ] because the Civitai automoderator reads prompts and sends such images to a long queue. Using EM lets me avoid that check, and I'm looking for information on how to use EM to achieve the same visual effect as [ | | ].

aleksusklim commented 9 months ago

Can't you just use [ <'one'> | <'two'> | <'three'> ]?

miasik commented 9 months ago

Sure, I can. I've already checked, and it works the same. I'd just like to learn something new, something useful ;-)

miasik commented 9 months ago

Actually, joining vectors and word switching work differently. A disadvantage of [|] lies in the persons: at each step, the current person tries to change not only the face but the whole image. <''> + <''> behaves in a more "healthy" way, and as a result the final image might be more "consistent".
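
The difference can be sketched as a per-step schedule. This assumes the webui's documented behavior for alternating words (option 1 on step 1, option 2 on step 2, and so on in a cycle); the helper names and the string stand-in for a merged vector are hypothetical.

```python
def alternating_option(options, step):
    """Return the option that [a|b|...] makes active at a 1-based step."""
    return options[(step - 1) % len(options)]

def merged_prompt(options):
    """With <'a'+'b'>, the same merged input is used at every step."""
    return "+".join(options)

faces = ["name1", "name2"]
steps = range(1, 7)

alternating = [alternating_option(faces, s) for s in steps]
merged = [merged_prompt(faces) for _ in steps]

print(alternating)
print(merged)
```

With alternation the conditioning flips between two complete prompts from step to step, while the merged form presents one fixed conditioning throughout, which is one plausible reason for the more consistent images.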

miasik commented 9 months ago

One more thing to keep in mind: all parts inside [|] have different weights by their nature, but inside <''+''> their weights are the same. So if I have a good face made with [|], I can't get the same one just by swapping the constructions; I have to play with the weights inside <''+''>.

aleksusklim commented 9 months ago

Have you tried BREAK?

a female model Cameron Diaz BREAK a female model Lucy Liu

or

a female model BREAK Cameron Diaz BREAK Lucy Liu

(You can still hide words with EM syntax if needed.)

miasik commented 9 months ago

BREAK? Why would I need to use it? I want to mix faces, not separate them.

aleksusklim commented 9 months ago

Try it!

miasik commented 9 months ago

> Try it!

Actually, I had used it before, but rarely. This is my latest work with it: https://civitai.com/images/5744964

miasik commented 9 months ago

> Try it!

I must say that you've shown me a way to use it, and I'm going to use it more often. Thanks again! ;-)

aleksusklim commented 9 months ago

> I'm going to use it more often.

Those who can use BREAK often wonder why nobody else is using such power!