hnmr293 / sd-webui-cutoff

Cutoff - Cutting Off Prompt Effect

Allow padding token to be a zero-vector embedding, proposing "0" #10

Open aleksusklim opened 1 year ago

aleksusklim commented 1 year ago

Working on my extension Embedding Merge, I found that if you zero-fill a TI embedding and replace part of the prompt with it, you get a result completely without the affected concept, with almost no influence on the other concepts in the prompt.

Actually, I implemented a pre-multiplier that you can attach to any word to change its weight before the CLIP phase. So, by multiplying by zero you can get rid of anything, while still having "padding" for the other words. From my experiments, such zero-padding works better than padding with commas, exclamation marks, the pad_token and so on, especially when merging parts of different lengths (the primary purpose of my extension).

Would it be possible to implement zero-vector padding in your extension too? Then we could compare whether it works better or not.

I propose using the number "0" for the padding token: token 0 stands for "!", while token 256 stands for "!</w>", which is what the tokenizer actually produces every time "!" is parsed, so token #0 is impossible to enter in a prompt anyway. Currently, using "0" in your extension gives exactly the same result as using "256". So token 0 is effectively dead and can be redefined to produce zero-filled vectors.
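For reference, this is easy to verify with the `transformers` CLIP tokenizer (a quick sketch, assuming the stock `openai/clip-vit-large-patch14` vocabulary, which is the one WebUI uses):

```python
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Token 0 is "!" without the end-of-word marker; token 256 is "!</w>".
print(tok.convert_ids_to_tokens([0, 256]))  # ['!', '!</w>']

# Typing "!" in a prompt always tokenizes to 256 (between BOS and EOS),
# so token #0 can never be produced from prompt text:
print(tok("!").input_ids)  # [49406, 256, 49407]
```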

hnmr293 commented 1 year ago

Thanks for your comment. Very interesting. Let me confirm that my understanding is correct; I thought the implementation would be as follows:

  1. replace the token whose effect we want to cut off with ! (token number 0).
  2. pass the replaced prompt to CLIP.
  3. fill all elements of the 768-dimensional (or 1024-dimensional) vector at the location of ! with 0.

Or, in step 1 above, should the prompt be passed to CLIP as-is, without replacing the token with !, and then zero-filled?

aleksusklim commented 1 year ago

No: you should zero-fill the vector at the position of your pad token (which you can indeed hardcode as 0), instead of converting that token to its corresponding embedding vector.

I'm not quite sure how and when you convert tokens to vectors (before passing them to CLIP), but you need to zero them out just before CLIP's transform/forward is called.

In my case, I replaced the target words with a temporary TI embedding created on the fly and filled with zeros. WebUI then picks it up like any other embedding and uses its contents instead of converting the token indices to their vectors.
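Roughly like this (a simplified sketch of the idea, not my actual code; I'm assuming WebUI's `Embedding` class and embedding database here, and the exact registration call may differ between WebUI versions):

```python
import torch
from modules import sd_hijack, shared
from modules.textual_inversion.textual_inversion import Embedding

def register_zero_embedding(name: str, length: int, dim: int = 768):
    # `length` zero vectors of size `dim`; whenever `name` appears in a
    # prompt, WebUI substitutes these vectors instead of looking up the
    # token embeddings for the underlying token indices.
    vec = torch.zeros(length, dim)
    emb = Embedding(vec, name)
    sd_hijack.model_hijack.embedding_db.register_embedding(emb, shared.sd_model)
    return emb
```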

I believe you don't mess with embeddings, so putting an actual TI embedding into your "padding token" would not work. This means the 0-case would have to be implemented directly somewhere in your code. (In the worst case, if you don't have convenient control over the initial token assignments, you could as a last resort search for the 1*768 vector just before the CLIP call that equals the vector of token #0; but again, your extension currently replaces token 0 with token 256 silently, so as-is it will not work with 0 this way.)

aleksusklim commented 1 year ago

If you are doing this from the "text" point of view (instead of the "token index" point of view), this won't work, because token 0, converted to its corresponding string, gives !, which is parsed back as token 256. There is no text that corresponds to token 0, and obviously none that corresponds to a zero vector. In my case, TI embeddings can make any text correspond to any vector, which simplified the work for me.

But if you first convert the text to token indices, then (and only then) replace the target element with the chosen padding token (which can be 0 and will stay 0) as a number, and finally convert the indices to 768/1024-dimensional vectors before passing them to CLIP, then catching #0 becomes easy: you only have to zero-fill the resulting vector, which comes from `shared.sd_model.cond_stage_model.wrapped.transformer.text_model.embeddings.token_embedding` or `shared.sd_model.cond_stage_model.wrapped.model.token_embedding.wrapped`.
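Something along these lines (a minimal sketch for the SD1.x path quoted above; `embed_with_zero_padding` is a hypothetical helper for illustration, assuming the batch of token indices already contains 0 at the padding positions):

```python
import torch
from modules import shared

def embed_with_zero_padding(token_ids: torch.Tensor) -> torch.Tensor:
    # token_ids: (batch, 77) tensor of token indices, padded with token #0
    text_model = shared.sd_model.cond_stage_model.wrapped.transformer.text_model
    vectors = text_model.embeddings.token_embedding(token_ids)  # (batch, 77, 768)
    # Zero-fill every position that holds padding token #0, so it contributes
    # a true zero vector instead of the learned "!" embedding:
    vectors[token_ids == 0] = 0.0
    return vectors
```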

hnmr293 commented 1 year ago

Thanks, I understand now. You are talking about token_embedding in CLIP, and the zero-fill is applied to token_embedding, not to the final outputs of CLIP. Right? I was only thinking about the inputs (prompts) and the final outputs (77x768 vectors). Sorry.

I implemented it in the zerofill branch. It seems to be working well (see seed=50).

If you want to test this, please check out the zerofill branch and enable Zero-fills under Details.

I think it needs more testing to see whether it really works, but if there are no errors, I will merge it into the main branch in a few days.

Thank you!

aleksusklim commented 1 year ago

Thank you! Tried it.

Full-body photo of blonde girl, with purple bow in her hair, dressed in red jacket, with yellow t-shirt, and green shorts, showing her black tights, in white shoes, standing on brown floor, on blue background. Detailed masterpiece.
Negative prompt: out-of-frame, cropped, close-up
Steps: 24, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 2188062044, Size: 512x768, Model hash: e04b020012, Model: rpg_V4

Columns: three seeds.

First row: original.
Second row: enabled at default settings.
Third row: Zero-fills checked.

Result 1 ![1](https://user-images.githubusercontent.com/13969463/226192608-c154525e-136e-4bbb-a1f3-50145148b790.jpg)

Changing weight to 1.0.

First row: Padding token = !
Second row: default padding.
Third row: Zero-fills checked.

Result 2 ![2](https://user-images.githubusercontent.com/13969463/226192620-2e890e9f-b6f7-42c6-aedf-10f146dd61b1.jpg)

Enabling Cutoff strongly, still at 1.0 weight.

First row: Padding token = !
Second row: default padding.
Third row: Zero-fills checked.

Result 3 ![3](https://user-images.githubusercontent.com/13969463/226192628-0512745c-b9da-4e13-a0aa-668fcc5df320.jpg)

To me it looks better, but not as good as everybody is hoping for. Do you have any insight into why the colors still get mixed up sometimes?

Is it related to the CLIP transformation? Or to general Stable Diffusion randomness in interpreting prompts?

Also, what is Cutoff strongly? I didn't quite understand it from reading the code. And what actually is weight? You didn't explain it in https://github.com/hnmr293/sd-webui-cutoff/issues/5. (Is lerp/slerp effective only when weight is neither 0.0 nor 1.0?)

Finally, you've implemented zero-filling as a stand-alone checkbox, which doesn't seem right to me. Consider the following:

The most natural way is to use zero-filling when nothing is typed in the input field (which you could make default to _), just as an embedding's initialization text is filled with zeros when the default * is deleted: https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/d7aec59c4eb02f723b3d55c6f927a42e97acd679

But since you accept token indices as well as literal tokens (which means I already can't use "0" to mean the literal text 0), and since 0 currently means the same as 256 (unless you tweak the code to respect exact token indices instead of converting them to text), it would be much better to use 0 for the zero-filling mechanics.

If you are unsure about the purely visual debug logging, I believe you can keep rendering it as _, as you do currently, and just print a line saying that zero-filling is in effect. Also, it's worth renaming Padding token (ID or single token) to add something like put 0 to use zero-filling.