vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: 'key "cond_stage_model.logit_scale" not found in TensorDict with keys' error after trying to merge a merged model #2734

Closed Xeltosh closed 9 months ago

Xeltosh commented 10 months ago

Issue Description

I did some merging of some of the models I use, and when trying to merge one of my already-merged models, the UI gives me the error mentioned in the title, followed by an extremely long text (on the right side, which is only readable if copied and pasted into a separate file). The bad part is that no error is visible in the console; there, the merging just gets interrupted and stops.

merging setup:

happened in weighted_sum and sum_twice

weights clip enabled

ReBasin enabled with standard setting of 5

I will post the rest of the error from the UI after I get home, because I need to do some formatting so it is readable (it's all written in one line).

My suspicion is that there is some kind of minor error in one of the models I use, which only appears after merging, but I don't know how to find or fix it. Could it be an error because of my "low" RAM? I will also try the merge later on my AMD system, which has more RAM.

I can provide the model information if needed, though one is an SFW/NSFW model a friend of mine trained. The other one is on Civit.AI.

Version Platform Description

Windows 10, GTX 3050, 16GB DDR4 RAM; SD.Next is run via StabilityMatrix

Relevant log output

No response

Backend

Original

Branch

Master

Model

SD 1.5

Acknowledgements

Xeltosh commented 10 months ago

Here is the promised additional data. Tested merging it on AMD with 32GB of RAM; same error.

command line: (screenshot)

and here the error the UI gives in one line on its right side: error message.txt

AI-Casanova commented 10 months ago

I'll take a look through the pruning logic. Keys that are in one model only should be pruned before merging and returned intact after.
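The pruning idea described here can be sketched as follows (a minimal illustration with hypothetical helper names, not the actual SD.Next code): keys shared by both models are merged, while keys found in only one model pass through untouched.

```python
# Sketch of "prune before merging, return intact after" for a weighted-sum
# merge. Values here are plain floats for illustration; in a real merge
# they would be torch tensors, and the arithmetic is the same.

def split_shared_keys(a: dict, b: dict):
    """Split model `a` into tensors shared with `b` and tensors unique to `a`."""
    shared = {k: v for k, v in a.items() if k in b}
    unique = {k: v for k, v in a.items() if k not in b}
    return shared, unique

def merge_with_passthrough(a: dict, b: dict, alpha: float = 0.5) -> dict:
    """Weighted-sum merge on shared keys; keys unique to `a` are kept as-is."""
    shared_a, unique_a = split_shared_keys(a, b)
    merged = {k: (1 - alpha) * v + alpha * b[k] for k, v in shared_a.items()}
    merged.update(unique_a)  # single-model keys are returned intact
    return merged
```

A key such as `cond_stage_model.logit_scale` that exists in only one input would then survive the merge unchanged instead of tripping a lookup in the other model.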

Xeltosh commented 9 months ago

Update: while testing around to pinpoint the error, I also tried pruning the ingredients. I pruned 2 of the 3 models, and after that I could do my planned merge. "OK," I thought, "you found the error," but trying a different recipe after that still produced the error.

I am doing a sum_twice and want to merge the result of that with another model.

Though I saw something else just now: when merging with sum_twice first, the cmd shows 3 models getting loaded. When switching to weighted_sum after that, it still shows that it loads 3 models, even if only 2 are chosen. After restarting SD.Next and doing weighted_sum, it loads 2 models.

Even after restarting the UI and doing one merge after another, it breaks after the first successful merge. I can't really describe it, but even if there is a slight error somewhere in the model, it shouldn't cause the whole merging process to break down without any usable information (at least for me).

(screenshot)

Also, I don't know if it is connected to the issue, but every time I get 885 keys, the merging breaks. (screenshot)
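To find out which keys account for a count difference like 885 vs 883, one can list a checkpoint's tensor names without loading any weights. Per the safetensors file format, a `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header mapping tensor names to their metadata. A stdlib-only sketch (an illustration, not part of SD.Next):

```python
import json
import struct

def safetensors_keys(path: str) -> list[str]:
    """List tensor names in a .safetensors file by reading only its header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 giving the JSON header's length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return sorted(k for k in header if k != "__metadata__")

def extra_keys(path_a: str, path_b: str) -> set[str]:
    """Keys present in checkpoint A but not in checkpoint B."""
    return set(safetensors_keys(path_a)) - set(safetensors_keys(path_b))
```

Running `extra_keys()` on a model that merges and one that doesn't would show exactly which tensor names make up the difference.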

AI-Casanova commented 9 months ago

@Xeltosh Sorry it's taken me so long, but would you be able to pull https://github.com/vladmandic/automatic/pull/2748 to see if that is sufficient to solve your problem?

vladmandic commented 9 months ago

code is merged in dev branch.

Xeltosh commented 9 months ago

At the risk of sounding stupid: I loaded the dev branch via Stability Matrix; do I have to choose something specific? Because all the new code did was change the merged result. There was another update to it yesterday, though; did that somehow overwrite the code? The resulting model is different than before, but the error at 885 keys still happens.

AI-Casanova commented 9 months ago

What commit is it showing when you first start the server?

Xeltosh commented 9 months ago

02:38:05-010323 INFO Logger: file="C:\StabilityMatrix\Packages\SD.Next Web UI dev\sdnext.log" level=INFO size=191754 mode=append 02:38:05-012318 INFO Python 3.10.11 on Windows
02:38:07-340443 INFO Version: app=sd.next updated=2024-01-25 hash=e924cc9e

To be fair, at the moment I can't seem to reproduce the error in the dev branch, though I definitely had it once this afternoon; it doesn't show up right now. After the error appeared, I tried the merges I had done so far and found that the merge is definitely different from the same recipe done in the main branch, and sadly, in my opinion, the resulting model got worse than when done in main. I tried pruning the model in the dev branch and merging again in the main branch; the resulting model is the one I want, but sadly I can't merge anything on top of it, because the error still happens, strangely still at 885 keys. I tried just now, after writing the message above, to trigger the error again and maybe record a short video, but the error doesn't appear anymore.

As apparently no one besides me has this error, it is most likely that the model my friend made has some kind of error in it. Would it help you if I gave you a link to it? I would like to keep the old merging logic, but also have my model fixed :/

I will try something different; maybe merging the model with itself in the dev branch could fix the model without changing the content?

Xeltosh commented 9 months ago

Update: merging the model with itself in the dev branch doesn't fix the merging in main.

What I don't understand is: why can I do 1-3 merges before it starts breaking every merge afterwards?

Xeltosh commented 9 months ago

Another update: after playing around with the merges done in the dev branch, I found out that they work just fine and are as good as the old ones (just different), and that one or more of my embeddings "destroyed" the pictures I generated.

Figuring out why there is a difference would still be nice, but it is not that important anymore. Still, thanks for helping me @AI-Casanova

Xeltosh commented 8 months ago

@AI-Casanova OK, another update: my friend trained a new model via kohya and I wanted to merge again. With ReBasin active, I immediately get the aforementioned error again. I tried other models and couldn't merge them either.

BUT as soon as I deactivate ReBasin, it works. When googling, someone mentioned that in ComfyUI it is a debug message and can be ignored, though I don't know if it is completely the same issue: LINK and LINK

Soooo... maybe a handler for that message is missing? Because, as described before, the message only appears in the browser and is apparently connected with ReBasin. The console just stops working without saying anything and waits for new input. (screenshot) (screenshot)

Xeltosh commented 8 months ago

Sorry for writing again, but I don't know if you get a notification in a closed issue @AI-Casanova @vladmandic

I tried merging some other models, and when the key count comes out at 883, the merging works like before again. I also installed a standalone version of SD.Next, because I thought that maybe Stability Matrix had a bug with ReBasin, but sadly that's not it. Merging the models with themselves, or trying to convert them and fix the CLIP, didn't work either.

Stax124 commented 8 months ago

We resolved the issue, problematic tensors were cond_stage_model.logit_scale and cond_stage_model.text_projection

I made a script in case someone encounters this as well and wants to "fix" their model:

from argparse import ArgumentParser

from safetensors import safe_open
from safetensors.torch import save_file

parser = ArgumentParser()
parser.add_argument("input", type=str, help="Input file")
parser.add_argument("output", type=str, help="Output file")
args = parser.parse_args()

# Load all tensors from the model
tensors = {}
with safe_open(args.input, framework="pt") as model:
    for key in model.keys():
        tensors[key] = model.get_tensor(key)

# Remove the broken tensors (skip them if already absent)
tensors.pop("cond_stage_model.logit_scale", None)
tensors.pop("cond_stage_model.text_projection", None)

# Save the fixed model
save_file(tensors, args.output)

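A quick way to check whether a checkpoint even needs this fix is to intersect its key list with the two problematic names (a small hypothetical helper, not part of the script above):

```python
# The two tensors identified above as problematic.
BAD_KEYS = {"cond_stage_model.logit_scale", "cond_stage_model.text_projection"}

def problem_keys(model_keys) -> set[str]:
    """Return whichever of the problematic keys appear in the model."""
    return BAD_KEYS & set(model_keys)
```

With a loaded file this would be `problem_keys(model.keys())`; an empty result means the fix script is unnecessary for that model.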
Xeltosh commented 8 months ago

Yep, can confirm! Tried it on my models; I now get 883 keys while merging, and it ran through without error.

(screenshot)

vladmandic commented 8 months ago

My $0.02 without digging real deep: cond_stage_model.text_projection sounds weird to start with, as which encoder is it referring to? IMO, it should be something like cond_stage_model.clip_l.text_projection or cond_stage_model.clip_g.text_projection

Xeltosh commented 8 months ago

We have no idea where these 2 keys come from. They are somehow in there after training. They either come from the base model used while training, OR from some buggy script in the training software. I tried merging several models before using Stax's script (without regard to content compatibility). The keys shown in the command line are the best indicator of whether a model has these buggy entries or not.

Some models seem to have them, and only these specific models show the merging bug I encountered with ReBasin. As I shared before, every time I had 885 keys to merge, it bugged out; with 883 keys it worked.

Stax's script removes these 2 "faulty" keys, and I can say that so far everything works as intended.