Open dermesut opened 1 month ago
Looking into this, it seems like:
forge_additional_modules
and try generating, yielding the error.
If I wrap forge_model_reload() (where this error occurs) in a try/except block like this, I expected that it would resolve this issue. However, the issue persists despite unload_all_models() and clear_prompt_cache() being called.
def manage_model_and_prompt_cache(p: StableDiffusionProcessing):
    global need_global_unload

    try:
        p.sd_model, just_reloaded = forge_model_reload()
    except Exception as e:
        need_global_unload = True
        memory_management.unload_all_models()
        p.clear_prompt_cache()
        need_global_unload = False
        raise e

    if need_global_unload and not just_reloaded:
        memory_management.unload_all_models()

    if need_global_unload:
        p.clear_prompt_cache()

    need_global_unload = False
I think this would be the correct place to resolve the issue, just struggling to figure out what exactly needs to be fixed when this error occurs...
If I simply put these print statements...
def manage_model_and_prompt_cache(p: StableDiffusionProcessing):
    global need_global_unload

    print("shared SD Model Pre Reload:", shared.sd_model)
    try:
        p.sd_model, just_reloaded = forge_model_reload()
    except Exception as e:
        print("shared SD Model After Error:", shared.sd_model)
        raise
    print("shared SD Model After Reload:", shared.sd_model)
Printed on successful generation:
shared SD Model Pre Reload: <modules.sd_models.FakeInitialModel object at 0x000002CD83F4A200>
...
shared SD Model After Reload: <backend.diffusion_engine.flux.Flux object at 0x000002CD5BDBFD30>
Printed on error:
shared SD Model Pre Reload: <modules.sd_models.FakeInitialModel object at 0x000002CD83F4A200>
...
shared SD Model After Error: None
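The two traces above are consistent with the reload path dropping the current model before the loader runs, so a failed load leaves nothing behind. Here is a minimal, self-contained sketch of that failure mode. `ModelHolder`, `failing_loader` and `reload` are hypothetical stand-ins for illustration, not Forge's actual classes or functions:

```python
class ModelHolder:
    """Hypothetical stand-in for the object that owns the current model."""

    def __init__(self):
        self.sd_model = "FakeInitialModel"


def failing_loader():
    # Stand-in for a loader that rejects an incomplete set of modules.
    raise ValueError("missing state dict (simulated)")


def reload(holder, loader):
    # Assumed ordering: the old model is dropped before the new one is loaded.
    holder.sd_model = None
    holder.sd_model = loader()  # raises, so the assignment never happens
    return holder.sd_model


holder = ModelHolder()
print("pre reload:", holder.sd_model)
try:
    reload(holder, failing_loader)
except ValueError as err:
    print("load failed:", err)
print("after error:", holder.sd_model)
```

In this sketch the holder ends up with `None` after the failed load, which is the same shape as the "After Error: None" line printed above.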
Well, I found a solution, but I don't think this is the best solution because it seems like the idea is to flush as much information down the toilet as possible before loading models.
Using sd_model_backup = model_data.sd_model, and setting the model back if forge_loader() fails:
def forge_model_reload():
    current_hash = str(model_data.forge_loading_parameters)

    if model_data.forge_hash == current_hash:
        return model_data.sd_model, False

    print('Loading Model: ' + str(model_data.forge_loading_parameters))

    timer = Timer()

    sd_model_backup = None
    if model_data.sd_model:
        sd_model_backup = model_data.sd_model
        model_data.sd_model = None
        memory_management.unload_all_models()
        memory_management.soft_empty_cache()
        gc.collect()

    ...

    try:
        sd_model = forge_loader(state_dict, additional_state_dicts=additional_state_dicts)
    except Exception as e:
        if sd_model_backup:
            model_data.set_sd_model(sd_model_backup)
        raise e
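Stripped of the Forge specifics, the backup-and-restore idea above can be sketched as follows. This is a toy under stated assumptions: `reload_with_backup`, `bad_loader` and the `SimpleNamespace` holder are hypothetical names for illustration, and strings stand in for real models.

```python
from types import SimpleNamespace


def reload_with_backup(holder, loader):
    """Swap in a new model, restoring the previous one if loading fails."""
    backup = holder.sd_model      # keep a reference to the old model
    holder.sd_model = None
    try:
        holder.sd_model = loader()
    except Exception:
        holder.sd_model = backup  # put the old model back
        raise                     # bare raise keeps the original traceback
    return holder.sd_model


def bad_loader():
    raise RuntimeError("simulated load failure")


holder = SimpleNamespace(sd_model="previous model")

try:
    reload_with_backup(holder, bad_loader)
except RuntimeError:
    pass
restored = holder.sd_model        # still "previous model"

reload_with_backup(holder, lambda: "new model")
```

The trade-off is the one already mentioned: as long as `backup` holds a reference, Python cannot garbage-collect the old object while the new one loads, which works against freeing as much as possible up front.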
In manage_model_and_prompt_cache, dumping/reinitialising the 'real' model seems to work.
try:
    p.sd_model, just_reloaded = forge_model_reload()
except Exception as e:
    # reincarnate the model
    del sd_models.model_data
    sd_models.model_data = sd_models.SdModelData()
    raise
This is tested with a Schnell GGUF that gives a different error with missing modules (RuntimeError: Creating a Parameter from an instance of type ParameterGGUF ...), which then leads to the same 'NoneType' object has no attribute 'sd_checkpoint_info'.
@DenOfEquity Bravo! I tested this out, and this does resolve the issue.
You can omit the "as e" and the log will be the same.
In exactly one of my tests the next model load was extremely slow with lots of disk activity. Probably just because I'm using an old laptop (8GB vRAM, 16GB RAM) and was continuing to use other applications at the same time. Did you see anything similar? Otherwise, I think this is mergeable.
For me, it behaves just like a typical fresh start with nothing cached, which is a lot better than being stuck in a broken state. Not excessive load time.
However… maybe we could just make a super quick function that looks at the model params and checks that everything it needs is there? Right now it basically trashes the current models first, and only finds out whether anything is missing while loading the new ones.
I might have a chance to toy around with this idea tomorrow, but I’ll be tied up all day today.
Another edit - I imagine it’s set up like this because we can’t tell if the model has everything it needs baked in until it’s loaded? An idea to mitigate rebuilding from scratch may be to just append to a list/dict what the sd_model_checkpoint includes after it is loaded, and therefore on a subsequent load we have prior knowledge of what must be additionally included.
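A rough sketch of that "remember what the checkpoint includes" idea, purely for illustration. `REQUIRED`, `known_bundled`, `record_bundled` and `missing_components` are hypothetical names, and the component labels are placeholders rather than anything Forge defines:

```python
# Illustrative requirement set; the real one depends on the model family.
REQUIRED = frozenset({"unet", "vae", "text_encoder"})

# Filled in after a successful load: checkpoint name -> components it bundles.
known_bundled = {}


def record_bundled(checkpoint, components):
    """Remember which components a checkpoint turned out to contain."""
    known_bundled[checkpoint] = frozenset(components)


def missing_components(checkpoint, extra_modules):
    """Return what is still missing, or None if the checkpoint is unknown."""
    bundled = known_bundled.get(checkpoint)
    if bundled is None:
        return None  # never loaded before, so there is no prior knowledge
    return REQUIRED - bundled - frozenset(extra_modules)


# After a first successful load we learn the checkpoint only bundles the unet.
record_bundled("example-checkpoint", {"unet"})

still_missing = missing_components("example-checkpoint", {"vae"})
unknown = missing_components("never-seen-checkpoint", {"vae"})
```

Here `still_missing` contains only `"text_encoder"`, so a later load could be rejected before anything is unloaded. The first load of any checkpoint still has to go through the full path, since nothing is known about it yet.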
@dermesut @DenOfEquity I pushed a PR with what I think is the correct way to handle this situation.
Before the current model data is trashed and the new data is loaded, it does the minimum steps needed to check the inbound model data, and raises any errors before the current model data is dumped.
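The ordering described here, check first and unload second, can be sketched like this. It is not the PR's code: `preflight_check`, `checked_reload`, `REQUIRED_KEYS` and the parameter keys are hypothetical, and strings stand in for models.

```python
from types import SimpleNamespace

# Illustrative requirement list; the real set depends on the model family.
REQUIRED_KEYS = ("checkpoint", "vae", "text_encoder")


def preflight_check(params):
    """Raise before anything is unloaded if a required entry is missing."""
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError("missing components: " + ", ".join(missing))


def checked_reload(holder, params, loader):
    preflight_check(params)   # may raise; the current model is untouched
    holder.sd_model = None    # only now drop the old model
    holder.sd_model = loader(params)
    return holder.sd_model


holder = SimpleNamespace(sd_model="previous model")
incomplete = {"checkpoint": "example-checkpoint", "vae": None}

try:
    checked_reload(holder, incomplete, loader=lambda p: "new model")
except ValueError as err:
    print("rejected:", err)

print(holder.sd_model)
```

Because the check runs before the old model is dropped, the rejected request leaves `holder.sd_model` as it was, and the user can fix the module list and generate again without switching models.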
once i get "you don't have state dict", i can't generate an image with the sd model that is set, even if i complete the state dict, due to "'NoneType' object has no attribute 'sd_checkpoint_info'". only after i change the sd model to something else, generate an image with that one, and then change back to my original sd model can i generate an image again.
this is the process that lets me reproduce that issue:
1) start of server
2) add vae/encoders one by one
3) change flux-model
4) click "generate"
5) delete vae/encoders one by one in order to force "you don't have state dict"
6) click "generate"
7) add back vae/encoders one by one
8) click "generate"