lllyasviel / stable-diffusion-webui-forge

Add option to keep model in VRAM instead of unloading it after each generation #1245

Open · Dampfinchen opened this issue 3 months ago

Dampfinchen commented 3 months ago

I am running the FP16 version of Flux and the FP16 T5 text encoder on my RTX 2060 laptop with 32 GB RAM. I was surprised to see WebUI Forge being several times faster than ComfyUI (11 minutes vs 2 minutes), so great job on the optimization here, @lllyasviel!

However, running it in FP16 is really tight on my RAM as well, so loading parts of the model into VRAM takes quite a bit of time. When I press Generate, moving models adds around 1 minute to the generation time.

So it would be really cool to have an option to turn this behavior off. Once loaded, the model should stay in VRAM until I close the program. This way there would be no model-moving step between generations, which would speed things up a lot. Please consider it.
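For illustration only, here is a minimal PyTorch-style sketch of the difference between the current offload-after-every-generation behaviour and the requested keep-resident behaviour. This is not Forge's actual memory manager; `generate`, `run_batch`, and `keep_in_vram` are hypothetical names, and the prompt is unused in the stub:

```python
import torch

def generate(unet: torch.nn.Module, prompt: str) -> torch.Tensor:
    # Stand-in for a real sampler: one dummy forward pass on whatever device the weights are on.
    device = next(unet.parameters()).device
    return unet(torch.randn(1, 4, device=device))

def run_batch(unet: torch.nn.Module, prompts: list[str], keep_in_vram: bool) -> None:
    for prompt in prompts:
        unet.to("cuda")            # upload weights; slow when RAM and VRAM are both tight
        _ = generate(unet, prompt)
        if not keep_in_vram:
            unet.to("cpu")         # current behaviour: offload after every generation
            torch.cuda.empty_cache()
    # With keep_in_vram=True the upload cost is paid only on the first iteration,
    # because .to("cuda") does nothing once the weights already live on the GPU.

if __name__ == "__main__":
    run_batch(torch.nn.Linear(4, 4), ["a cat", "a dog"], keep_in_vram=True)
```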

Iory1998 commented 3 months ago

Is it due to the T5-xxl? I was looking for an option to keep models in VRAM in the settings when I came across this message: [screenshot]

I wish there was an option to choose which models to keep in VRAM and which to offload to RAM. I have 24 GB of VRAM, and with GGUF-Q8 my VRAM usage is never full, yet each time I change the prompt I see the VRAM unloading and then loading again. Sometimes it gives me an OOM error, and I have to hit the Generate button multiple times before it starts working again.

I also noticed that VRAM usage varies randomly: sometimes it's around 14 GB and sometimes 22 GB. Once it shot past the VRAM limit into shared GPU memory. This led me to think that models may be loaded into VRAM in an arbitrary order, and depending on whether the UNet is loaded first or last, OOM can happen.

Iory1998 commented 3 months ago

I just used ComfyUI, and it seems that models are now kept in VRAM there: generation with the same prompt takes 30 s for me, while changing the prompt takes 47.47 s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the FP16 model on an RTX 3090. It's likely only a matter of time before @lllyasviel implements it. [screenshot]

andy8992 commented 3 months ago

> I just used ComfyUI, and it seems that models are now kept in VRAM there: generation with the same prompt takes 30 s for me, while changing the prompt takes 47.47 s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the FP16 model on an RTX 3090. It's likely only a matter of time before @lllyasviel implements it.

Indeed, this is very frustrating. I only have 8 GB, but even when I do have enough VRAM it seems to load and unload a lot.

Iory1998 commented 3 months ago

> > I just used ComfyUI, and it seems that models are now kept in VRAM there: generation with the same prompt takes 30 s for me, while changing the prompt takes 47.47 s (AR: 832x1216, Steps: 20, commit: bb222ce). I am using the FP16 model on an RTX 3090. It's likely only a matter of time before @lllyasviel implements it.
>
> Indeed, this is very frustrating. I only have 8 GB, but even when I do have enough VRAM it seems to load and unload a lot.

Use the GGUF Q8 if you can; it's 99% identical to FP16. If you can't, try the Q6 version and the T5-xxl FP8 text encoder, which takes half the VRAM of the FP16 one. Remember, you need enough VRAM for the UNet + CLIP + T5-xxl + LoRA + ControlNet models combined.
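As a rough sanity check of that budget, here is a back-of-the-envelope sketch; all sizes below are illustrative approximations, not measured values, and should be replaced with the actual file sizes of the models you load:

```python
# Rough VRAM budget check for a Flux setup (approximate sizes in GB).
components_gb = {
    "unet_flux_gguf_q8": 12.7,   # quantized Flux transformer/UNet
    "t5xxl_fp8": 4.9,            # FP8 T5-xxl text encoder (~half of FP16)
    "clip_l": 0.25,
    "vae": 0.3,
    "lora": 0.2,
    "controlnet": 0.0,           # add its size if you load one
}

total_gb = sum(components_gb.values())
vram_gb = 24.0                   # e.g. an RTX 3090
headroom_gb = vram_gb - total_gb # activations and intermediates also need some room

print(f"models: {total_gb:.1f} GB, headroom: {headroom_gb:.1f} GB of {vram_gb:.0f} GB")
```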

andy8992 commented 3 months ago

Ah, I just meant I was getting constant model movement even in SDXL.

tazztone commented 3 months ago

I also see this unpredictable loading/unloading of models on my 3090. Maybe a checkbox for "keep model in VRAM" would help?

mase-sk commented 3 months ago

Yes, that model offloading is pretty annoying, and it's not stable on an RTX 3060 12 GB with 32 GB RAM. I'm searching for a way to turn off offloading.

Arnaud3013 commented 2 months ago

The issue is in memory_management.py, line 621:

if loaded_model in current_loaded_models:

the loaded_model is always different from what is in current_loaded_models: current_loaded_models -> <backend.memory_management.LoadedModel object at 0x000001C72885BC10>, to load -> <backend.memory_management.LoadedModel object at 0x000001C72C08ADD0>. Maybe some hash comparison could solve that. I've spent some time on it, but I don't know the classes well enough; I was trying a check like in sd_models with if model_data.forge_hash == current_hash: but I never found an equivalent. I hope @lllyasviel can fix it; it should be quite easy with this information?
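To make the suspected behaviour concrete, here is a minimal self-contained sketch, not Forge's actual classes; the `.model` attribute is an assumption for illustration. It shows why the `in` check misses when list membership falls back to object identity, and how an equality hook based on the wrapped model would make it hit:

```python
# Toy reproduction of the membership problem, not Forge's real LoadedModel.
class LoadedModel:
    def __init__(self, model):
        self.model = model  # assumed attribute holding the underlying model object

    # Without __eq__, `in` compares by identity, so a freshly constructed
    # LoadedModel never matches the one already stored in the list.
    def __eq__(self, other):
        return isinstance(other, LoadedModel) and self.model is other.model

    def __hash__(self):
        return id(self.model)

current_loaded_models = []
unet = object()                      # stand-in for the real model object

current_loaded_models.append(LoadedModel(unet))
print(LoadedModel(unet) in current_loaded_models)  # True with __eq__, False without it
```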

Iory1998 commented 2 months ago

> The issue is in memory_management.py, line 621:
>
> if loaded_model in current_loaded_models:
>
> the loaded_model is always different from what is in current_loaded_models: current_loaded_models -> <backend.memory_management.LoadedModel object at 0x000001C72885BC10>, to load -> <backend.memory_management.LoadedModel object at 0x000001C72C08ADD0>. Maybe some hash comparison could solve that. I've spent some time on it, but I don't know the classes well enough; I was trying a check like in sd_models with if model_data.forge_hash == current_hash: but I never found an equivalent. I hope @lllyasviel can fix it; it should be quite easy with this information?

I agree. I noticed a few weeks ago that my VRAM usage would vary drastically (from 12 GB to 24+ GB and into shared memory) when generating the exact same prompt. I could see that memory management tries to load and unload models until they somehow fit. I deduced from that that the models get loaded into VRAM in an arbitrary order. I think that's why each time I change a LoRA, my VRAM gets unloaded entirely and loaded again. Why would you unload the Flux.1 model and the text encoders too?

tazztone commented 2 months ago

Since the last couple of changes it seems to have gotten better. I can now load Flux Q8 and T5 Q8 and keep them in VRAM between generations. [screenshot]

Arnaud3013 commented 2 months ago

Still having the issue on my side. I've done a git pull and a forced reset to HEAD, no change. It always says 0 models to keep loaded. [screenshot]

Which version are you using? I did more tests: with SDXL there is no issue; it's just with Flux (NF4 or Q4).