trihardseven opened this issue 5 months ago
I am having similar issues with loading checkpoints. Despite the settings, the checkpoints do not seem to be retained in VRAM and are reloaded on every single batch run. With four checkpoints on a 3060 12GB card, this takes three minutes each time for the initial step, and then three more for the refiner.
The program should retain the checkpoints instead of loading them every single time. Even with one checkpoint on the device at a time and the setting to keep two checkpoints cached, total render time for four 1024x1024, 40-step images is now almost four minutes, double what it was a week ago.
Is there a reason `--always-gpu` is not a satisfactory resolution? If you want the most speed absolutely possible, and have a high-VRAM card to support it, that is the best option to use. Adding something like `--always-gpu-no-checkpoint` is confusing and is just going to slow you down in other ways.
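For illustration, here is a minimal sketch, not Forge's actual code and with all names hypothetical, of the policy `--always-gpu` is described as implementing in this thread: each model is moved to the GPU once and never evicted, which removes per-batch transfers but also means nothing is ever freed.

```python
import torch

_resident = {}  # model name -> module kept permanently on CUDA

def get_model(name, load_fn):
    """Load a model once and keep it on the GPU for the session.

    Fast (no per-batch transfers), but VRAM use only ever grows:
    nothing is evicted, so switching checkpoints repeatedly will OOM.
    """
    if name not in _resident:
        module = load_fn(name)           # e.g. read weights from disk
        if torch.cuda.is_available():
            module = module.to("cuda")   # moved once, never offloaded
        _resident[name] = module
    return _resident[name]
```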
In my case, uninstalling everything and reinstalling seems to have fixed things for now.
Nope... problem is back again. Seems to be one of the extensions, so I will need to kill them one by one.
@catboxanon the problem with `--always-gpu` is that it doesn't unload checkpoints, and I tend to swap them a lot and do XYZ plots with multiple checkpoints for testing. Because it never unloads these full checkpoints, I run out of VRAM even on my PC. If you think that idea is confusing, I think just adding a "keep LoRA networks in VRAM" option is more than enough.
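To make the failure mode concrete, here is a back-of-the-envelope budget (assuming roughly 6.5 GB per fp16 SDXL checkpoint, an estimate rather than a measured figure) showing how an XYZ plot over a few checkpoints overruns even a 24 GB card when nothing is ever unloaded.

```python
# Rough VRAM budget for an XYZ plot over checkpoints under a
# never-unload policy. The 6.5 GB/checkpoint figure is an assumption
# for an fp16 SDXL model, not a measurement.
GB_PER_CHECKPOINT = 6.5
VRAM_GB = 24            # e.g. a 3090
checkpoints_tested = 4

needed = checkpoints_tested * GB_PER_CHECKPOINT
print(f"resident checkpoints alone: {needed:.1f} GB of {VRAM_GB} GB")
# resident checkpoints alone: 26.0 GB of 24 GB -> OOM before activations
```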
Maybe it's related to the upstream DEV version of AUTO1111?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/14912
I second this; there should be a way to keep the model in VRAM until you switch to another model.
Edit: Don't use any flags, just use those settings, and you'll get what you want.
Edit 2: It won't work for ControlNet models, especially InstantID, because you need to use three models at the same time (the main one + ip-adapter + control_instant_id). You get this annoying "Unload clone" message and I don't know how to get rid of it.
This has no effect on LoRAs though
I have those settings, and the latest version (pulled from AUTO1111 1.8.0 RC) is still slow.
Same issue here. I had these issues in Fooocus and now in Forge, with a 3090 and 24 GB. Checkpoint, LoRA, and ControlNet caching all work perfectly fine in Automatic1111. I tried everything with the params (`--always-gpu`, and the cache options in the config).
I have the same problem on a 3060 (12 GB of memory). It all started after I updated all my installed extensions; LoRAs started slowing down image generation even on 1.7.0. P.S. I seem to have found the extension that doubles generation time for me: https://github.com/KohakuBlueleaf/a1111-sd-webui-lycoris
I'm having the same issue on a 4090. Adding `--always-gpu` dropped the initial "moving model" time from ~3 seconds to ~1 second, but I shouldn't need to reload a model between every image, no? I suspect this is ADetailer loading/unloading.
Checklist
What happened?
Initially the main A1111 was faster for me on a 3090; I found it was an issue with moving models, and using the command-line flag `--always-gpu` made my gens 5 seconds faster. I read that the moving-models issue was fixed, but on the latest Forge version it's still moving LoRAs and is 2+ seconds slower than with the flag on. The "Number of Lora networks to keep cached in memory" setting is active, so I'm guessing it's storing them in RAM and moving them to VRAM. I would just use the flag, but the problem is that when I change full checkpoints it doesn't unload the previous one, forcing me to restart the WebUI after changing checkpoints 2-3 times.
If this isn't just a bug, my suggested solution would be adding a checkbox to this setting that says "Keep LoRA networks in VRAM", or maybe a flag like `--always-gpu-no-checkpoint` that keeps every model except checkpoints in VRAM, for users with high-VRAM cards.
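A rough sketch of what that proposed flag could mean in practice, with hypothetical names and no claim about Forge's internals: small networks stay pinned in VRAM, while the outgoing checkpoint is offloaded whenever the user switches.

```python
import torch

# Hypothetical placement policy for the proposed
# --always-gpu-no-checkpoint behavior. Illustrative only; none of
# these names come from Forge.
PINNED_KINDS = {"lora", "controlnet", "vae", "text_encoder"}

def should_pin(model_kind: str) -> bool:
    """Keep everything except full checkpoints resident in VRAM."""
    return model_kind in PINNED_KINDS

def switch_checkpoint(old_ckpt, new_ckpt):
    # Unlike plain --always-gpu, the outgoing checkpoint is offloaded,
    # so repeated switches don't accumulate in VRAM.
    old_ckpt.to("cpu")
    torch.cuda.empty_cache()
    return new_ckpt.to("cuda")
```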
Steps to reproduce the problem
What should have happened?
It should have kept a number of LoRA networks in VRAM matching the "Number of Lora networks to keep cached in memory" setting.
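For reference, the setting's name suggests standard LRU caching, something like the following sketch (illustrative only; not the webui's actual implementation).

```python
from collections import OrderedDict

class LoraCache:
    """Minimal LRU cache of loaded LoRA weights, sized to match the
    "Number of Lora networks to keep cached in memory" setting.
    Illustrative sketch, not the webui's real code."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # name -> loaded LoRA weights

    def get(self, name, load_fn):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        weights = load_fn(name)            # only hit disk on a miss
        self._cache[name] = weights
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return weights
```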
What browsers do you use to access the UI?
Brave
Sysinfo
sysinfo-.json
Console logs
Additional information
This bug might be happening because I have two GPUs; I'm not sure if everyone is having this issue. The GPU being used is the 3090, though, as you can see in the console.