lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

[Bug]: Move model much longer with loras #782

Closed: Super-zapper closed this 2 weeks ago

Super-zapper commented 4 months ago

What happened?

I have a 3060 with 6 GB VRAM. Generating one SDXL image takes about 19 seconds, but if I add 2 LoRAs it takes about 35 seconds, and almost 20 of those are spent moving the model. I am not sure whether that is wrong, but the increase seems disproportionate.

Steps to reproduce the problem

Just run txt2img with and without LoRAs added.

What should have happened?

I believe the difference in generation time should not be so significant; it should not be almost 2 times longer just because I add 2 LoRAs.

What browsers do you use to access the UI?

Google Chrome, Brave

Sysinfo

sysinfo-2024-06-03-04-54.json

Console logs

Already up to date.
venv "E:\Forge SD\venv\Scripts\Python.exe"
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: f0.0.17v1.8.0rc-latest-276-g29be1da7
Commit hash: 29be1da7cf2b5dccfc70fbdd33eb35c56a31ffb7
CUDA 12.1
Launching Web UI with arguments: --api --xformers
Total VRAM 6144 MB, total RAM 16147 MB
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
xformers version: 0.0.23.post1
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3060 Laptop GPU : native
Hint: your device supports --pin-shared-memory for potential speed improvements.
Hint: your device supports --cuda-malloc for potential speed improvements.
Hint: your device supports --cuda-stream for potential speed improvements.
VAE dtype: torch.bfloat16
CUDA Stream Activated:  False
Using xformers cross attention
ControlNet preprocessor location: E:\Forge SD\models\ControlNetPreprocessor
Using sqlite file: E:\Forge SD\extensions\sd-webui-agent-scheduler\task_scheduler.sqlite3
01:55:22 - ReActor - STATUS - Running v0.7.0-b7 on Device: CUDA
Loading weights [67ab2fd8ec] from E:\Forge SD\models\Stable-diffusion\ponyDiffusionV6XL_v6StartWithThisOne.safetensors
2024-06-03 01:55:22,653 - ControlNet - INFO - ControlNet UI callback registered.
model_type EPS
UNet ADM Dimension 2816
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 23.2s (prepare environment: 6.4s, import torch: 5.6s, import gradio: 1.1s, setup paths: 1.0s, initialize shared: 0.2s, other imports: 0.9s, load scripts: 4.2s, create ui: 1.0s, gradio launch: 0.4s, add APIs: 0.8s, app_started_callback: 1.4s).
Using xformers attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using xformers attention in VAE
extra {'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.text_projection'}
To load target model SDXLClipModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  5120.6982421875
[Memory Management] Model Memory (MB) =  2144.3546981811523
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  1952.3435440063477
Moving model(s) has taken 0.82 seconds
Model loaded in 14.9s (load weights from disk: 0.9s, forge load real models: 12.5s, load textual inversion embeddings: 0.1s, calculate empty prompt: 1.2s).
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  5084.38427734375
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  -836.7022171020508
[Memory Management] Requested SYNC Preserved Memory (MB) =  3123.3725204467773
[Memory Management] Parameters Loaded to SYNC Stream (MB) =  1773.6767654418945
[Memory Management] Parameters Loaded to GPU (MB) =  3123.37158203125
Moving model(s) has taken 3.02 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 15/15 [00:16<00:00,  1.08s/it]
To load target model AutoencoderKL█████████████████████████████████████████████████████| 15/15 [00:13<00:00,  1.00s/it]
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  5055.88427734375
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3872.327196121216
Moving model(s) has taken 1.15 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 15/15 [00:16<00:00,  1.09s/it]
[LORA] Loaded E:\Forge SD\models\Lora\xl_more_art-full_v1.safetensors for SDXL-UNet with 788 keys at weight 0.8 (skipped 0 keys)
[LORA] Loaded E:\Forge SD\models\Lora\Smooth Anime 2 Style SDXL_LoRA_Pony Diffusion V6 XL.safetensors for SDXL-UNet with 722 keys at weight 1.0 (skipped 0 keys)
[LORA] Loaded E:\Forge SD\models\Lora\Smooth Anime 2 Style SDXL_LoRA_Pony Diffusion V6 XL.safetensors for SDXL-CLIP with 264 keys at weight 1.0 (skipped 0 keys)
To load target model SDXLClipModel
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  4884.6904296875
[Memory Management] Model Memory (MB) =  2144.3546981811523
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  1716.3357315063477
Moving model(s) has taken 0.86 seconds
To load target model SDXL
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  5046.9013671875
[Memory Management] Model Memory (MB) =  4897.086494445801
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  -874.1851272583008
[Memory Management] Requested SYNC Preserved Memory (MB) =  3094.5395126342773
[Memory Management] Parameters Loaded to SYNC Stream (MB) =  1802.6171875
[Memory Management] Parameters Loaded to GPU (MB) =  3094.4311599731445
Moving model(s) has taken 70.16 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 15/15 [00:16<00:00,  1.09s/it]
To load target model AutoencoderKL█████████████████████████████████████████████████████| 15/15 [00:14<00:00,  1.00it/s]
Begin to load 1 model
[Memory Management] Current Free GPU Memory (MB) =  5038.4013671875
[Memory Management] Model Memory (MB) =  159.55708122253418
[Memory Management] Minimal Inference Memory (MB) =  1024.0
[Memory Management] Estimated Remaining GPU Memory (MB) =  3854.844285964966
Moving model(s) has taken 0.52 seconds
Total progress: 100%|██████████████████████████████████████████████████████████████████| 15/15 [00:15<00:00,  1.07s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 15/15 [00:15<00:00,  1.00it/s]

Additional information

In the console log you will see the first generation, without any LoRAs, where moving the model takes 3 seconds (that is the first attempt; the next ones take about 1 second). The second generation, with 2 LoRAs, takes 70 seconds to move the model (it is the first generation with them, so it takes much longer; the next ones take 19 seconds at minimum).
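For reference, the `[Memory Management]` lines in the log follow a simple budget: free VRAM minus the model size minus a minimal inference reserve. If that goes negative, Forge keeps only part of the weights on the GPU and streams the rest from system RAM over a synced stream. The sketch below only illustrates the shape of that decision; the split heuristic is an assumption, not Forge's actual code (the log's real GPU/SYNC split uses a larger preserved buffer):

```python
def plan_model_load(free_mb: float, model_mb: float, inference_reserve_mb: float = 1024.0):
    """Rough sketch of Forge's [Memory Management] budget, inferred from the log.

    The "remaining" calculation matches the log exactly; the GPU/SYNC split
    below is an assumed, simplified heuristic.
    """
    remaining = free_mb - model_mb - inference_reserve_mb
    if remaining >= 0:
        # Everything fits: load the whole model onto the GPU.
        return {"gpu_mb": model_mb, "sync_mb": 0.0, "remaining_mb": remaining}
    # Not enough VRAM: at least the shortfall must stay off-GPU and be
    # streamed in during inference (assumed heuristic).
    sync_mb = -remaining
    gpu_mb = model_mb - sync_mb
    return {"gpu_mb": gpu_mb, "sync_mb": sync_mb, "remaining_mb": remaining}
```

Plugging in the numbers from the second SDXL load above (free 5084.38 MB, model 4897.09 MB) reproduces the log's Estimated Remaining GPU Memory of about -836.7 MB, which is why the UNet cannot stay fully resident on a 6 GB card.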

BrickMissle commented 3 months ago

in your settings under extra networks, how many loras do you have set to be cached in memory?

Super-zapper commented 3 months ago

Well, it was 0, but I tried 2 now and that does not help. I also tried increasing the maximum number of checkpoints to be loaded, and that did not help either.
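For context, the "cached in memory" setting keeps recently used LoRA weight dicts in RAM so they don't have to be re-read from disk on every generation. A minimal sketch of that kind of LRU cache (the class name and `loader` callback are illustrative, not Forge's actual implementation; in practice the loader would be something like `safetensors.torch.load_file`):

```python
from collections import OrderedDict

class LoraCache:
    """Keep up to `capacity` recently used LoRA weight dicts in memory.

    Simplified illustration of an "extra networks cached in memory" setting;
    Forge's real cache is more involved.
    """
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, path, loader):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        weights = loader(path)  # cache miss: read from disk
        if self.capacity > 0:
            self._cache[path] = weights
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return weights
```

Note a cache like this only saves the disk read; it does not avoid re-patching the weights into the model, which is where the 70-second "move model" time in the log is spent, so it makes sense that raising the setting didn't help here.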

unspokethetheme commented 3 months ago

https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/693#issue-2266541153

ExperienceRO commented 2 months ago

Can totally confirm that, and worse: the first time I use a LoRA in any prompt I have to close the console, because it jumps from 20 seconds to 20+ minutes. I need to keep Gradio open, then open Forge again to generate normally. I'm using an i7 13th gen with 64 GB RAM and a 4070 Ti... First run without LoRA: [image]

Second run with LoRA: [image] [image]

It's VERY annoying... ¬¬ Some optimization would be great as a solution..

Thank you.

BrickMissle commented 2 months ago

> It's VERY annoying... ¬¬ Some optimization would be great as a solution..

In your prompt, change source_anime to source_furry. The lora may be stalling because it cannot find anthro data in the anime section of the checkpoint.

Unless you have the same problems with other loras, in which case I am not sure why your very powerful specs are taking so long to deal with a mere 159 MB LoRA. It's probably an issue with Forge, but lllyasviel has been updating it recently, so don't worry, I'm sure your PC will go from Slowpoke Rodriguez back to Speedy Gonzales very soon
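For what it's worth, the slowdown probably has less to do with the LoRA's file size than with patching: each LoRA layer stores a low-rank delta that gets merged into the matching base weight (W' = W + scale * up @ down), so adding a LoRA means touching and re-moving every affected UNet/CLIP weight, which is exactly the expensive "move model" step on low-VRAM setups. A numpy sketch of that patch (function and variable names are illustrative, not Forge's API):

```python
import numpy as np

def apply_lora(weight, down, up, scale):
    """Merge one LoRA pair into a base weight matrix: W' = W + scale * (up @ down).

    Illustrative only; a real loader applies this to every layer whose key
    matches the LoRA's keys (788 keys for the first LoRA in the log above).
    """
    return weight + scale * (up @ down)

# A (640, 640) base weight patched via rank-8 factors, far smaller than W itself.
rng = np.random.default_rng(0)
w = rng.standard_normal((640, 640)).astype(np.float32)
down = rng.standard_normal((8, 640)).astype(np.float32)   # "lora_down"
up = rng.standard_normal((640, 8)).astype(np.float32)     # "lora_up"
w_patched = apply_lora(w, down, up, scale=0.8)
```

The LoRA file only holds the small `up`/`down` factors, but the patched result is a full-size weight, which is why two small LoRAs can still trigger a full, slow re-move of the model.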

ExperienceRO commented 2 months ago

> > It's VERY annoying... ¬¬ Some optimization would be great as a solution..
>
> In your prompt, change source_anime to source_furry. The lora may be stalling because it cannot find anthro data in the anime section of the checkpoint.
>
> Unless you have the same problems with other loras, in which case I am not sure why your very powerful specs are taking so long to deal with a mere 159mb lora. It's probably an issue with forge, but lllyasviel has been updating it recently so don't worry I'm sure your PC will go from Slowpoke Rodriguez back to Speedy Gonzales very soon

Thank you, but it's definitely not a problem with the 'source' tag in the prompt, not least because the LoRA creator uses anime rather than furry. Also, the same doesn't happen in A1111 or ComfyUI, and it happens with any LoRA; the first time it loads, it's hell. 😅😅😅

Hope it gets solved soon. Forge is way faster than A1111 and way easier to work with than ComfyUI. I'm used to using Comfy for XL models and SD3, but I prefer the WebUI; it's more user friendly IMO. [image]

Super-zapper commented 2 weeks ago

After the latest updates this was solved. Thanks to the dev!