lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

Slow generation on fp8 Q8 GGUF since last commits #1656

Closed BenDes21 closed 2 months ago

BenDes21 commented 2 months ago

Hi there, since the recent updates my generation has been pretty slow on the quantized fp8 model, and the output is very blurry. I also get some messages before generation that I didn't get before:

To create a public link, set `share=True` in `launch()`.
Startup time: 15.1s (prepare environment: 3.4s, import torch: 5.7s, initialize shared: 0.2s, other imports: 0.4s, load scripts: 2.7s, create ui: 1.7s, gradio launch: 1.0s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 91.66% GPU memory (11257.00 MB) to load weights, and use 8.34% GPU memory (1024.00 MB) to do matrix computation.
Loading Model: {'checkpoint_info': {'filename': 'C:\\Users\\Admin\\Documents\\stable-diffusion-webui\\models\\Stable-diffusion\\flux1-dev-Q8_0.gguf', 'hash': 'b44b9b8a'}, 'additional_modules': ['C:\\Users\\Admin\\Documents\\stable-diffusion-webui-forge\\models\\text_encoder\\clip_l.safetensors', 'C:\\Users\\Admin\\Documents\\stable-diffusion-webui-forge\\models\\VAE\\ae.safetensors', 'C:\\Users\\Admin\\Documents\\stable-diffusion-webui-forge\\models\\text_encoder\\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 780, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Using Detected UNet Type: gguf
Using pre-quant state dict!
GGUF state dict: {'Q8_0': 304}
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'gguf', 'computation_dtype': torch.bfloat16}
C:\Users\Admin\Documents\stable-diffusion-webui-forge\modules_forge\patch_basic.py:38: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  result = original_loader(*args, **kwargs)
Model loaded in 32.1s (unload existing model: 0.1s, forge model load: 32.0s).
                                  [LORA] Loaded C:\Users\Admin\Documents\stable-diffusion-webui\models\Lora\kazumi_flux_ostris_000003000.safetensors for KModel-UNet with 494 keys at weight 1.0 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 13464.34 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 11034.05 MB, Model Require: 9569.49 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 440.55 MB, All loaded to GPU.
Moving model(s) has taken 12.89 seconds
Distilled CFG Scale: 3.5
Skipping unconditional conditioning (HR pass) when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 1024.00 MB for cuda:0 with 1 models keep loaded ... Current free memory is 1328.46 MB ... Done.
Distilled CFG Scale: 3.5
[Unload] Trying to free 17045.65 MB for cuda:0 with 0 models keep loaded ... Current free memory is 1323.47 MB ... Unload model JointTextEncoder Done.
[Memory Management] Target: KModel, Free GPU: 10972.08 MB, Model Require: 12119.55 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -2171.47 MB, CPU Swap Loaded (blocked method): 3461.62 MB, GPU Loaded: 8657.92 MB
Moving model(s) has taken 79.81 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [01:48<00:00,  4.34s/it]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 2258.19 MB ... Unload model KModel Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 10945.67 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: 9761.80 MB, All loaded to GPU.
Moving model(s) has taken 7.24 seconds
[Unload] Trying to free 1024.00 MB for cuda:0 with 0 models keep loaded ... Current free memory is 10785.33 MB ... Done.
Cleanup minimal inference memory.
tiled upscale: 100%|███████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.49it/s]
[Unload] Trying to free 11262.40 MB for cuda:0 with 1 models keep loaded ... Current free memory is 10705.97 MB ... Done.
[Unload] Trying to free 19920.08 MB for cuda:0 with 0 models keep loaded ... Current free memory is 10740.32 MB ... Unload model IntegratedAutoencoderKL Done.
[Memory Management] Target: KModel, Free GPU: 10901.64 MB, Model Require: 12119.51 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -2241.87 MB, CPU Swap Loaded (blocked method): 3576.38 MB, GPU Loaded: 8543.13 MB
Moving model(s) has taken 3.09 seconds
 10%|████████▎                                                                          | 2/20 [00:37<05:39, 18.88s/it]
Total progress:  62%|█████████████████████████████████████████                         | 28/45 [04:31<04:24, 15.58s/it]

I would like to know where this could be coming from and how to fix it. Maybe a library needs to be reinstalled?

Thanks

BenDes21 commented 2 months ago

Set Diffusion in Low Bits to Automatic (fp16 LoRA).
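
For anyone reading the log above later: the `[Memory Management]` lines appear to follow a simple budget, Remaining = Free GPU - Model Require - Inference Require, and a negative result coincides with the "CPU Swap Loaded" entries on the KModel lines. This relation is inferred only from the numbers printed in this log, not from the Forge source, so treat it as an assumption. Below is a minimal sketch that checks it against two of the logged lines; `LOG_LINES`, `PATTERN`, and `memory_budget` are hypothetical names made up for this example.

```python
import re

# Two lines copied verbatim from the log in this issue.
LOG_LINES = [
    "[Memory Management] Target: JointTextEncoder, Free GPU: 11034.05 MB, "
    "Model Require: 9569.49 MB, Previously Loaded: 0.00 MB, "
    "Inference Require: 1024.00 MB, Remaining: 440.55 MB, All loaded to GPU.",
    "[Memory Management] Target: KModel, Free GPU: 10972.08 MB, "
    "Model Require: 12119.55 MB, Previously Loaded: 0.00 MB, "
    "Inference Require: 1024.00 MB, Remaining: -2171.47 MB, "
    "CPU Swap Loaded (blocked method): 3461.62 MB, GPU Loaded: 8657.92 MB",
]

# Pull out the four numbers needed for the budget check.
PATTERN = re.compile(
    r"Target: (?P<target>\w+), Free GPU: (?P<free>-?[\d.]+) MB, "
    r"Model Require: (?P<model>-?[\d.]+) MB, .*?"
    r"Inference Require: (?P<inference>-?[\d.]+) MB, "
    r"Remaining: (?P<remaining>-?[\d.]+) MB"
)


def memory_budget(line):
    """Recompute 'Remaining' for one log line; return None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    free = float(match["free"])
    model = float(match["model"])
    inference = float(match["inference"])
    # Assumed relation (inferred from the log, not from the Forge code).
    computed = free - model - inference
    return {
        "target": match["target"],
        "computed_remaining": round(computed, 2),
        "logged_remaining": float(match["remaining"]),
        # A negative budget means the model does not fit next to the
        # memory reserved for inference.
        "fits_on_gpu": computed >= 0,
    }


if __name__ == "__main__":
    for line in LOG_LINES:
        print(memory_budget(line))
```

On these two lines the recomputed value agrees with the logged one to within 0.01 MB. The text encoder fits, while the Q8_0 KModel (12119.55 MB) is about 2.1 GB over budget on this card, which is why part of it gets swapped to the CPU.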