lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0
8.59k stars · 847 forks

GGUF Q8_0 and Automatic Diffusion (8bit LoRa) generating poor quality #1624

Open blakejrobinson opened 2 months ago

blakejrobinson commented 2 months ago

Since commit ba01ad37, LoRAs loaded in 8-bit with the Q8_0 GGUF checkpoint produce poor-quality generations. Loading the LoRA in 16-bit appears to fix the issue, though rounding causes subtle differences in the generations.

This does not seem to happen with the FP8 safetensors checkpoint or with NF4; only the Q8_0 GGUF is affected. It also does not happen at commit ba01ad37 and earlier.

Example at commit 3b9b2f65: `<lora:flux_Gen_5_Trainer_Sprites:1> A pixelart drawing of a chicken`

Diffusion in Low bits set to Automatic: tmpdhprdqqr

Diffusion in Low bits set to Automatic (FP16 LoRA): tmpez8qh4ed

Example at commit ba01ad37 (the last testable commit before the 8-bit LoRA behavior changed):

Diffusion in Low bits set to Automatic: tmpw1eub1xa

Diffusion in Low bits set to Automatic (FP16 Lora): tmpfoodnrpo

Generations with the FP8 safetensor for comparison:

Diffusion in Low bits set to Automatic: tmpyjgjvqxv

Diffusion in Low bits set to Automatic (FP16 Lora): tmpmhzlqzzk

And here is the NF4 for comparison:

Diffusion in Low bits set to Automatic: tmpmaz2i3rl

Diffusion in Low bits set to Automatic (FP16 Lora): tmph9qotj7_

(The LoRA used here was https://civitai.com/models/704779/flux-gen-5-trainer-sprites, but the issue appears to happen with all LoRAs.)
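For what it's worth, the quality loss described above is what you would expect if the LoRA delta were rounded onto the already-quantized weight grid instead of being merged in FP16 before quantization. Below is a minimal NumPy sketch: the block size and absmax scaling follow Q8_0's layout, but the "8-bit patch" path is only a model of the suspected behavior, not Forge's actual code.

```python
import numpy as np

BLOCK = 32  # GGUF Q8_0 groups weights into blocks of 32 with one scale each

def q8_0_quantize(x):
    """Blockwise symmetric int8 quantization in the style of GGUF Q8_0."""
    xb = x.reshape(-1, BLOCK)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(xb / scale), -127, 127)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 4096).astype(np.float32)       # stand-in base weight
delta = rng.normal(0.0, 0.002, 4096).astype(np.float32)  # stand-in LoRA delta
target = w + delta                                        # ideal patched weight

# "FP16 LoRA" path: merge at full precision, then quantize once.
q, s = q8_0_quantize(target)
err_fp16_patch = np.abs(dequantize(q, s) - target).mean()

# Modeled 8-bit path: round the delta onto the already-quantized weight's
# int8 grid, so the weight and the delta each contribute a rounding error.
qw, sw = q8_0_quantize(w)
q8 = np.clip(qw + np.round(delta.reshape(-1, BLOCK) / sw), -127, 127)
err_8bit_patch = np.abs(dequantize(q8, sw) - target).mean()

print(f"fp16-merge error:  {err_fp16_patch:.2e}")
print(f"8-bit-merge error: {err_8bit_patch:.2e}")
```

Under these assumptions the 8-bit patch accumulates two independent rounding errors per weight, so its mean error is consistently larger than merging in FP16 first, which matches the direction of the regression reported here.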

lllyasviel commented 2 months ago

update and try again

blakejrobinson commented 2 months ago

Still occurring in 69ffe37f: 00013-2738534561

Log here just in case it helps:

```
initial startup: done in 0.023s
  prepare environment:
  checks: done in 0.008s
  git version info: done in 0.091s
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-494-g69ffe37f
Commit hash: 69ffe37f147660f90783d1d39ac9d62d8661cb73
  torch GPU test: done in 1.924s
  clone repositores: done in 0.097s
    run extensions installers:
    adetailer: done in 0.148s
    sd-webui-lobe-theme: done in 0.001s
    sd-webui-pixelart: done in 0.000s
CUDA 12.4
    sd-webui-reactor: done in 2.047s
    run extensions_builtin installers:
    extra-options-section: done in 0.001s
    forge_legacy_preprocessors: done in 0.299s
    forge_preprocessor_inpaint: done in 0.001s
    forge_preprocessor_marigold: done in 0.000s
    forge_preprocessor_normalbae: done in 0.000s
    forge_preprocessor_recolor: done in 0.000s
    forge_preprocessor_reference: done in 0.000s
    forge_preprocessor_revision: done in 0.001s
    forge_preprocessor_tile: done in 0.000s
    forge_space_animagine_xl_31: done in 0.000s
    forge_space_birefnet: done in 0.000s
    forge_space_example: done in 0.000s
    forge_space_florence_2: done in 0.000s
    forge_space_geowizard: done in 0.000s
    forge_space_iclight: done in 0.000s
    forge_space_idm_vton: done in 0.001s
    forge_space_illusion_diffusion: done in 0.000s
    forge_space_photo_maker_v2: done in 0.000s
    forge_space_sapiens_normal: done in 0.000s
    mobile: done in 0.000s
    prompt-bracket-checker: done in 0.000s
    ScuNET: done in 0.000s
    sd_forge_controlllite: done in 0.000s
    sd_forge_controlnet: done in 0.295s
    sd_forge_dynamic_thresholding: done in 0.000s
    sd_forge_fooocus_inpaint: done in 0.000s
    sd_forge_freeu: done in 0.000s
    sd_forge_ipadapter: done in 0.001s
    sd_forge_kohya_hrfix: done in 0.000s
    sd_forge_latent_modifier: done in 0.000s
    sd_forge_lora: done in 0.000s
    sd_forge_multidiffusion: done in 0.000s
    sd_forge_neveroom: done in 0.001s
    sd_forge_perturbed_attention: done in 0.000s
    sd_forge_sag: done in 0.000s
    sd_forge_stylealign: done in 0.001s
    soft-inpainting: done in 0.000s
    SwinIR: done in 0.000s
Launching Web UI with arguments: --log-startup --listen
launcher: done in 0.002s
Total VRAM 24564 MB, total RAM 65423 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
import torch: done in 6.113s
import torch: done in 0.591s
import gradio: done in 0.000s
setup paths: done in 0.004s
initialize shared: done in 0.138s
other imports: done in 0.572s
Running with TLS
TLS: done in 0.002s
opts onchange: done in 0.000s
setup SD model: done in 0.000s
setup codeformer: done in 0.002s
setup gfpgan: done in 0.010s
set samplers: done in 0.001s
list extensions: done in 0.009s
restore config state file: done in 0.001s
list SD models: done in 0.018s
list localizations: done in 0.002s
  load scripts:
  custom_code.py: done in 0.009s
  img2imgalt.py: done in 0.001s
  loopback.py: done in 0.000s
  outpainting_mk_2.py: done in 0.001s
  poor_mans_outpainting.py: done in 0.000s
  postprocessing_codeformer.py: done in 0.000s
  postprocessing_focal_crop.py: done in 0.008s
  postprocessing_gfpgan.py: done in 0.001s
  postprocessing_upscale.py: done in 0.000s
  prompt_matrix.py: done in 0.000s
  prompts_from_file.py: done in 0.001s
  sd_upscale.py: done in 0.000s
  xyz_grid.py: done in 0.002s
  scunet_model.py: done in 0.414s
  swinir_model.py: done in 0.053s
  extra_options_section.py: done in 0.001s
  legacy_preprocessors.py: done in 0.015s
  preprocessor_inpaint.py: done in 0.181s
  preprocessor_marigold.py: done in 0.012s
  preprocessor_normalbae.py: done in 0.007s
  preprocessor_recolor.py: done in 0.000s
  forge_reference.py: done in 0.001s
  preprocessor_revision.py: done in 0.000s
  preprocessor_tile.py: done in 0.001s
  forge_controllllite.py: done in 0.012s
ControlNet preprocessor location: F:\Software\AI\stable-diffusion-webui-forge\models\ControlNetPreprocessor
  controlnet.py: done in 0.934s
  xyz_grid_support.py: done in 0.000s
  forge_dynamic_thresholding.py: done in 0.004s
  forge_fooocus_inpaint.py: done in 0.001s
  forge_freeu.py: done in 0.000s
  forge_ipadapter.py: done in 0.007s
  kohya_hrfix.py: done in 0.000s
  forge_latent_modifier.py: done in 0.004s
  lora_script.py: done in 0.358s
  forge_multidiffusion.py: done in 0.004s
  forge_never_oom.py: done in 0.000s
  forge_perturbed_attention.py: done in 0.001s
  forge_sag.py: done in 0.000s
  forge_stylealign.py: done in 0.001s
  soft_inpainting.py: done in 0.000s
[-] ADetailer initialized. version: 24.8.0, num models: 12
  !adetailer.py: done in 0.534s
  settings.py: done in 0.078s
  pixelart.py: done in 0.002s
  postprocessing_pixelart.py: done in 0.001s
  console_log_patch.py: done in 0.367s
  reactor_api.py: done in 0.160s
22:23:02 - ReActor - STATUS - Running v0.7.1-a2 on Device: CUDA
  reactor_faceswap.py: done in 0.005s
  reactor_globals.py: done in 0.001s
  reactor_helpers.py: done in 0.000s
  reactor_logger.py: done in 0.001s
  reactor_swapper.py: done in 0.001s
  reactor_version.py: done in 0.001s
  reactor_xyz.py: done in 0.080s
  comments.py: done in 0.071s
  refiner.py: done in 0.001s
  sampler.py: done in 0.000s
  seed.py: done in 0.001s
load upscalers: done in 0.005s
refresh VAE: done in 0.003s
scripts list_unets: done in 0.000s
reload hypernetworks: done in 0.005s
initialize extra networks: done in 0.003s
scripts before_ui_callback: done in 0.002s
2024-08-31 22:23:04,070 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\Stable-diffusion\\flux1-dev-Q8_0.gguf', 'hash': 'b44b9b8a'}, 'additional_modules': ['F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\clip_l.safetensors', 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\t5xxl_fp8_e4m3fn.safetensors', 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\ae.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
create ui: done in 2.439s
Running on local URL:  https://0.0.0.0:7862

To create a public link, set `share=True` in `launch()`.
gradio launch: done in 4.792s
add APIs: done in 0.011s
  app_started_callback:
  controlnet.py: done in 0.005s
  lora_script.py: done in 0.001s
  !adetailer.py: done in 0.001s
?? LobeTheme: Initializing...
  settings.py: done in 0.004s
  reactor_api.py: done in 0.014s
Startup time: 23.0s (prepare environment: 5.0s, import torch: 6.7s, initialize shared: 0.1s, other imports: 0.6s, load scripts: 3.3s, create ui: 2.4s, gradio launch: 4.8s).
Environment vars changed: {'stream': False, 'inference_memory': 0.0, 'pin_shared_memory': False}
Loading Model: {'checkpoint_info': {'filename': 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\Stable-diffusion\\flux1-dev-Q8_0.gguf', 'hash': 'b44b9b8a'}, 'additional_modules': ['F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\clip_l.safetensors', 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\t5xxl_fp8_e4m3fn.safetensors', 'F:\\Software\\AI\\stable-diffusion-webui-forge\\models\\VAE\\ae.safetensors'], 'unet_storage_dtype': None}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 780, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Detected T5 Data Type: torch.float8_e4m3fn
Using Detected UNet Type: gguf
Using pre-quant state dict!
GGUF state dict: {'Q8_0': 304}
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'gguf', 'computation_dtype': torch.bfloat16}
Model loaded in 15.8s (unload existing model: 0.2s, forge model load: 15.6s).
[LORA] Loaded F:\Software\AI\stable-diffusion-webui-forge\models\Lora\Styles\flux_Gen_5_Trainer_Sprites.safetensors for KModel-UNet with 304 keys at weight 1.0 (skipped 0 keys) with on_the_fly = False
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 6699.54 MB for cuda:0 with 0 models keep loaded ... Done.
[Memory Management] Target: JointTextEncoder, Free GPU: 22982.00 MB, Model Require: 5153.49 MB, Previously Loaded: 0.00 MB, Inference Require: 0.00 MB, Remaining: 17828.51 MB, All loaded to GPU.
Moving model(s) has taken 2.39 seconds
Distilled CFG Scale: 3.5
[Unload] Trying to free 17045.65 MB for cuda:0 with 0 models keep loaded ... Current free memory is 17659.43 MB ... Done.
[Memory Management] Target: KModel, Free GPU: 17659.43 MB, Model Require: 12119.55 MB, Previously Loaded: 0.00 MB, Inference Require: 0.00 MB, Remaining: 5539.88 MB, All loaded to GPU.
Moving model(s) has taken 13.69 seconds
100%|██████████| 20/20 [00:13<00:00, 1.47it/s]
[Unload] Trying to free 4495.77 MB for cuda:0 with 0 models keep loaded ... Current free memory is 5070.08 MB ... Done.
[Memory Management] Target: IntegratedAutoencoderKL, Free GPU: 5058.27 MB, Model Require: 159.87 MB, Previously Loaded: 0.00 MB, Inference Require: 0.00 MB, Remaining: 4898.39 MB, All loaded to GPU.
Moving model(s) has taken 0.29 seconds
Total progress: 100%|██████████| 20/20 [00:13<00:00, 1.47it/s]
```

BenDes21 commented 2 months ago

Did you fix the problem?

blakejrobinson commented 2 months ago

> Did you fix the problem?

Nope, it's still occurring in the latest commit, f40930c5.