[Bug]: Live preview not working with fp8.

Necro-mancer commented 9 months ago

Checklist

[x] The issue exists after disabling all extensions
[x] The issue exists on a clean installation of webui
[ ] The issue is caused by an extension, but I believe it is caused by a bug in the webui
[x] The issue exists in the current version of the webui
[ ] The issue has not been reported before recently
[ ] The issue has been reported before but has not been fixed yet

What happened?

RTX 3070 8 GB, Windows 11, latest drivers.

fp8 doesn't work. Is this a mistake on my end, or has it not been implemented yet?

Toggling the fp8 option in settings has no effect on VRAM usage, even after restarting the webui and the console. I also tried the fp8 UNet command-line argument, but it produces CUDA errors. This is the exact error: RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn' (same for float8_e5). I know that Forge inherently uses less VRAM than auto1111 at default settings, but with fp8 enabled in auto1111 I can do much more with SDXL without overflowing into shared VRAM (and generation slowing to a crawl).
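
As a rough back-of-the-envelope check on the saving one might expect, weight memory scales with bytes per parameter. The sketch below assumes an SDXL UNet of roughly 2.6 billion parameters (an approximate figure, not measured from this install) and counts weights only, so activations, the VAE, and the text encoders are ignored.

```python
# Bytes needed to store one parameter in each storage dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

# Assumption: approximate SDXL UNet parameter count, not measured here.
SDXL_UNET_PARAMS = 2_600_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("fp16", "fp8"):
    gib = weight_memory_gib(SDXL_UNET_PARAMS, dtype)
    print(f"{dtype}: {gib:.2f} GiB of UNet weights")
```

Under these assumptions the UNet weights drop from roughly 4.8 GiB in fp16 to roughly 2.4 GiB in fp8, which is the kind of difference that should be visible on an 8 GB card if the option actually takes effect.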

Steps to reproduce the problem

Enable fp8 from settings. Generate image.

What should have happened?

Use less vram

What browsers do you use to access the UI ?

Mozilla Firefox

Sysinfo

sysinfo-2024-02-10-19-37.json

Console logs

venv "C:\Users\Admin\stable-diffusion-webui-forge\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f0.0.12-latest-110-g15bb49e7
Commit hash: 15bb49e761e837c0a3463a736762d11941ea69f7
Faceswaplab : Use GPU requirements
Checking faceswaplab requirements
0.008895099999790546
Launching Web UI with arguments:
Total VRAM 8192 MB, total RAM 32689 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
VAE dtype: torch.bfloat16
Using pytorch cross attention
ControlNet preprocessor location: C:\Users\Admin\stable-diffusion-webui-forge\models\ControlNetPreprocessor
[-] ADetailer initialized. version: 24.1.2, num models: 16
Loading weights [aeb7e9e689] from C:\Users\Admin\stable-diffusion-webui-forge\models\Stable-diffusion\juggernautXL_v8Rundiffusion.safetensors
C:\Users\Admin\stable-diffusion-webui-forge\modules\gradio_extensons.py:25: GradioDeprecationWarning: `optional` parameter is deprecated, and it has no effect
  res = original_IOComponent_init(self, *args, **kwargs)
2024-02-11 00:32:45,468 - ControlNet - INFO - ControlNet UI callback registered.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
model_type EPS
UNet ADM Dimension 2816
Startup time: 20.2s (prepare environment: 4.4s, import torch: 4.3s, import gradio: 1.2s, setup paths: 0.8s, initialize shared: 0.1s, other imports: 0.7s, load scripts: 6.5s, create ui: 1.4s, gradio launch: 0.7s).
Using pytorch attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using pytorch attention in VAE
extra {'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.logit_scale'}
Loading VAE weights specified in settings: C:\Users\Admin\stable-diffusion-webui-forge\models\VAE\sdxl-vae-fp16.safetensors
To load target model SDXLClipModel
Begin to load 1 model
Moving model(s) has taken 0.35 seconds
Model loaded in 9.6s (load weights from disk: 1.8s, forge load real models: 6.7s, load VAE: 0.4s, calculate empty prompt: 0.6s).
To load target model SDXL
Begin to load 1 model
Moving model(s) has taken 1.63 seconds
100%|██████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.90it/s]
Total progress: 100%|██████████████████████████████████████████████| 20/20 [00:12<00:00,  1.61it/s]
100%|██████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.91it/s]
Total progress: 100%|██████████████████████████████████████████████| 20/20 [00:10<00:00,  1.94it/s]
To load target model SDXLClipModel█████████████████████████████████| 20/20 [00:10<00:00,  1.98it/s]

(fp8 was enabled in settings now)

Begin to load 1 model
Moving model(s) has taken 2.11 seconds
To load target model SDXL
Begin to load 1 model
Moving model(s) has taken 1.21 seconds
100%|██████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.96it/s]
Total progress: 100%|██████████████████████████████████████████████| 20/20 [00:10<00:00,  1.96it/s]
Total progress: 100%|██████████████████████████████████████████████| 20/20 [00:10<00:00,  1.80it/s]

(log after using fp8 commandline flag)

venv "C:\Users\Admin\stable-diffusion-webui-forge\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f0.0.12-latest-110-g15bb49e7
Commit hash: 15bb49e761e837c0a3463a736762d11941ea69f7
Faceswaplab : Use GPU requirements
Checking faceswaplab requirements
0.008753499999784253
Launching Web UI with arguments: --unet-in-fp8-e4m3fn
Total VRAM 8192 MB, total RAM 32689 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
VAE dtype: torch.bfloat16
Using pytorch cross attention
ControlNet preprocessor location: C:\Users\Admin\stable-diffusion-webui-forge\models\ControlNetPreprocessor
[-] ADetailer initialized. version: 24.1.2, num models: 16
Loading weights [aeb7e9e689] from C:\Users\Admin\stable-diffusion-webui-forge\models\Stable-diffusion\juggernautXL_v8Rundiffusion.safetensors
C:\Users\Admin\stable-diffusion-webui-forge\modules\gradio_extensons.py:25: GradioDeprecationWarning: `optional` parameter is deprecated, and it has no effect
  res = original_IOComponent_init(self, *args, **kwargs)
2024-02-11 00:48:05,317 - ControlNet - INFO - ControlNet UI callback registered.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
model_type EPS
UNet ADM Dimension 2816
Startup time: 20.0s (prepare environment: 4.4s, import torch: 4.3s, import gradio: 1.2s, setup paths: 0.8s, initialize shared: 0.1s, other imports: 0.7s, load scripts: 6.3s, create ui: 1.4s, gradio launch: 0.7s).
Using pytorch attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using pytorch attention in VAE
extra {'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.text_projection'}
Loading VAE weights specified in settings: C:\Users\Admin\stable-diffusion-webui-forge\models\VAE\sdxl-vae-fp16.safetensors
To load target model SDXLClipModel
Begin to load 1 model
Moving model(s) has taken 0.37 seconds
Model loaded in 14.0s (load weights from disk: 1.9s, forge load real models: 11.0s, load VAE: 0.4s, calculate empty prompt: 0.6s).
To load target model SDXL
Begin to load 1 model
Moving model(s) has taken 0.69 seconds
100%|██████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.85it/s]
Traceback (most recent call last):█████████████████████████████████| 20/20 [00:09<00:00,  1.94it/s]
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules_forge\main_thread.py", line 37, in loop
    task.work()
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules_forge\main_thread.py", line 26, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\txt2img.py", line 111, in txt2img_function
    processed = processing.process_images(p)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\processing.py", line 749, in process_images
    res = process_images_inner(p)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\processing.py", line 935, in process_images_inner
    x_samples_ddim = decode_latent_batch(p.sd_model, samples_ddim, target_device=devices.cpu, check_for_nans=True)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\processing.py", line 633, in decode_latent_batch
    sample = decode_first_stage(model, batch[i:i + 1])[0]
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\sd_samplers_common.py", line 74, in decode_first_stage
    return samples_to_images_tensor(x, approx_index, model)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\sd_samplers_common.py", line 52, in samples_to_images_tensor
    x_sample = sd_vae_taesd.decoder_model()(sample.to(devices.device, devices.dtype)).detach()
  File "C:\Users\Admin\stable-diffusion-webui-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui-forge\venv\lib\site-packages\torch\nn\modules\container.py", line 215, in forward
    input = module(input)
  File "C:\Users\Admin\stable-diffusion-webui-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui-forge\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\sd_vae_taesd.py", line 23, in forward
    return torch.tanh(x / 3) * 3
RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn'
"div_true_cuda" not implemented for 'Float8_e4m3fn'
*** Error completing request
*** Arguments: ('task(burygx4gukalmsa)', <gradio.routes.Request object at 0x00000242ABC84DF0>, 'photograph of a cute cat reading a kids book', '', [], 20, 'DPM++ 2M', 1, 1, 3.5, 1024, 1024, False, 0.5, 1.5, 'ESRGAN-NMKD-Superscale-4x', 15, 0, 0, 'Use same checkpoint', 'Use same sampler', '', '', [], 0, False, '', 0.8, -1, False, -1, 0, 0, 0, UiControlNetUnit(input_mode=<InputMode.SIMPLE: 'simple'>, use_preview_as_input=False, batch_image_dir='', batch_mask_dir='', batch_input_gallery=[], batch_mask_gallery=[], generated_image=None, mask_image=None, hr_option='Both', enabled=False, module='None', model='None', weight=1, image=None, resize_mode='Crop and Resize', processor_res=-1, threshold_a=-1, threshold_b=-1, guidance_start=0, guidance_end=1, pixel_perfect=False, control_mode='Balanced', save_detected_map=True), UiControlNetUnit(input_mode=<InputMode.SIMPLE: 'simple'>, use_preview_as_input=False, batch_image_dir='', batch_mask_dir='', batch_input_gallery=[], batch_mask_gallery=[], generated_image=None, mask_image=None, hr_option='Both', enabled=False, module='None', model='None', weight=1, image=None, resize_mode='Crop and Resize', processor_res=-1, threshold_a=-1, threshold_b=-1, guidance_start=0, guidance_end=1, pixel_perfect=False, control_mode='Balanced', save_detected_map=True), UiControlNetUnit(input_mode=<InputMode.SIMPLE: 'simple'>, use_preview_as_input=False, batch_image_dir='', batch_mask_dir='', batch_input_gallery=[], batch_mask_gallery=[], generated_image=None, mask_image=None, hr_option='Both', enabled=False, module='None', model='None', weight=1, image=None, resize_mode='Crop and Resize', processor_res=-1, threshold_a=-1, threshold_b=-1, guidance_start=0, guidance_end=1, pixel_perfect=False, control_mode='Balanced', save_detected_map=True), False, 7, 1, 'Constant', 0, 'Constant', 0, 1, 'enable', 'MEAN', 'AD', 1, False, 1.01, 1.02, 0.99, 0.95, False, 256, 2, 0, False, False, 3, 2, 0, 0.35, True, 'bicubic', 'bicubic', False, 0, 'anisotropic', 0, 
'reinhard', 100, 0, 'subtract', 0, 0, 'gaussian', 'add', 0, 100, 127, 0, 'hard_clamp', 5, 0, 'None', 'None', False, 'MultiDiffusion', 768, 768, 64, 4, False, 0.5, 2, False, False, False, False, False, 'base', False, False, {'ad_model': 'face_yolov8n_v2.pt', 'ad_prompt': '', 'ad_negative_prompt': '', 'ad_confidence': 0.81, 'ad_mask_k_largest': 0, 'ad_mask_min_ratio': 0, 'ad_mask_max_ratio': 1, 'ad_x_offset': 0, 'ad_y_offset': 0, 'ad_dilate_erode': 4, 'ad_mask_merge_invert': 'None', 'ad_mask_blur': 4, 'ad_denoising_strength': 0.4, 'ad_inpaint_only_masked': True, 'ad_inpaint_only_masked_padding': 32, 'ad_use_inpaint_width_height': False, 'ad_inpaint_width': 512, 'ad_inpaint_height': 512, 'ad_use_steps': False, 'ad_steps': 11, 'ad_use_cfg_scale': False, 'ad_cfg_scale': 7, 'ad_use_checkpoint': False, 'ad_checkpoint': 'Use same checkpoint', 'ad_use_vae': False, 'ad_vae': 'Use same VAE', 'ad_use_sampler': False, 'ad_sampler': 'DPM++ 2M Karras', 'ad_use_noise_multiplier': False, 'ad_noise_multiplier': 1, 'ad_use_clip_skip': False, 'ad_clip_skip': 1, 'ad_restore_face': False, 'ad_controlnet_model': 'None', 'ad_controlnet_module': 'None', 'ad_controlnet_weight': 1, 'ad_controlnet_guidance_start': 0, 'ad_controlnet_guidance_end': 1, 'is_api': ()}, {'ad_model': 'None', 'ad_prompt': '', 'ad_negative_prompt': '', 'ad_confidence': 0.3, 'ad_mask_k_largest': 0, 'ad_mask_min_ratio': 0, 'ad_mask_max_ratio': 1, 'ad_x_offset': 0, 'ad_y_offset': 0, 'ad_dilate_erode': 4, 'ad_mask_merge_invert': 'None', 'ad_mask_blur': 4, 'ad_denoising_strength': 0.4, 'ad_inpaint_only_masked': True, 'ad_inpaint_only_masked_padding': 32, 'ad_use_inpaint_width_height': False, 'ad_inpaint_width': 512, 'ad_inpaint_height': 512, 'ad_use_steps': False, 'ad_steps': 28, 'ad_use_cfg_scale': False, 'ad_cfg_scale': 7, 'ad_use_checkpoint': False, 'ad_checkpoint': 'Use same checkpoint', 'ad_use_vae': False, 'ad_vae': 'Use same VAE', 'ad_use_sampler': False, 'ad_sampler': 'DPM++ 2M Karras', 'ad_use_noise_multiplier': 
False, 'ad_noise_multiplier': 1, 'ad_use_clip_skip': False, 'ad_clip_skip': 1, 'ad_restore_face': False, 'ad_controlnet_model': 'None', 'ad_controlnet_module': 'None', 'ad_controlnet_weight': 1, 'ad_controlnet_guidance_start': 0, 'ad_controlnet_guidance_end': 1, 'is_api': ()}, True, False, 1, False, False, False, 1.1, 1.5, 100, 0.7, False, False, True, False, False, 0, 'Gustavosta/MagicPrompt-Stable-Diffusion', '', None, '', None, True, False, False, False, False, False, 0, 0, '0', 0, False, True, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, 1, 1, '', False, False, False, 1, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, '', None, True, False, False, False, False, False, 0, 0, '0', 0, False, True, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, 1, 1, '', False, False, False, 1, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, '', None, True, False, False, False, False, False, 0, 0, '0', 0, False, True, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, 1, 1, '', False, False, False, 1, 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, None, 1, 1, '', 1, 1, ['After Upscaling/Before Restore Face'], 0, 'Portrait of a [gender]', 'blurry', 20, ['DPM++ 2M Karras'], '', 0, False, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1,1', '0.2', False, False, False, 'Attention', [False], '0', '0', '0.4', None, '0', '0', False, False, False, 'positive', 'comma', 0, False, False, 'start', '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, False, False, False, 0, False, '', 5, 24, 12.5, 1000, '', 'DDIM', 0, 64, 64, '', 64, 7.5, 0.42, 'DDIM', 64, 64, 1, 0, 92, True, True, True, False, False, False, 'midas_v21_small', [], 30, '', 4, [], 1, '', '', '', '') {}
    Traceback (most recent call last):
      File "C:\Users\Admin\stable-diffusion-webui-forge\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
    TypeError: 'NoneType' object is not iterable

---

Additional information

Screenshot 2024-02-11 003455 Before enabling fp8

Screenshot 2024-02-11 003556 After enabling fp8

continue-revolution commented 9 months ago

In this fork, fp8 should be enabled by adding the command-line argument --unet-in-fp8-e4m3fn
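
For context on the name in that flag: e4m3fn is an 8-bit float with 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits. The "fn" suffix means finite-only, so there are no infinities and only the all-ones exponent and mantissa pattern is NaN. A small pure-Python decoder (the function name `decode_e4m3fn` is just for illustration) shows how coarse the format is:

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one e4m3fn-encoded byte (0..255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; the format has no infinities
    if exponent == 0:
        # Subnormal numbers: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal numbers: implicit leading 1, exponent bias of 7.
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


print(decode_e4m3fn(0x38))  # 1.0
print(decode_e4m3fn(0x39))  # 1.125, the next representable value after 1.0
print(decode_e4m3fn(0x7E))  # 448.0, the largest finite value
```

With only three mantissa bits, the format is meant for storing weights compactly rather than for doing arithmetic, which is consistent with PyTorch not implementing every operator for it.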

Necro-mancer commented 9 months ago

I tried that too, but it gives a CUDA error. The console log with the command-line argument is at the end of the report. If it works for everyone else, then the issue is definitely on my end. Maybe someone knowledgeable can look at the console log and tell me where the problem is.

Necro-mancer commented 9 months ago

Okay, now that I've looked at the log again, I think the issue stems from the VAE. I will try to mess with that.
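
One way to read a long traceback like the one in the report is to jump to the innermost frame, since that is where the exception was raised. A small sketch using only the standard library (the helper name `innermost_frame` is made up for this example; the sample text is copied from the log above):

```python
import re

# Matches the 'File "...", line N, in func' lines of a Python traceback.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def innermost_frame(traceback_text: str):
    """Return the last (innermost) frame of a traceback as a dict, or None."""
    frames = [m.groupdict() for m in FRAME_RE.finditer(traceback_text)]
    return frames[-1] if frames else None


sample = r'''
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\sd_samplers_common.py", line 52, in samples_to_images_tensor
    x_sample = sd_vae_taesd.decoder_model()(sample.to(devices.device, devices.dtype)).detach()
  File "C:\Users\Admin\stable-diffusion-webui-forge\modules\sd_vae_taesd.py", line 23, in forward
    return torch.tanh(x / 3) * 3
RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn'
'''

frame = innermost_frame(sample)
print(frame["path"], frame["line"], frame["func"])
```

For this log the innermost frame is line 23 of sd_vae_taesd.py, which points at the TAESD decoder rather than the UNet.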

Necro-mancer commented 9 months ago

So, after more testing: I had the VAE decoder (inside the VAE settings) set to TAESD, which errored out. Setting it back to Full now actually produces an image (with the fp8 UNet command-line flag). However, live preview only works with Approx cheap now. The other methods, i.e. TAESD and Approx NN, don't work, and interrupting the generation gives the same CUDA errors. Is there something that can be done about that? Approx cheap has horrendous quality. I just want TAESD to work again.

CCpt5 commented 9 months ago

The preview not being full quality is likely related to an issue I reported here last week: https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/51

Necro-mancer commented 9 months ago

Might or might not be related. From my testing, Full live preview doesn't work with fp8 even if it is forced in settings. Approx NN and TAESD don't give a preview either, and if generation is interrupted (i.e. the live preview method is used to decode the image), it gives the 'NoneType' object is not iterable error. Similarly, setting the VAE decoder to TAESD also gives the NoneType error when decoding the generated image (the inference steps complete normally without errors). Only Approx cheap works and, as expected, its quality is horrendous. TAESD worked with fp8 in auto1111. It also works in Forge without fp8.
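
The traceback in the original report ends at `torch.tanh(x / 3) * 3` in sd_vae_taesd.py, with the input cast to `devices.dtype` one frame earlier, which suggests the latent reaches TAESD as an fp8 tensor. A minimal sketch of one possible workaround, assuming PyTorch 2.1 or newer, is to upcast fp8 inputs before any arithmetic. The names `to_compute_dtype`, `UpcastClamp`, and `compute_dtype` are hypothetical, and this is not the change made in any Forge branch:

```python
import torch
import torch.nn as nn

# fp8 storage dtypes, looked up by name so the sketch still imports on
# PyTorch builds that predate them.
FP8_DTYPES = tuple(
    getattr(torch, name)
    for name in ("float8_e4m3fn", "float8_e5m2")
    if hasattr(torch, name)
)


def to_compute_dtype(x: torch.Tensor, compute_dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Upcast fp8 tensors to a dtype that supports arithmetic; pass others through."""
    if x.dtype in FP8_DTYPES:
        return x.to(compute_dtype)
    return x


class UpcastClamp(nn.Module):
    """Same math as the failing line, with an upcast guard in front of it."""

    def __init__(self, compute_dtype: torch.dtype = torch.float32):
        super().__init__()
        # On a GPU, torch.float16 or torch.bfloat16 could be passed instead.
        self.compute_dtype = compute_dtype

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = to_compute_dtype(x, self.compute_dtype)
        return torch.tanh(x / 3) * 3
```

This only guards the one operation named in the error. The rest of the TAESD decoder would still need its weights in a matching non-fp8 dtype, so treat it as a starting point for testing rather than a fix.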

Tolga077 commented 8 months ago

Same here: TAESD + fp8 is not working.

Tolga077 commented 8 months ago

I tried the original A1111, and there TAESD and fp8 work together without problems.

CCpt5 commented 8 months ago

If you're decently familiar w/ git/python, have you tried using the branch that fixed the live preview for me? It's still not merged anywhere else:

https://github.com/lllyasviel/stable-diffusion-webui-forge/tree/fix/preview-full

Branch = preview-full

Only one file (sd_samplers_common.py) was modified, so you could try replacing that file on your side and see if it fixes it. It fixed the live preview quality not being full for me.

https://github.com/lllyasviel/stable-diffusion-webui-forge/compare/main...fix/preview-full

Tolga077 commented 8 months ago

I want to use TAESD not only for the preview but also in the generation phase. It saves some VRAM.

lllyasviel / stable-diffusion-webui-forge