Closed Necro-mancer closed 1 month ago
fp8 in this fork should be set via adding command line argument --unet-in-fp8-e4m3fn
Tried that too, but it gives a CUDA error. The console log with the command-line argument is at the end. If it's working for everyone else, then the issue is definitely on my end. Maybe someone knowledgeable can look at the console log and tell me where the problem is.
Okay, now that I've looked at the log again, I think the issue stems from the VAE. Will try to mess with that.
So, after more testing: I had the VAE decoder (inside VAE settings) set to TAESD, which errored out. Setting it back to Full now actually produces an image (with the fp8 UNet command-line flag). However, live preview only works with Approx cheap now. The other methods, i.e. TAESD and Approx NN, don't work, and interrupting the generation gives the same CUDA errors. Is there something that can be done about that? Approx cheap has horrendous quality. I just want TAESD to work again.
The preview not being full quality is likely related to an issue I reported here last week: https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/51
Might or might not be related. From my testing, full live preview doesn't work with fp8 even if it is forced in settings. Approx NN and TAESD don't give a preview (same for the full VAE preview), and if generation is interrupted (i.e. the live preview method is used to decode the image), it gives the 'NoneType' object is not iterable error. Similarly, setting the VAE decoder to TAESD also gives the NoneType error when decoding the generated image (the inference steps complete normally without errors). Only Approx cheap works, and as expected its quality is horrendous. TAESD worked with fp8 in auto1111. It also works in Forge without fp8.
Same here: TAESD + fp8 not working.
I tried og a1111 and there taesd and fp8 work without problems.
If you're decently familiar w/ git/python, have you tried using the branch that fixed the live preview for me? It's still not merged anywhere else:
https://github.com/lllyasviel/stable-diffusion-webui-forge/tree/fix/preview-full
Branch = fix/preview-full
Only one file (sd_samplers_common.py) was modified, so you could try replacing that file on your side and see if it fixes it. It fixed the live preview quality not being full for me.
https://github.com/lllyasviel/stable-diffusion-webui-forge/compare/main...fix/preview-full
I want to use TAESD not only for preview but also in the generation phase. It saves some VRAM.
Checklist
What happened?
RTX 3070 8 GB, Windows 11, latest drivers.
fp8 doesn't work. Is it a mistake on my end, or is it something that has not been implemented yet?
Toggling the option in settings has no effect on VRAM usage, even after restarting the webui and console. I tried the fp8 UNet command-line arguments, but they give CUDA errors. This is the exact error: RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn' (same for float8_e5). I know that Forge inherently uses less VRAM than auto1111 at default settings. But with fp8 enabled in auto1111, I can do much more with SDXL without overflowing into shared VRAM (and generation slowing to a crawl).
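For a rough sense of why fp8 matters on an 8 GB card, here is a back-of-the-envelope sketch in Python. It assumes the SDXL UNet has about 2.6 billion parameters (a commonly cited figure, not something measured in this thread) and counts weight storage only, so activations, the VAE, and the text encoders are ignored. The function name and constants are made up for illustration and are not part of Forge or auto1111.

```python
# Rough estimate of the memory needed to hold UNet weights at different precisions.
# ASSUMPTION: ~2.6 billion parameters, a commonly cited size for the SDXL UNet.
SDXL_UNET_PARAMS = 2_600_000_000

# Storage size of one parameter, in bytes, for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("fp32", "fp16", "fp8"):
        size = weight_memory_gib(SDXL_UNET_PARAMS, dtype)
        print(f"{dtype}: {size:.2f} GiB")
```

Under those assumptions, fp8 weights take half the space of fp16 weights, which is consistent with the extra headroom described above for auto1111. Actual VRAM use will differ, since this counts nothing but the UNet weights.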
Steps to reproduce the problem
Enable fp8 from settings. Generate image.
What should have happened?
Use less vram
What browsers do you use to access the UI ?
Mozilla Firefox
Sysinfo
sysinfo-2024-02-10-19-37.json
Console logs
Additional information
Before enabling fp8
After enabling fp8