invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

[bug]: Offload parts of models when exceeding vram? #7231

Open Jonseed opened 2 weeks ago

Jonseed commented 2 weeks ago

Is there an existing issue for this problem?

Operating system

Windows

GPU vendor

Nvidia (CUDA)

GPU model

RTX 3060

GPU VRAM

12GB

Version number

5.3

Browser

Edge 130.0.2849.46

Python dependencies

{ "accelerate": "0.30.1", "compel": "2.0.2", "cuda": "12.4", "diffusers": "0.27.2", "numpy": "1.26.4", "opencv": "4.9.0.80", "onnx": "1.16.1", "pillow": "11.0.0", "python": "3.10.6", "torch": "2.4.1+cu124", "torchvision": "0.19.1+cu124", "transformers": "4.41.1", "xformers": null }

What happened

I'd like to use Invoke, but with the Q8 GGUF quantized Flux and the bnb int8 T5 encoder I get out-of-memory errors on my 3060 12GB. I don't get OOM with Q8 Flux in ComfyUI or Auto1111/Forge (even though I know some of the model is being offloaded to RAM, since the Q8 is 12.4GB). I have to step down to the Q6 quant of Flux in Invoke (or bnb-nf4), and I don't like doing that. Does Invoke need more work on memory optimizations, offloading parts of models to CPU RAM or shared GPU memory when they exceed VRAM? Or is this an option that I need to enable somewhere? (The kind of offloading I mean is sketched below.)
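For reference, here is a minimal sketch of that kind of offloading as plain diffusers exposes it on its pipelines. This is not Invoke's internal model manager, and it assumes a diffusers build with Flux support (newer than the 0.27.2 pinned above); the model id is just an example.

```python
# Sketch only: model-level CPU offload in plain diffusers.
# Assumes a diffusers release with Flux support (newer than 0.27.2);
# this is NOT how Invoke manages models internally.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # example checkpoint, not Invoke's
    torch_dtype=torch.bfloat16,
)

# Moves each component (transformer, text encoders, VAE) onto the GPU
# only while it is needed, then back to CPU RAM afterwards.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat", num_inference_steps=20).images[0]
image.save("cat.png")
```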

What you expected to happen

No OOM.

How to reproduce the problem

No response

Additional context

No response

Discord username

No response

Jonseed commented 2 weeks ago

I have already tried the "prefer sysmem fallback" option in the Nvidia Control Panel, and yet I still get OOM when Invoke tries to load Q8 Flux.
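In case it helps with debugging, a quick way (plain PyTorch, not Invoke-specific) to confirm how much VRAM is actually free right before the model load that OOMs:

```python
# Plain PyTorch diagnostic: report free vs. total VRAM and what this
# process has already allocated/reserved. Run in the same environment.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free VRAM:       {free_bytes / 1024**3:.2f} GiB")
print(f"total VRAM:      {total_bytes / 1024**3:.2f} GiB")
print(f"torch allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"torch reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
```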

thiagojramos commented 2 weeks ago

Same here. The same thing happens to me, and I'm using the same GPU. On Forge I can load the Q8 GGUF + t5xxl_fp16 (it gets a bit slow and freezes the first time I load it), but after that it works fine and generates images without any issues.

Jonseed commented 2 weeks ago

@thiagojramos yeah, I use Q8 GGUF and t5xxl_fp16 in ComfyUI all the time without any memory issues. Sometimes it gets slow, like when I have multiple LoRAs, but it never OOMs.

hugodopradofernandes commented 2 weeks ago

I have the same problem. I managed to generate a few images, but now I only get OOM. I would prefer the model parts to be loaded/unloaded at each generation, even if it takes more time, rather than not being able to generate at all (something like the sketch below).
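That load/unload-per-generation behaviour is roughly what diffusers calls sequential CPU offload, where individual submodules are moved to the GPU only for their forward pass. A rough sketch under the assumption of a stock diffusers pipeline (the SDXL checkpoint is just for illustration; this is not Invoke's model manager):

```python
# Sketch of "load/unload parts at each generation" using plain diffusers
# sequential offload: much lower VRAM use, noticeably slower inference.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example checkpoint only
    torch_dtype=torch.float16,
)

# Submodules are streamed to the GPU one at a time during inference and
# returned to CPU RAM immediately afterwards.
pipe.enable_sequential_cpu_offload()

image = pipe("a watercolor landscape", num_inference_steps=30).images[0]
image.save("landscape.png")
```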

hellcore-org commented 1 week ago

Same here. Working on a 4060 Ti with 16GB. Flux FP8 works very well in Forge and ComfyUI. In InvokeAI the PC freezes, takes a very long time, or just crashes. Here, too, memory utilization seems to be out of control (32GB; currently considering an upgrade to 64GB).