leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

Request: reduce memory usage for text2img #376

Open xiaogz opened 2 months ago

xiaogz commented 2 months ago

Is it possible to reduce the memory usage from spiking to ~15 GB when doing text2img? I'm currently following this guide, using the default cat prompt with leejet's q4_k Flux Schnell model; his q2_k model behaves the same. The guide's link to the VAE safetensors file is inaccessible for me since I'm not part of flux-dev, so I used the official black-forest-labs VAE weights instead.

Memory can spike to roughly 15 GB before settling at 6 or 4 GB.

Using the --vae-tiling flag lowers the spike to 12.95 GB, but I'm not aware of any other options to further reduce memory consumption.

For the Metal q2_k run, I still see the 12.95 GB memory spike in Activity Monitor:

[DEBUG] ggml_extend.hpp:1029 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1029 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1029 - flux params backend buffer size =  3732.51 MB(VRAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1029 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
...
[INFO ] stable-diffusion.cpp:486  - total params memory size = 13145.92MB (VRAM 3827.08MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 3732.51MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)

Similarly for the CPU q2_k run:

[DEBUG] ggml_extend.hpp:1029 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1029 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1029 - flux params backend buffer size =  3732.51 MB(RAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1029 - vae params backend buffer size =  94.57 MB(RAM) (138 tensors)
...
[INFO ] stable-diffusion.cpp:486  - total params memory size = 13145.92MB (VRAM 0.00MB, RAM 13145.92MB): clip 9318.83MB(RAM), unet 3732.51MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)

It would be great if memory usage topped out under 8 GB, thanks!

EDIT: More information:

SenninOne commented 2 months ago

I used a q8 version of the CLIP that @Green-Sky uploaded on Hugging Face. I'm not sure how this affects quality, but it lowered RAM usage when loading the model.
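(For reference, this is roughly how a separately quantized text encoder is passed in. The flag names below follow the repo's Flux guide; the model file names are placeholders, not the exact files from that upload.)

```sh
# Placeholder paths; substitute the quantized t5xxl gguf you actually downloaded.
./sd --diffusion-model models/flux1-schnell-q2_k.gguf \
     --clip_l models/clip_l.safetensors \
     --t5xxl models/t5xxl_q8_0.gguf \
     --vae models/ae.safetensors \
     -p "a lovely cat holding a sign says 'flux.cpp'" \
     --cfg-scale 1.0 --sampling-method euler -v
```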

leejet commented 2 months ago

Since the text encoder is running on the CPU, the actual VRAM used is less than 4 GB in the log you posted.

xiaogz commented 2 months ago

Sorry, I meant to ask whether it's possible to reduce CPU RAM usage. VRAM usage is definitely under 4 GB, yes, but RAM usage is quite high: even with video memory sharing some of the load, RAM usage is >8 GB.

Green-Sky commented 2 months ago

I think there are a couple of things here.

  1. The model is loaded into RAM and then copied from RAM to VRAM, so for a moment it exists twice on the device.
  2. sd.cpp uses im2col to convert convolutions into matmul computations, which is very space inefficient. Without looking at the actual code, I have read that it can result in 80% more (compute) memory usage. (A rough sketch of the expansion follows this list.)
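To make the im2col point concrete, here is a minimal back-of-the-envelope sketch. The tensor shapes are illustrative assumptions, not taken from the actual Flux/VAE graph; the point is only that unrolling every KxK input patch into a column multiplies that activation's footprint by roughly K^2.

```cpp
#include <cstdio>

// Rough estimate of the extra compute memory im2col needs for one convolution.
// With stride 1 and same-padding there is one output position per input pixel,
// and each column holds C*K*K values, so (ignoring edges) every input element
// is duplicated about K*K times in the unrolled buffer.
int main() {
    // Illustrative shapes only (NOT from the real Flux/VAE graph).
    const long long C = 128, H = 512, W = 512; // channels, height, width
    const long long K = 3;                     // 3x3 kernel
    const long long elem = 4;                  // bytes per f32 element

    const long long input_bytes  = C * H * W * elem;
    const long long im2col_bytes = (C * K * K) * (H * W) * elem;

    std::printf("input activation: %lld MiB\n", input_bytes / (1024 * 1024));
    std::printf("im2col buffer   : %lld MiB (%.1fx the input)\n",
                im2col_bytes / (1024 * 1024),
                (double)im2col_bytes / (double)input_bytes);
    return 0;
}
```

For a 3x3 kernel that is about a 9x expansion of the tensor being convolved; how much of the whole graph's compute memory that accounts for depends on which tensors dominate, so a whole-graph overhead like the 80% figure above is plausible.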
CrushDemo01 commented 1 week ago

> Since the text encoder is running on the CPU, the actual VRAM used is less than 4 GB in the log you posted.

I want to know whether I can configure this myself so that CLIP and T5 run in VRAM.