leejet / stable-diffusion.cpp

Stable Diffusion in pure C/C++

Enabling CUDA GPU acceleration #6

Open · klosax opened this issue 10 months ago

klosax commented 10 months ago

It is possible to use cuBLAS by enabling it when compiling: `-DGGML_CUBLAS=ON`

Maybe add this to the readme?
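
For the readme, the full sequence could look something like this (a sketch assuming the repo's usual CMake workflow; only the flag itself is confirmed above):

```sh
# out-of-tree CMake build with cuBLAS enabled (assumes the CUDA toolkit is installed)
mkdir build && cd build
cmake .. -DGGML_CUBLAS=ON
cmake --build . --config Release
```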

Green-Sky commented 10 months ago

did you run it/benchmark it?

klosax commented 10 months ago

Tests on a 6-core Ryzen 5 and a GTX 1660. The `--threads` parameter is set to the number of physical cores.

cuBLAS is fastest when using the F16 model, about 4.5 s (22%) faster than without BLAS.

OpenBLAS is only faster than no BLAS (by 1.6 s) when using the F32 model and setting the OpenBLAS thread count to the number of physical cores.

Time for one sampling step (in seconds):

| test | Q4_0 | Q8_0 | F16 | F32 | comment |
| --- | --- | --- | --- | --- | --- |
| cuBLAS | 16.12 | 16.60 | 16.05 | 16.28 | |
| w/o BLAS | 19.46 | 19.20 | 20.54 | 23.86 | |
| OpenBLAS | 20.02 | 19.86 | 20.77 | 22.28 | env var, 6 threads |
| OpenBLAS | 30.86 | 29.30 | 32.26 | 29.68 | default, 12 threads |

Time for `decode_first_stage` stays at about 56 s in all tests.

OpenBLAS environment variable to set the number of threads: `export OPENBLAS_NUM_THREADS=6`

klosax commented 10 months ago

It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and `--threads`. I guess this also applies to llama.cpp when using OpenBLAS.
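
Something like this could automate the sweep (an illustrative sketch; the binary name, model path, and prompt are placeholders, only `OPENBLAS_NUM_THREADS` and `--threads` come from the discussion above):

```sh
# try different splits of the 12 hardware threads between OpenBLAS and ggml
for blas_t in 1 2 4 6; do
  for ggml_t in 1 2 4 6; do
    echo "=== OPENBLAS_NUM_THREADS=$blas_t --threads $ggml_t ==="
    OPENBLAS_NUM_THREADS=$blas_t ./sd --threads $ggml_t \
      -m models/model-f32.bin -p "benchmark prompt"
  done
done
```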

leejet commented 10 months ago

> It is possible to use cuBLAS by enabling it when compiling: `-DGGML_CUBLAS=ON`
>
> Maybe add this to the readme?

GPU support is already on my TODO list, and I'm working on adding it. However, I need to make `ggml_conv_2d` work on the GPU first.

leejet commented 10 months ago

> Time for `decode_first_stage` stays at about 56 s in all tests.

This is expected, as ggml's conv 2d is not optimized by BLAS and does not run on the GPU. I'm working on this issue.

leejet commented 10 months ago

> It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and `--threads`. I guess this also applies to llama.cpp when using OpenBLAS.

ggml's threads are always busy-waiting, even when no computation task is being performed. As a result, they compete with the BLAS threads, sometimes causing a net slowdown. This is the point that needs to be optimized.
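
To illustrate the effect, here is a self-contained sketch (not ggml's actual scheduler code) of a spinning worker next to a thread doing the real work:

```c
/* sketch: a busy-waiting worker keeps its core at 100% load the whole
 * time the compute thread (stand-in for a BLAS GEMM) is running */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int work_done;

static void *spinning_worker(void *arg) {
    (void)arg;
    unsigned long spins = 0;
    while (!atomic_load_explicit(&work_done, memory_order_acquire))
        spins++; /* busy wait: this core is unavailable to the BLAS threads */
    printf("worker spun %lu times while waiting\n", spins);
    return NULL;
}

static void *compute_thread(void *arg) {
    (void)arg;
    volatile double acc = 0.0;
    for (long i = 0; i < 200000000L; i++)
        acc += 1.0; /* stand-in for a large matrix multiplication */
    atomic_store_explicit(&work_done, 1, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t w, c;
    pthread_create(&w, NULL, spinning_worker, NULL);
    pthread_create(&c, NULL, compute_thread, NULL);
    pthread_join(c, NULL);
    pthread_join(w, NULL);
    return 0;
}
```

While the compute thread runs, the spinner occupies a full core; with several ggml workers plus the BLAS library's own thread pool, the CPU ends up oversubscribed.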

ggerganov commented 10 months ago

> This is expected, as ggml's conv 2d is not optimized by BLAS and does not run on the GPU. I'm working on this issue.

Btw, I'm not sure the CPU version of conv 2d is optimal; most likely it is not. There might be additional improvements possible if it were implemented properly.
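
For reference, the standard way to implement it "properly" is to lower the convolution to a GEMM via im2col, which is also what would let BLAS/cuBLAS kick in. A rough single-channel, stride-1, no-padding sketch (a hypothetical helper, not ggml's API):

```c
/* im2col: unroll each kh x kw input patch into one column so that
 * conv2d(src, kernel) becomes a single matrix multiply
 * (the kernel as a 1 x (kh*kw) row times the (kh*kw) x (oh*ow) column matrix),
 * i.e. exactly the shape of call that OpenBLAS/cuBLAS accelerate. */
#include <stddef.h>

void im2col(const float *src, float *col, int h, int w, int kh, int kw) {
    const int oh = h - kh + 1, ow = w - kw + 1; /* output size: stride 1, no padding */
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++)
            for (int ky = 0; ky < kh; ky++)
                for (int kx = 0; kx < kw; kx++)
                    /* row = kernel element, column = output pixel */
                    col[(size_t)(ky * kw + kx) * (oh * ow) + (y * ow + x)] =
                        src[(y + ky) * w + (x + kx)];
}
```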

klosax commented 10 months ago

> GPU support is already on my TODO list, and I'm working on adding it. However, I need to make `ggml_conv_2d` work on the GPU first.

Great. But using cuBLAS is currently better than anything else if you have a CUDA GPU.

> ggml's threads are always busy-waiting, even when no computation task is being performed. As a result, they compete with the BLAS threads, sometimes causing a net slowdown. This is the point that needs to be optimized.

I don't know if OpenBLAS has any real benefit over building without BLAS. And it looks like support for OpenBLAS will soon be removed; see https://github.com/ggerganov/llama.cpp/pull/2372

Happenedtostumblein commented 10 months ago

@klosax I suspect something is off with your benchmarks because it seems like the speed gain should be much higher with GPU vs CPU. Am I wrong?

Using `-DGGML_CLBLAST` and applying the patch provided by the ggml creator in #48, the GPU does get activated, but CPU temps don't drop by much. The GPU stays hot after completion, so maybe those flags are not really doing anything.

Did yesterday's CUDA update affect your speed at all?

LeonNerd commented 9 months ago

So I have a question: can the project only run on the CPU? I found that VRAM usage stays low when enabling cuBLAS with `-DGGML_CUBLAS=ON`, and generation takes a long time even when using the FP16 model. What should I do to get better inference time?

juniofaathir commented 9 months ago

@LeonNerd I think the main reason this project exists is that it can run on a CPU only, with low RAM. But looking at the README.md, GPU inference is still in development.

And for a shorter generation time, maybe just generate 512x512 images only (don't use any BLAS), or get a better CPU(?)

LeonNerd commented 9 months ago

Can we get some inspiration from clip.cpp?

FSSRepo commented 7 months ago

@LeonNerd You can now activate the CUDA backend with `-DSD_CUBLAS=ON`. @klosax, you can close this issue.
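
For anyone building now, the sequence is the same as before with the new flag (a sketch assuming the usual CMake workflow):

```sh
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release
```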

Green-Sky commented 7 months ago

@klosax would be cool if you could rerun the benchmarks :wink: