klosax opened this issue 10 months ago
Did you run it / benchmark it?
Tests on a 6-core Ryzen 5 and a GTX 1660. The `--threads` parameter is set to the number of physical cores.
cuBLAS is fastest when using the F16 model, about 4.5 s (22%) faster than without BLAS.
OpenBLAS is only faster (by 1.6 s) than no BLAS when using the F32 model and setting the OpenBLAS thread count to the number of physical cores.
Time for one sampling step (seconds):
test | Q4_0 | Q8_0 | F16 | F32 | comment |
---|---|---|---|---|---|
cublas | 16.12 | 16.60 | 16.05 | 16.28 | |
w/o blas | 19.46 | 19.20 | 20.54 | 23.86 | |
openblas | 20.02 | 19.86 | 20.77 | 22.28 | env var 6 threads |
openblas | 30.86 | 29.30 | 32.26 | 29.68 | default 12 threads |
Time for decode_first_stage stays at about 56s in all tests.
OpenBLAS environment variable to set the number of threads:

```sh
export OPENBLAS_NUM_THREADS=6
```
It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and `--threads`. I guess this also applies to llama.cpp when using OpenBLAS.
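One way to search that space is a simple sweep. A hypothetical sketch (the `./sd` binary name, thread counts, and prompt are placeholders, adjust to your build); it prints one command line per combination, so you can review them before piping the output to `sh`:

```sh
# Dry-run sweep over OPENBLAS_NUM_THREADS x ggml --threads combinations.
for blas_t in 1 2 3 6; do
  for ggml_t in 1 3 6; do
    echo "OPENBLAS_NUM_THREADS=$blas_t ./sd --threads $ggml_t -p 'a cat'"
  done
done
```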
It is possible to use cuBLAS by enabling it when compiling:

```sh
-DGGML_CUBLAS=ON
```
Maybe add this to the readme?
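For reference, a typical out-of-tree CMake build with that flag enabled might look like the following (the build directory layout is an assumption, adjust to your setup):

```sh
mkdir -p build && cd build
cmake .. -DGGML_CUBLAS=ON
cmake --build . --config Release
```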
GPU support is already on my TODO list, and I'm working on adding it. However, I need to make ggml_conv_2d work on the GPU first.
> Time for decode_first_stage stays at about 56s in all tests.

This is expected, as ggml's conv2d is not accelerated by BLAS and does not run on the GPU. I'm working on this issue.
> It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and `--threads`.

That's because ggml's threads are always busy-waiting, even when there is no computation to perform. As a result, they compete with the BLAS threads, sometimes resulting in a net slowdown. This is the point that needs to be optimized.
> This is expected, as ggml's conv2d is not accelerated by BLAS and does not run on the GPU. I'm working on this issue.

Btw, I'm not sure if the CPU version of conv2d is optimal - most likely it is not. There might be additional improvements possible if implemented properly.
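For context, the standard way to make a convolution benefit from BLAS is to lower it to im2col followed by a single GEMM. A minimal NumPy sketch of the idea (illustrative only, this is not ggml's code):

```python
import numpy as np

def conv2d_im2col(x, w):
    # x: (C_in, H, W) input, w: (C_out, C_in, KH, KW) kernels;
    # stride 1, no padding, for brevity.
    c_in, h, wdt = x.shape
    c_out, _, kh, kw = w.shape
    oh, ow = h - kh + 1, wdt - kw + 1
    # im2col: gather every receptive field into one column -> (C_in*KH*KW, OH*OW)
    cols = np.empty((c_in * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for c in range(c_in):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[c, i:i + oh, j:j + ow].reshape(-1)
                idx += 1
    # One big matrix multiply -> (C_out, OH*OW), reshaped back to feature maps.
    return (w.reshape(c_out, -1) @ cols).reshape(c_out, oh, ow)
```

With this layout the entire convolution becomes a single matrix multiplication, which is exactly the kind of operation OpenBLAS or cuBLAS accelerate.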
> GPU support is already on my TODO list, and I'm working on adding it. However, I need to make ggml_conv_2d work on the GPU first.

Great. But using cuBLAS is currently better than anything else if you have a CUDA GPU.
> That's because ggml's threads are always busy-waiting, even when there is no computation to perform. As a result, they compete with the BLAS threads, sometimes resulting in a net slowdown.

Don't know if OpenBLAS will have any real benefit over building without BLAS. And it looks like support for OpenBLAS will soon be removed. See https://github.com/ggerganov/llama.cpp/pull/2372
@klosax I suspect something is off with your benchmarks because it seems like the speed gain should be much higher with GPU vs CPU. Am I wrong?
Using `-DGGML_CLBLAST` and applying the patch provided by the ggml creator in #48, the GPU does get activated, BUT CPU temps don't drop by much. The GPU stays hot after completion, so maybe those flags are not really doing anything.
Did yesterday's CUDA update affect your speed at all?
So, I have a question: can the project only run on the CPU? I found that VRAM usage stays low when enabling cuBLAS with `-DGGML_CUBLAS=ON`, and inference takes a long time with the FP16 model. What should I do for a better inference time?
@LeonNerd I think the main reason this project exists is to run with CPU only and low RAM. But looking at the README.md, GPU inference is still in development.
And for shorter generation time, maybe just generate 512 x 512 images only (don't use any BLAS), or get a better CPU(?)
Can we get some inspiration from clip.cpp?
@LeonNerd You can now activate the CUDA backend with `-DSD_CUBLAS=ON`. @klosax, you can close this issue.
@klosax would be cool if you could rerun the benchmarks :wink: