JohannesGaessler closed this issue 1 year ago.
Is there a page I could refer to for all those compilation options?
Let's consolidate this into #8. Looking forward to your PR :)
The compilation options are listed on the llama.cpp README.
Understood. I'm personally not an expert in llama.cpp, and it would be great to rely on pros like you to help us find the best values for `LLAMA_CUDA_FORCE_DMMV`, `LLAMA_CUDA_DMMV_X`, `LLAMA_CUDA_MMV_Y`, `LLAMA_CUDA_F16`, and `LLAMA_CUDA_KQUANTS_ITER`.
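For reference, here is a rough sketch of how such flags can be passed when building with the make-based CUDA build of that era (the values shown are illustrative placeholders, not tuned recommendations):

```sh
# Sketch, assuming the LLAMA_CUBLAS-era make build.
# The values below are illustrative, not recommendations.
make clean
make LLAMA_CUBLAS=1 \
     LLAMA_CUDA_DMMV_X=32 \
     LLAMA_CUDA_MMV_Y=1 \
     LLAMA_CUDA_KQUANTS_ITER=2
```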
Also CC @zxybazh, who contributed the llama.cpp results.
Thanks @JohannesGaessler for the recommended compilation settings! On the official llama.cpp README I found that `LLAMA_CUDA_MMV_Y` carries the note "Does not affect k-quants." Since we are using a `q4_k_m` ggml binary, would you please elaborate on how to understand the impact of this option?

On the other hand, I'm definitely not an expert with llama.cpp; could you point me to a more comprehensive guide for setting the correct compilation flags? A few typical test GPUs are the 3090, 3090 Ti, 4090, 4090 Ti, A10G, A100, and H100. This could greatly benefit all CUDA GPU users of llama.cpp.
I forgot to update that part of the README. It definitely does affect k-quants.
More generally, I don't know for certain the optimal compilation settings for those GPUs because I only have an RTX 3090 and several P40s for testing. That's why I left them as compilation options with relatively conservative defaults that should at least not severely degrade performance on any GPU.
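Since the optimum is hardware-dependent, one practical approach is to rebuild with each candidate value and compare generation speed on the target GPU. A rough sketch under the same make-based build assumption (the model path, `-ngl`, and prompt are placeholders; compare the eval-time throughput that `./main` prints):

```sh
#!/bin/sh
# Sketch: sweep LLAMA_CUDA_MMV_Y and compare the timings printed by ./main.
# MODEL is a placeholder path; adjust -ngl and -n for your setup.
MODEL=models/7B/ggml-model-q4_k_m.bin
for y in 1 2 4; do
    make clean
    make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=$y
    echo "=== LLAMA_CUDA_MMV_Y=$y ==="
    ./main -m "$MODEL" -ngl 99 -n 128 -p "Hello" 2>&1 | grep "eval time"
done
```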
llama.cpp has a compilation setting `LLAMA_CUDA_MMV_Y` which defaults to 1. However, on an RTX 3090, setting `LLAMA_CUDA_MMV_Y=2` is ~2% faster, and I would expect the setting to also be beneficial for the hardware tested here.
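For anyone reproducing this, the setting can also be applied through CMake; a hedged example, assuming the flag names from the LLAMA_CUBLAS-era CMakeLists:

```sh
# CMake build; flag names assume the LLAMA_CUBLAS-era build system.
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_MMV_Y=2
cmake --build . --config Release
```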