mlc-ai / llm-perf-bench

Apache License 2.0

llama.cpp compilation settings are suboptimal #10

Closed: JohannesGaessler closed this issue 1 year ago

JohannesGaessler commented 1 year ago

llama.cpp has a compilation setting, LLAMA_CUDA_MMV_Y, which defaults to 1. However, on an RTX 3090, setting LLAMA_CUDA_MMV_Y=2 is ~2% faster, and I would expect the setting to also be beneficial for the hardware tested here.

junrushao commented 1 year ago

Is there a page I could refer to for all those compilation options?

junrushao commented 1 year ago

Let's consolidate this into #8. Looking forward to your PR :)

JohannesGaessler commented 1 year ago

The compilation options are listed on the llama.cpp README.

junrushao commented 1 year ago

Understood. I'm personally not an expert in llama.cpp, and it would be great to rely on pros like you to help us find the best values for LLAMA_CUDA_FORCE_DMMV, LLAMA_CUDA_DMMV_X, LLAMA_CUDA_MMV_Y, LLAMA_CUDA_F16, and LLAMA_CUDA_KQUANTS_ITER.

Also CC @zxybazh who contributed the llama.cpp results

zxybazh commented 1 year ago

Thanks @JohannesGaessler for the recommended compilation setting! In the official llama.cpp README, I found that LLAMA_CUDA_MMV_Y is described with "Does not affect k-quants." Since we are using a q4_k_m ggml binary, would you please elaborate on how this option affects our setup?

Also, I'm definitely not an expert in llama.cpp. Could you please point me to a more comprehensive guide for setting the correct compilation flags? Typical test hardware includes the 3090, 3090Ti, 4090, 4090Ti, A10G, A100, and H100 GPUs. This could greatly benefit all CUDA GPU users of llama.cpp.

JohannesGaessler commented 1 year ago

I forgot to update that part of the README. It definitely does affect k-quants.

JohannesGaessler commented 1 year ago

More generally, I don't know for certain the optimal compilation settings for those GPUs because I only have an RTX 3090 and several P40s for testing. That's why I left them as compilation options with relatively conservative defaults that should at least not severely degrade performance on any GPU.
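
Since the best values are not known for each GPU, one way to look for them is to build llama.cpp once per combination of options and benchmark each build. The following is a minimal sketch of how such a sweep could be enumerated. It rests on several assumptions not stated in this thread: that the options are passed as `make` variables, that the CUDA build is enabled with `LLAMA_CUBLAS=1`, and that the candidate values shown are worth trying. The candidate values are purely illustrative, and `build_commands` is a hypothetical helper, not part of llama.cpp.

```python
import itertools
import shlex

# Illustrative candidate values only (an assumption, not a recommendation).
# Check the llama.cpp README for the values each option actually accepts.
CANDIDATES = {
    "LLAMA_CUDA_DMMV_X": [32, 64],
    "LLAMA_CUDA_MMV_Y": [1, 2],
    "LLAMA_CUDA_KQUANTS_ITER": [1, 2],
}


def build_commands(candidates, base=("make", "LLAMA_CUBLAS=1")):
    """Yield one build command line per combination of option values.

    `base` is an assumed invocation for a CUDA-enabled build; adjust it
    to match how llama.cpp is built in your checkout.
    """
    names = sorted(candidates)  # fixed order keeps the output reproducible
    for values in itertools.product(*(candidates[n] for n in names)):
        flags = [f"{name}={value}" for name, value in zip(names, values)]
        yield shlex.join([*base, *flags])


if __name__ == "__main__":
    # Print the commands rather than running them; a real sweep would
    # clean the build tree between configurations and then benchmark.
    for command in build_commands(CANDIDATES):
        print(command)
```

The script only prints the command lines; running each build and timing token generation on the target GPU is left to whatever benchmark harness is already in use.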