mlc-ai / llm-perf-bench

llama.cpp thread parameter is suboptimal #8

Closed JohannesGaessler closed 11 months ago

JohannesGaessler commented 1 year ago

The benchmark container for llama.cpp does not seem to be setting the number of threads manually. As of right now, more than one thread is of no use in llama.cpp when all layers can be offloaded with CUDA (the extra threads only add overhead/CPU load), so manually setting the number of threads to 1 should yield better performance.
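
For illustration, a fully offloaded run along these lines avoids the extra threads (the model path is a placeholder; `-ngl` and `-t` are the standard llama.cpp options for GPU layer offload and CPU thread count):

```bash
# Illustrative invocation, assuming the whole model fits in VRAM.
# -ngl 99 offloads all layers to CUDA; -t 1 limits llama.cpp to a single CPU thread.
./main -m ./models/model.gguf -ngl 99 -t 1 -p "Hello" -n 128
```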

junrushao commented 1 year ago

To be clear, we offload all layers to CUDA in llama.cpp, so could you elaborate on where we would expect multi-threading to be useful? Which compilation option are you referring to? It would be awesome if you'd be willing to submit a PR with a fix.

JohannesGaessler commented 1 year ago

I have some time this weekend, so maybe I'll just fix this in llama.cpp itself since it's a usability issue. Then you'll only need to update the git version.

junrushao commented 11 months ago

@JohannesGaessler Hey, I'd love to follow up with you on performance tuning in llama.cpp's CUDA backend. As you mentioned previously, the default options seem suboptimal in several ways, so what do you think is the best combination of compilation flags to get the best performance out of llama.cpp? Your timely response is much appreciated!

JohannesGaessler commented 11 months ago

That depends on the hardware configuration. On my system with an RTX 3090 I get the best performance with LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=2 LLAMA_CUDA_F16=1, but just test the flags described in the README yourself; there aren't that many. On recent commits -nommq has also become faster, although it increases VRAM usage. In any case, the issue with the thread count has been resolved.
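
As a concrete sketch of that build (flag values taken from the comment above; other GPUs may prefer different settings, per the README):

```bash
# Example build with the flags suggested above for an RTX 3090.
# See the llama.cpp README for the full list of CUDA-related make options.
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=2 LLAMA_CUDA_F16=1
```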