turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Slow inference speed on A100? #346

Closed. jingzhaoou closed this issue 2 weeks ago.

jingzhaoou commented 4 months ago

According to the benchmark info on the project frontpage:

Llama2 EXL2 4.0 bpw 7B - - 164 t/s 197 t/s

I compiled ExLlama V2 from source and ran it on an A100-SXM4-80GB GPU. I got:

Response generated in 8.56 seconds, 1024 tokens, 119.64 tokens/second

which seems quite slow compared with the benchmark numbers.

The text sent to ExLlama V2 is shared here: prompt_llm_proxy_sip_full.txt

The model is turboderp/Llama2-7B-exl2 with revision 4.0bpw. I wonder whether the speed I got is expected or whether I missed some important steps. Your help is highly appreciated.
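
For reference, a tokens/second figure like the one above can be measured with a short timed run. The sketch below follows the pattern of the example scripts in this repo; the model directory, prompt, and sampler settings are placeholder assumptions, not the exact setup used here.

```python
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder: point this at the downloaded turboderp/Llama2-7B-exl2 4.0bpw weights
model_dir = "/path/to/Llama2-7B-exl2-4.0bpw"

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)            # load weights, splitting across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings() # assumed sampler settings, not from the issue
settings.temperature = 0.8
settings.top_p = 0.95

prompt = "Once upon a time,"           # placeholder prompt
max_new_tokens = 1024

generator.warmup()                     # exclude CUDA/kernel init from the measurement
t0 = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens)
elapsed = time.time() - t0

print(f"{max_new_tokens} tokens in {elapsed:.2f} s, "
      f"{max_new_tokens / elapsed:.2f} tokens/second")
```

With a small model on a fast GPU, a single-stream loop like this can end up limited by the host CPU rather than the GPU, which is the point raised in the replies below.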

CyberTimon commented 4 months ago

What CPU do you have? This plays a big role for small models.

turboderp commented 2 weeks ago

Closing this as stale; it's probably outdated anyway, since a lot has been updated since then. But yes, very slow CPUs (especially virtualized ones in server instances) can be a bottleneck. Working on it. (: