
# llm-inference-benchmark

LLM Inference benchmark.

## Inference frameworks

| Framework | Reproducibility**** | Docker Image | API Server | OpenAI API Server | WebUI | Multi Models** | Multi-node | Backends | Embedding Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| text-generation-webui | Low | Yes | Yes | Yes | Yes | No | No | Transformers/llama.cpp/ExLlama/ExLlamaV2/AutoGPTQ/AutoAWQ/GPTQ-for-LLaMa/CTransformers | No |
| OpenLLM | High | Yes | Yes | Yes | No | With BentoML | With BentoML | Transformers (int8/int4/gptq), vLLM (awq/squeezellm), TensorRT | No |
| vLLM* | High | Yes | Yes | Yes | No | No | Yes (with Ray) | vLLM | No |
| Xinference | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/vLLM/TensorRT/GGML | Yes |
| TGI*** | Medium | Yes | Yes | No | No | No | No | Transformers/AutoGPTQ/AWQ/EETQ/vLLM/ExLlama/ExLlamaV2 | No |
| ScaleLLM | Medium | Yes | Yes | Yes | Yes | No | No | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | No |
| FastChat | High | Yes | Yes | Yes | Yes | Yes | Yes | Transformers/AutoGPTQ/AWQ/vLLM/ExLlama/ExLlamaV2 | Yes |
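
Most of the frameworks above expose an OpenAI-compatible endpoint (the "OpenAI API Server" column), so a single client can drive any of them. A minimal sketch with `requests`; the base URL, port, and model name are assumptions to adjust for your deployment:

```python
import requests

# Assumed endpoint: vLLM's OpenAI-compatible server listens on port 8000
# by default; other frameworks work the same way at their own address.
BASE_URL = "http://localhost:8000/v1"
MODEL = "your-model-name"  # placeholder: use the name the server registered

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```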

## Inference backends

| Backend | Device | Compatibility** | PEFT Adapters* | Quantisation | Batching | Distributed Inference | Streaming |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformers | GPU | High | Yes | bitsandbytes (int8/int4), AutoGPTQ (gptq), AutoAWQ (awq) | Yes | accelerate | Yes |
| vLLM | GPU | High | No | awq/squeezellm | Yes | Yes | Yes |
| ExLlamaV2 | GPU/CPU | Low | No | GPTQ | Yes | Yes | Yes |
| TensorRT | GPU | Medium | No | some models | Yes | Yes | Yes |
| Candle | GPU/CPU | Low | No | No | Yes | Yes | Yes |
| CTranslate2 | GPU | Low | No | Yes | Yes | Yes | Yes |
| TGI | GPU | Medium | Yes | awq/eetq/gptq/bitsandbytes | Yes | Yes | Yes |
| llama-cpp*** | GPU/CPU | High | No | GGUF/GPTQ | Yes | No | Yes |
| lmdeploy | GPU | Medium | No | AWQ | Yes | Yes | Yes |
| Deepspeed-FastGen | GPU | Low | No | No | Yes | Yes | Yes |
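
To make the Transformers row concrete (bitsandbytes int8, weight placement via accelerate), here is a minimal sketch; the model id is a stand-in, and bitsandbytes requires a CUDA GPU:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # stand-in: any causal LM from the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 via bitsandbytes
    device_map="auto",  # accelerate decides the device placement
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```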

## Benchmark

- Hardware:
- Software:
- Model:
- Data:

## Backend Benchmark

Column key: TPS = generated tokens per second, QPS = completed queries per second, FTL = first-token latency in milliseconds; @N is the number of concurrent requests. "-" marks configurations for which no result was obtained.

### No Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| text-generation-webui Transformers | 40.39 | 0.15 | 41.47 | 0.21 | 344.61 |
| text-generation-webui Transformers with flash-attention-2 | 58.30 | 0.21 | 43.52 | 0.21 | 341.39 |
| text-generation-webui ExLlamaV2 | 69.09 | 0.26 | 50.71 | 0.27 | 564.80 |
| OpenLLM PyTorch | 60.79 | 0.22 | 44.73 | 0.21 | 514.55 |
| TGI | 192.58 | 0.90 | 59.68 | 0.28 | 82.72 |
| vLLM | 222.63 | 1.08 | 62.69 | 0.30 | 95.43 |
| TensorRT | - | - | - | - | - |
| CTranslate2* | - | - | - | - | - |
| lmdeploy | 236.03 | 1.15 | 67.86 | 0.33 | 76.81 |
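
For reference, a rough sketch of how FTL and TPS can be measured over one streamed request against an OpenAI-compatible server (the endpoint and model name are assumptions, and each SSE chunk is approximated as one token):

```python
import time

import requests

BASE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible endpoint
MODEL = "your-model-name"              # placeholder

def measure_one(prompt: str, max_tokens: int = 256):
    """Return (first-token latency in ms, tokens per second) for one request."""
    start = time.perf_counter()
    first = None
    chunks = 0
    with requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip blank keep-alive lines
            if line == b"data: [DONE]":
                break
            if first is None:
                first = time.perf_counter()  # first streamed token arrived
            chunks += 1  # approximation: one token per SSE chunk
    elapsed = time.perf_counter() - start
    ftl_ms = (first - start) * 1000 if first else float("nan")
    return ftl_ms, chunks / elapsed

print(measure_one("Explain KV caching in two sentences."))
```

QPS at concurrency N then follows from running N such clients in parallel and dividing the number of completed requests by the wall-clock time.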

### 8Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI eetq 8bit | 293.08 | 1.41 | 88.08 | 0.42 | 63.69 |
| TGI GPTQ 8bit | - | - | - | - | - |
| OpenLLM PyTorch AutoGPTQ 8bit | 49.8 | 0.17 | 29.54 | 0.14 | 930.16 |

### 4Bit Quantisation

| Backend | TPS@4 | QPS@4 | TPS@1 | QPS@1 | FTL@1 |
| --- | --- | --- | --- | --- | --- |
| TGI AWQ 4bit | 336.47 | 1.61 | 102.00 | 0.48 | 94.84 |
| vLLM AWQ 4bit | 29.03 | 0.14 | 37.48 | 0.19 | 3711.0 |
| text-generation-webui llama-cpp GGUF 4bit | 67.63 | 0.37 | 56.65 | 0.34 | 331.57 |