
LLM Performance Benchmarking

Performance

All experiments use fp16 activations and compute, decoding 256 tokens from the prompt "What is the meaning of life?". All numbers are measured over PCIe, not NVLink.
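For reference, the tok/sec figures are roughly the number of decoded tokens divided by the decode wall-clock time. A minimal sketch of the arithmetic (the timing value below is made up purely for illustration):

```bash
# Throughput = generated tokens / decode time; DECODE_SECONDS is a made-up measurement.
NUM_TOKENS=256
DECODE_SECONDS=1.37
awk -v n=$NUM_TOKENS -v t=$DECODE_SECONDS 'BEGIN { printf "%.1f tok/sec\n", n / t }'
```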

Single GPU, 4-bit

| Model      | GPU         | MLC LLM (tok/sec) | Exllama V2 (tok/sec) | Llama.cpp (tok/sec) |
|------------|-------------|-------------------|----------------------|---------------------|
| Llama2-7B  | RTX 3090 Ti | 186.7             | 161.67               | 144.93              |
| Llama2-13B | RTX 3090 Ti | 107.4             | 92.11                | 86.65               |
| Llama2-7B  | RTX 4090    | 204.8             | 177.46               | 151.1               |
| Llama2-13B | RTX 4090    | 113.5             | 105.94               | 88.0                |

Multiple NVIDIA GPUs, FP16

| Model         | GPU      | MLC LLM (tok/sec) | Exllama V2 (tok/sec) | Llama.cpp (tok/sec) | vLLM (tok/sec) |
|---------------|----------|-------------------|----------------------|---------------------|----------------|
| Llama2-70B    | A100 x 2 | 17.0              | N/A                  | 10.46               | 15.27          |
| Llama2-70B    | A100 x 4 | 26.6              | N/A                  | 11.07               | 17.64          |
| Llama2-70B    | A100 x 8 | 38.8              | N/A                  | 9.37                | 14.32          |
| Llama2-70B    | A10G x 8 | 21.8              | N/A                  | 6.91                | 13.9           |
| CodeLlama-34B | A10G x 4 | 24.8              | N/A                  | 14.37               | 16.67          |
| CodeLlama-34B | A10G x 8 | 41.3              | N/A                  | 11.83               | 23.5           |

Exllama does not support fp16, hence the N/A entries above.

Multiple NVIDIA GPUs, 4-bit

| Model         | GPU          | MLC LLM (tok/sec) | Exllama (tok/sec) | Llama.cpp (tok/sec) | vLLM (tok/sec) |
|---------------|--------------|-------------------|-------------------|---------------------|----------------|
| Llama2-70B    | A100 x 2     | 40.9              | 32.64             | 17.35               | 21.4           |
| Llama2-70B    | A100 x 4     | 55.8              | 30.36             | 15.45               | 21.36          |
| Llama2-70B    | A100 x 8     | 59.4              | 32.23             | 11.2                | 17.6           |
| Llama2-70B    | A10G x 2     | 19.8              | 13.48             | 11.98               | 12.89          |
| Llama2-70B    | A10G x 4     | 34.3              | 13.48             | 13.37               | 16.91          |
| Llama2-70B    | A10G x 8     | 47.7              | 13.48             | 8.01                | 20.79          |
| Llama2-70B    | RTX 4090 x 2 | 34.5              | 24.39             | 17.55               | 23.8           |
| CodeLlama-34B | A10G x 2     | 38.4              | 25.86             | 21.93               | 23.67          |
| CodeLlama-34B | A10G x 4     | 61.2              | 25.84             | 23.53               | 29.83          |
| CodeLlama-34B | A10G x 8     | 84.2              | 25.82             | 13.25               | N/A            |
| CodeLlama-34B | RTX 4090 x 2 | 64.9              | 45.59             | 31.78               | 26.16          |

Multiple AMD GPUs, 4-bit

| Model         | GPU          | MLC LLM (tok/sec) |
|---------------|--------------|-------------------|
| Llama2-70B    | 7900 XTX x 2 | 29.9              |
| CodeLlama-34B | 7900 XTX x 2 | 56.5              |

Instructions

Prerequisites

GPU Docker. Before proceeding, make sure you have NVIDIA Docker installed for NVIDIA GPUs; follow the NVIDIA Docker Installation Guide for detailed instructions. The commands below verify that containers can access your GPUs:

CUDA:

```bash
docker run --gpus all \
  nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi
```

ROCm:

```bash
docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  rocm/rocm-terminal rocm-smi
```

Repository Setup. Clone the repository, as all subsequent steps assume you are in the repository root:

```bash
git clone https://github.com/mlc-ai/llm-perf-bench
cd llm-perf-bench
```

You are now ready to proceed with the steps below.


MLC LLM

In this section, we use int4 quantized Llama2 as an example.

Step 1. Download the pre-quantized weights from HuggingFace, build the Docker image, and log into the container:

```bash
git lfs install
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-13b-chat-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-70b-chat-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-70b-chat-hf-q0f16
# git clone https://huggingface.co/mlc-ai/mlc-chat-CodeLlama-7b-Instruct-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-CodeLlama-13b-Instruct-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-CodeLlama-34b-Instruct-hf-q4f16_1
# git clone https://huggingface.co/mlc-ai/mlc-chat-CodeLlama-34b-Instruct-hf-q0f16
```
CUDA:

```bash
docker build --no-cache -t llm-perf-mlc:v0.1 \
  -f ./docker/Dockerfile.cu121.mlc .
./docker/bash.sh llm-perf-mlc:v0.1
```

ROCm:

```bash
docker build --no-cache -t llm-perf-mlc:v0.1 \
  -f ./docker/Dockerfile.rocm57.mlc .
./docker/bash.sh --amd llm-perf-mlc:v0.1
```

Step 2. Stay logged in and set some basic environment variables for convenient scripting:

```bash
conda activate python311

MODEL_NAME=Llama-2-7b-chat-hf
QUANTIZATION=q4f16_1
NUM_SHARDS=1
PATH_COMPILE=/tmp/model/
PATH_TEST=/tmp/test/
MODEL_CONFIG=./model_configs/${MODEL_NAME}.json
WEIGHT_PATH=$(pwd)/mlc-chat-${MODEL_NAME}-${QUANTIZATION}/

if [ -e "$WEIGHT_PATH/mlc-chat-config.json" ]; then
  sed -i "/\"num_shards\"/c\ \"num_shards\": ${NUM_SHARDS}," $WEIGHT_PATH/mlc-chat-config.json
else
  echo "Path '$WEIGHT_PATH/mlc-chat-config.json' does not exist."
  exit
fi

rm -rf $PATH_TEST && mkdir $PATH_TEST &&
rm -rf $PATH_COMPILE && mkdir $PATH_COMPILE &&
ln -s ${WEIGHT_PATH} ${PATH_TEST}/params &&
cp $MODEL_CONFIG $PATH_COMPILE/config.json
```
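As a quick sanity check (not part of the repository's own scripts), you can confirm that the `sed` edit above actually updated the shard count:

```bash
# Should print the value you set via NUM_SHARDS, e.g.  "num_shards": 1,
grep '"num_shards"' $WEIGHT_PATH/mlc-chat-config.json
```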

Step 3. Stay logged in and compile the MLC model library. It may take a few seconds:

CUDA:

```bash
python -m mlc_llm.build \
  --model $PATH_COMPILE \
  --artifact-path $PATH_COMPILE \
  --quantization $QUANTIZATION \
  --max-seq-len 2048 \
  --num-shards $NUM_SHARDS \
  --target cuda --use-cuda-graph --build-model-only

mv $PATH_COMPILE/model-${QUANTIZATION}/model-${QUANTIZATION}-cuda.so \
  $PATH_TEST/${MODEL_NAME}-${QUANTIZATION}-cuda.so
```

ROCm:

```bash
python -m mlc_llm.build \
  --model $PATH_COMPILE \
  --artifact-path $PATH_COMPILE \
  --quantization $QUANTIZATION \
  --max-seq-len 2048 \
  --num-shards $NUM_SHARDS \
  --target rocm --build-model-only

mv $PATH_COMPILE/model-${QUANTIZATION}/model-${QUANTIZATION}-rocm.so \
  $PATH_TEST/${MODEL_NAME}-${QUANTIZATION}-rocm.so
```
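Optionally, a quick check that the compiled library was produced and moved as expected (a convenience step, not in the original instructions):

```bash
# Sanity check: the compiled model library should now sit next to the symlinked weights.
ls -lh $PATH_TEST/${MODEL_NAME}-${QUANTIZATION}-*.so
```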

Step 4. Stay logged in and run the benchmark:

CUDA:

```bash
python -m mlc_chat.cli.benchmark \
  --model ${PATH_TEST}/params \
  --device "cuda:0" \
  --prompt "What is the meaning of life?" \
  --generate-length 256
```

ROCm:

```bash
python -m mlc_chat.cli.benchmark \
  --model ${PATH_TEST}/params \
  --device "rocm:0" \
  --prompt "What is the meaning of life?" \
  --generate-length 256
```

Exllama V2

In this section, we use the Llama2 GPTQ model as an example.

Step 1. Download the pre-quantized weights from HuggingFace, build the Docker image, then log into the container and activate the Python environment:

```bash
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-7B-GPTQ
# git clone https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
# git clone https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ

docker build --no-cache -t llm-perf-exllama-v2:v0.1 \
  -f ./docker/Dockerfile.cu121.exllama_v2 .
./docker/bash.sh llm-perf-exllama-v2:v0.1
conda activate python311
```

NOTE. Building the Docker image for Exllama V2 is particularly memory-consuming on certain GPU instances. Kill the build promptly if the machine lags or the screen freezes.

Step 2. Stay logged in and run benchmarking.

For a single GPU:

```bash
MODEL_PATH=/workspace/Llama-2-7B-GPTQ/
OUTPUT_LEN=256
cd /exllamav2
python test_inference.py -m $MODEL_PATH -p "What is the meaning of life?" -t $OUTPUT_LEN
```

For multiple GPUs:

```bash
MODEL_PATH=$(pwd)/Llama-2-7B-GPTQ/
OUTPUT_LEN=256
GPU_SPLIT="17,17"  # depends on how you want to split memory across GPUs (see the example below)
cd /exllamav2
python test_inference.py -m $MODEL_PATH -p "What is the meaning of life?" -gs $GPU_SPLIT -t $OUTPUT_LEN
```
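The `-gs`/`GPU_SPLIT` value is a comma-separated list of how much VRAM (in GB) Exllama may use on each GPU, so the number of entries should match the number of GPUs. A purely hypothetical four-GPU split (adjust the per-card figures to your hardware) would look like:

```bash
# Hypothetical 4-GPU split: one entry per GPU, roughly 17 GB each.
GPU_SPLIT="17,17,17,17"
python test_inference.py -m $MODEL_PATH -p "What is the meaning of life?" -gs $GPU_SPLIT -t $OUTPUT_LEN
```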

Llama.cpp

Step 1. Build Docker image:

```bash
docker build --no-cache -t llm-perf-llama-cpp:v0.1 -f ./docker/Dockerfile.cu121.llama_cpp .
```

Step 2. Download the quantized GGUF models from HuggingFace:

```bash
mkdir -p ./llama_cpp_models
wget -O ./llama_cpp_models/llama-2-7b-chat.Q4_0.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf
wget -O ./llama_cpp_models/llama-2-70b-chat.Q4_0.gguf https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF/resolve/main/llama-2-70b-chat.Q4_0.gguf
wget -O ./llama_cpp_models/codellama-34b.Q4_0.gguf https://huggingface.co/TheBloke/CodeLlama-34B-GGUF/resolve/main/codellama-34b.Q4_0.gguf
# wget -O ./llama_cpp_models/llama-2-13b-chat.Q4_0.gguf https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf
```

Step 3. Log into Docker and run the CLI tool to see the performance numbers. Note: modify the CUDA_VISIBLE_DEVICES setting for experiments with different numbers of GPUs.

```bash
./docker/bash.sh llm-perf-llama-cpp:v0.1
cd /llama.cpp

# Run the quantized Llama-2-7B model on a single GPU.
CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m /workspace/llama_cpp_models/llama-2-7b-chat.Q4_0.gguf -p "What is the meaning of life?" -n 256 -ngl 999 --ignore-eos

# Run the quantized Llama-2-70B model on 2 GPUs.
CUDA_VISIBLE_DEVICES=0,1 ./build/bin/main -m /workspace/llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "What is the meaning of life?" -n 256 -ngl 999 --ignore-eos
```
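The same pattern extends to more GPUs. For instance, a hypothetical 4-GPU run of the 4-bit 70B model changes only the visible-device list:

```bash
# Hypothetical: quantized Llama-2-70B on 4 GPUs; flags are unchanged from the commands above.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/main -m /workspace/llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "What is the meaning of life?" -n 256 -ngl 999 --ignore-eos
```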

Note. For float16 models, stay logged in and first convert the HuggingFace checkpoints (e.g. meta-llama/Llama-2-70b-hf, cloned as in the HuggingFace Transformers section below) to GGUF FP16 format.

```bash
cd /llama.cpp
conda activate python311

# Convert the weights using the llama.cpp script.
python3 convert.py /path/to/Llama-2-70b-hf/ \
  --outfile /workspace/llama_cpp_models/llama-2-70b.fp16.gguf

# Run the fp16 model on 4 GPUs.
CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/main -m /workspace/llama_cpp_models/llama-2-70b.fp16.gguf -p "What is the meaning of life?" -n 256 -ngl 999 --ignore-eos
```

HuggingFace Transformers

Step 1. Build Docker image:

```bash
docker build -t llm-perf-hf:v0.1 -f ./docker/Dockerfile.cu121.hf .
```

Step 2. Download the Llama-2 weights from HuggingFace:

```bash
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
# git clone https://huggingface.co/meta-llama/Llama-2-13b-hf
# git clone https://huggingface.co/meta-llama/Llama-2-70b-hf
```

Step 3. Log into Docker and run the Python script to see the performance numbers. Note: modify the CUDA_VISIBLE_DEVICES setting for experiments with different numbers of GPUs:

```bash
./docker/bash.sh llm-perf-hf:v0.1
conda activate python311

# Run the fp16 Llama-2-7b model on a single GPU.
CUDA_VISIBLE_DEVICES=0 python scripts/benchmark_hf.py --model-path ./Llama-2-7b-hf --format q0f16 --prompt "What is the meaning of life?" --max-new-tokens 256

# Run the int4-quantized Llama-2-70b model on two GPUs.
CUDA_VISIBLE_DEVICES=0,1 python scripts/benchmark_hf.py --model-path ./Llama-2-70b-hf --format q4f16 --prompt "What is the meaning of life?" --max-new-tokens 256
```
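Likewise, scaling to more GPUs only requires extending `CUDA_VISIBLE_DEVICES`; a hypothetical 4-GPU run of the 4-bit 70B model:

```bash
# Hypothetical: int4-quantized Llama-2-70b across 4 GPUs; flags are unchanged from the 2-GPU command above.
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/benchmark_hf.py --model-path ./Llama-2-70b-hf --format q4f16 --prompt "What is the meaning of life?" --max-new-tokens 256
```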

vLLM

In this section, we use the Llama2 fp16 model (and optionally its AWQ-quantized variant) as an example.

Step 1. Download the weights from HuggingFace, build the Docker image, then log into the container and activate the Python environment:

```bash
git lfs install
git clone https://huggingface.co/TheBloke/Llama-2-7B-fp16
# You can also git clone AWQ models, e.g.
# git clone https://huggingface.co/TheBloke/Llama-2-70B-AWQ

docker build --no-cache -t llm-perf-vllm:v0.1 \
  -f ./docker/Dockerfile.cu118.vllm .
./docker/bash.sh llm-perf-vllm:v0.1
conda activate python311
```

Step 2. Modify the benchmark script and run benchmarking.

To bypass the limit on the maximum number of batched tokens, use the following commands to skip argument verification and make the benchmark output more readable:

```bash
sed -i '287s/self._verify_args()/# self._verify_args()/' /vllm/vllm/config.py
sed -i '63i\ print(f"Speed: {args.output_len / np.mean(latencies):.2f} tok/s")' /vllm/benchmarks/benchmark_latency.py
sed -i '64i\ print(f"Speed: {np.mean(latencies)/ args.output_len:.5f} s/tok")' /vllm/benchmarks/benchmark_latency.py
```

To benchmark fp16 performance:

```bash
MODEL_PATH=/workspace/Llama-2-7B-fp16/
OUTPUT_LEN=256
GPU_NUM=1

cd /vllm && python benchmarks/benchmark_latency.py \
  --model $MODEL_PATH \
  --output-len $OUTPUT_LEN \
  --tensor-parallel-size $GPU_NUM \
  --batch-size 1 \
  --input-len 7  # for the prompt "What is the meaning of life?"
```

And for a 4-bit AWQ model:

```bash
MODEL_PATH=/workspace/Llama-2-7B-AWQ/
OUTPUT_LEN=256
GPU_NUM=1

cd /vllm && python benchmarks/benchmark_latency.py \
  --model $MODEL_PATH \
  --output-len $OUTPUT_LEN \
  --tensor-parallel-size $GPU_NUM \
  --batch-size 1 \
  --quantization awq \
  --input-len 7  # for the prompt "What is the meaning of life?"
```
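For multi-GPU fp16 runs of larger models, only the model path and the tensor-parallel size change. A hypothetical 2-GPU Llama2-70B run might look like the following (the `Llama-2-70B-fp16` path is an assumption; clone the corresponding weights first):

```bash
# Hypothetical 2-GPU fp16 run; /workspace/Llama-2-70B-fp16/ is assumed to hold separately cloned 70B weights.
MODEL_PATH=/workspace/Llama-2-70B-fp16/
OUTPUT_LEN=256
GPU_NUM=2

cd /vllm && python benchmarks/benchmark_latency.py \
  --model $MODEL_PATH \
  --output-len $OUTPUT_LEN \
  --tensor-parallel-size $GPU_NUM \
  --batch-size 1 \
  --input-len 7  # for the prompt "What is the meaning of life?"
```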

Setup Details

We are using the following commits: