BitNet on M2-Ultra with T-MAC (LUT-based) vs llama.cpp (dequantization-based)
BitNet and Phi-3.5 tokens/s with # of CPU cores on Surface Laptop 7
10/21/2024: BitNet, powered by T-MAC, is open-sourced.
10/10/2024: By updating and rebasing our llama.cpp version, T-MAC now supports more models (e.g., qwen2), and end-to-end performance is further improved by 10~15%! Try qwen2 using the official GPTQ model.
08/21/2024: The T-MAC paper is accepted by EuroSys 2025.
08/17/2024: T-MAC now supports 1/2/4-bit quantized models of (almost) any architecture in GPTQ format.
08/14/2024: The T-MAC GEMM (N>1) kernels are now integrated into llama.cpp to accelerate prefill. See Prefill speedup for the results.
07/27/2024: We've noted that T-MAC is even faster than the NPU in token generation speed on the latest Snapdragon X Elite chipset! Check Compared to NPU for more details.
07/23/2024: We've enabled the execution of any 2-bit quantized Llama model in GPTQ format via T-MAC! Test it using the pretrained models released by EfficientQAT.
07/22/2024: We've added native deployment support for Windows on ARM. T-MAC demonstrates a substantial 5x speedup on the Surface Laptop 7.
T-MAC is a kernel library to directly support mixed-precision matrix multiplication (int1/2/3/4 x int8/fp16/fp32) without the need for dequantization by utilizing lookup tables. T-MAC aims to boost low-bit LLM inference on CPUs. T-MAC already offers support for various low-bit models, including W4A16 from GPTQ/gguf, W2A16 from BitDistiller/EfficientQAT and W1(.58)A8 from BitNet on OSX/Linux/Windows equipped with ARM/Intel CPUs.
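To make the lookup-table idea concrete, here is a minimal pure-Python sketch of a 1-bit matrix-vector product that uses only table lookups and additions. It is an illustration under simplifying assumptions, not T-MAC's actual kernel: the group size `G = 4`, the +1/-1 weight encoding, and the helper names `build_lut`, `pack_weights`, and `lut_gemv` are all hypothetical, and scales, zero points, and SIMD are omitted.

```python
G = 4  # activations covered by one table (assumed value for illustration)

def build_lut(act_group):
    """Precompute the dot product of one activation group with every
    possible pattern of one-bit weights (bit 1 -> +1, bit 0 -> -1)."""
    lut = []
    for index in range(1 << len(act_group)):
        total = 0.0
        for j, a in enumerate(act_group):
            sign = 1.0 if (index >> j) & 1 else -1.0
            total += sign * a
        lut.append(total)
    return lut

def pack_weights(signs):
    """Pack a row of +1/-1 weights into one table index per group of G.
    The row length is assumed to be a multiple of G."""
    indices = []
    for start in range(0, len(signs), G):
        index = 0
        for j, s in enumerate(signs[start:start + G]):
            if s > 0:
                index |= 1 << j
        indices.append(index)
    return indices

def lut_gemv(packed_rows, activations):
    """Matrix-vector product: one table per activation group, shared by all rows."""
    luts = [build_lut(activations[s:s + G]) for s in range(0, len(activations), G)]
    return [sum(lut[idx] for lut, idx in zip(luts, row)) for row in packed_rows]

def reference_gemv(weight_rows, activations):
    """Plain multiply-add baseline used only to check the lookup result."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weight_rows]

weights = [
    [1, -1, 1, 1, -1, -1, 1, -1],
    [-1, -1, 1, -1, 1, 1, 1, -1],
]
activations = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, 0.25, -2.0]
packed = [pack_weights(row) for row in weights]
result = lut_gemv(packed, activations)
print(result, reference_gemv(weights, activations))
```

The tables are built once per activation vector and reused by every weight row, which is where the savings come from when the weight matrix has many rows.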
T-MAC achieves a token generation throughput of 20 tokens/sec with a single core and 48 tokens/sec with four cores on Surface Laptop 7 for 3B BitNet, which is a 4~5x speedup compared to the SOTA CPU low-bit framework (llama.cpp). T-MAC can even reach 11 tokens/sec on lower-end devices like Raspberry Pi 5.
All of the following data was profiled with llama.cpp b2794 (May 2024). After updating the llama.cpp version, both the latest T-MAC and the baseline are a further 10~15% faster.
We evaluate the token generation performance of different models on five different devices: Surface Laptop 7, Apple M2-Ultra, Jetson AGX Orin, Raspberry Pi 5 and Surface Book 3. Check datasheet for more details.
We evaluate BitNet-3B and Llama-2-7B (W2) with T-MAC 2-bit and llama.cpp Q2_K, and evaluate Llama-2-7B (W4) with T-MAC 4-bit and llama.cpp Q4_0.
In addition to providing a significant speedup, T-MAC can also match the same performance using fewer CPU cores. For instance, to reach 40 tokens/sec, a throughput that greatly surpasses human reading speed, T-MAC only requires 2 cores, while llama.cpp requires 8 cores. On Jetson AGX Orin, to achieve 10 tokens/sec, a throughput that already meets human reading speed, T-MAC only requires 2 cores, while llama.cpp uses all 12 cores. T-MAC can meet real-time requirements on less powerful devices equipped with fewer CPU cores like Raspberry Pi 5. By using fewer cores, T-MAC can reserve computational resources for other applications and significantly reduce power and energy consumption, both of which are crucial for edge devices.
T-MAC achieves a significant speedup with a single thread and needs far fewer CPU cores to reach the same throughput
The throughputs of T-MAC are obtained without fast-aggregation. Users can enable fast-aggregation with the -fa option to achieve an additional speedup of 10%~20%.
The figure above shows that when the model size is increased to 7B-4bit, the multi-threading throughput of llama.cpp on Surface Laptop 7 becomes highly unstable due to the thermal threshold under Better Performance mode. This instability is not observed with T-MAC, as LUT is more energy-efficient compared to multiply-add operations. To establish a more solid baseline, we re-profile the performance under the Best Performance mode:
Maximizing CPU frequency increases the throughput of both T-MAC and llama.cpp
However, under real-world situations, CPUs can't maintain maximum frequency consistently on edge devices. The performance of llama.cpp will degrade as indicated by the results under the Better Performance mode.
TODO: add more results
We have compared the prefill throughput (input_len=256) for Llama-2-7b (W2) on Surface Laptop 7 with two baselines:
Model | NUM_THREADS | Batch Size | T-MAC (tokens/sec) | llama.cpp (OpenBLAS) (tokens/sec) | llama.cpp (tokens/sec) |
---|---|---|---|---|---|
llama-2-7b (W2) | 4 | 256 | 50.1 | 21.5 | 12.0 |
llama-2-7b (W2) | 8 | 256 | 94.4 | 37.7 | 21.3 |
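As a quick worked example, the prefill speedups implied by the table above can be computed directly. The numbers are copied from the table; the dictionary layout and the `speedup` helper are only illustrative.

```python
# NUM_THREADS -> prefill throughput in tokens/sec, copied from the table above
prefill = {
    4: {"T-MAC": 50.1, "llama.cpp (OpenBLAS)": 21.5, "llama.cpp": 12.0},
    8: {"T-MAC": 94.4, "llama.cpp (OpenBLAS)": 37.7, "llama.cpp": 21.3},
}

def speedup(row, baseline):
    """Ratio of T-MAC throughput to the named baseline's throughput."""
    return row["T-MAC"] / row[baseline]

for threads, row in prefill.items():
    for baseline in ("llama.cpp (OpenBLAS)", "llama.cpp"):
        print(f"{threads} threads vs {baseline}: {speedup(row, baseline):.2f}x")
```

On these figures T-MAC is a little over 2x faster than the OpenBLAS backend and a little over 4x faster than the default llama.cpp backend at both thread counts.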
Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU. The following figure shows the speedup compared to llama.cpp for llama-7b kernels during token generation (NUM_THREADS=1):
llama.cpp doesn't provide a 1-bit kernel implementation, so we estimate its 1-bit performance from the 2-bit kernel: the 2/3/4-bit results show that lowering the bit width brings llama.cpp no additional speedup.
Surface stands for Surface Book 3 in this section.
T-MAC can achieve significant speedup for multi-batch (N>1) GEMM due to reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figures show the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):
M2-Ultra is an exception as it is equipped with a specially designed AMX coprocessor to accelerate multi-batch GEMM. However, T-MAC can still achieve comparable performance at 2-bit.
By replacing heavy fused-multiply-add instructions with table lookup instructions, T-MAC significantly reduces power consumption. Combined with the speedup, T-MAC ultimately results in a substantial decrease in total energy consumption.
Multi-threading power/energy consumption on M2-Ultra for three models, M1: Llama-2-7B (W4), M2: Llama-2-7B (W2) and M3: BitNet-3B
Data sampled with powermetrics.
On the latest Snapdragon X Elite chipset, CPU through T-MAC achieves better performance compared to NPU through Qualcomm Snapdragon Neural Processing Engine (NPE).
When deploying the llama-2-7b-4bit model on this chipset, the NPU can only generate 10.4 tokens/sec (according to the data released here), while the CPU using T-MAC can reach 12.6 tokens/sec with two cores, and even up to 22 tokens/sec. Considering that T-MAC's computing performance improves linearly as the number of bits decreases (which is not observable on GPUs and NPUs based on dequantization), T-MAC can even match the NPU with a single-core CPU at 2 bits.
Framework | Model | NUM_THREADS | Throughput (tokens/sec) |
---|---|---|---|
T-MAC (CPU) | llama-2-7b (W4) | 2 | 12.6 |
T-MAC (CPU) | llama-2-7b (W4) | 4 | 18.7 |
T-MAC (CPU) | llama-2-7b (W2) | 1 | 9.3 |
T-MAC (CPU) | llama-2-7b (W2) | 4 | 28.4 |
NPE (NPU) | llama-2-7b (W4) | - | 10.4 |
For a fair comparison, we have aligned our settings with those of the NPU, including an input length of 1024 and an output length of 1024. Although Qualcomm deploys a model of 3.6GB, we deploy a slightly larger model of 3.7GB, because our token embeddings remain unquantized.
By maximizing CPU frequency, T-MAC (CPU) can even get better results. Refer to the discussion in End-2-End speedup.
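The linear scaling with bit width noted in this section can be illustrated with a bit-serial dot product: an n-bit weight is split into n one-bit planes, and each plane costs one pass over the activations. The sketch below is a simplified illustration that assumes unsigned integer weights and integer activations; the helper names are hypothetical, and real kernels also handle scales, zero points, and the table lookups themselves.

```python
def bit_planes(weights, bits):
    """Split unsigned `bits`-bit integer weights into `bits` one-bit planes."""
    return [[(w >> b) & 1 for w in weights] for b in range(bits)]

def bit_serial_dot(weights, activations, bits):
    """Dot product computed one bit plane at a time.

    Returns the result and the number of one-bit passes, which grows
    linearly with the weight bit width.
    """
    total = 0
    passes = 0
    for b, plane in enumerate(bit_planes(weights, bits)):
        # One-bit pass: add the activations whose weight bit is set.
        partial = sum(a for bit, a in zip(plane, activations) if bit)
        total += partial * (1 << b)  # weight the plane by its bit position
        passes += 1
    return total, passes

activations = [5, -2, 7, 4]
print(bit_serial_dot([3, 0, 2, 1], activations, bits=2))
print(bit_serial_dot([13, 7, 0, 9], activations, bits=4))
```

Halving the bit width halves the number of passes, whereas a dequantization-based kernel performs the same multiply-adds regardless of how narrow the stored weights are.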
T-MAC achieves 2-bit mpGEMM performance comparable to the CUDA GPU on Jetson AGX Orin. The CUDA GPU outperforms the CPU on kernels other than mpGEMM, which makes the end-to-end performance of T-MAC (CPU) slightly slower, but T-MAC delivers considerable savings in power and energy consumption.
Framework | Throughput (tokens/sec) | Power (W) | Energy (J/token) |
---|---|---|---|
llama.cpp (CPU) | 7.08 | 15.0 | 2.12 |
llama.cpp (GPU) | 20.03 | 30.8 | 1.54 |
T-MAC (CPU) | 15.62 | 10.4 | 0.66 |
Throughput/power/energy comparison for Llama-2-7B (W2) on NVIDIA Jetson AGX Orin (NUM_THREADS=12 for CPU)
Data sampled with jetson-stats under power mode MAXN.
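The energy column follows from the other two: energy per token (J/token) is power (W) divided by throughput (tokens/sec). A short check using the figures from the table above (the variable and function names are illustrative):

```python
# name -> (throughput in tokens/sec, power in W), copied from the table above
measurements = {
    "llama.cpp (CPU)": (7.08, 15.0),
    "llama.cpp (GPU)": (20.03, 30.8),
    "T-MAC (CPU)": (15.62, 10.4),
}

def energy_per_token(throughput, power):
    """Joules per token: watts (J/sec) divided by tokens/sec."""
    return power / throughput

energy = {name: energy_per_token(*m) for name, m in measurements.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.3f} J/token")
```

The recomputed values agree with the table to within about 0.01 J/token; the small differences are expected because the published throughput and power figures are themselves rounded.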
After that, you can verify the installation with:
python -c "import t_mac; print(t_mac.__version__); from tvm.contrib.clang import find_clang; print(find_clang())"
Currently, we support end-to-end inference through llama.cpp integration.
We have provided an all-in-one script. Invoke it with:
pip install 3rdparty/llama.cpp/gguf-py
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir ${model_dir}
python tools/run_pipeline.py -o ${model_dir}
We also support models in GPTQ format from GPTQModel/EfficientQAT. Try it out with the officially released EfficientQAT (GPTQ format) Llama-3-8b-instruct-w2-g128:
huggingface-cli download ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w2g128-GPTQ --local-dir ${model_dir}
python tools/run_pipeline.py -o ${model_dir} -m llama-3-8b-2bit
Use the -p or -s argument to select the steps you want to run.
Use the -u argument to use our prebuilt kernels for ARM.
Use -m gptq-auto for GPTQ models not in the presets. The kernel shapes and quantization configurations will be automatically detected and validated.
We support mainstream LLMs in GPTQ format (e.g., Llama-2, Llama-3, Mistral, Phi-3-mini). Some models are not yet supported by the conversion script. We welcome contributions from the community.
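If you drive the pipeline from another Python script, the invocation can be assembled as an argument list. This is only a sketch: `pipeline_command` is a hypothetical helper, it uses just the -o, -m, and -u options described above, and it assumes -u is a boolean switch that takes no value.

```python
import shlex
import sys

def pipeline_command(model_dir, model=None, use_prebuilt=False):
    """Build the argument list for tools/run_pipeline.py (hypothetical helper)."""
    cmd = [sys.executable, "tools/run_pipeline.py", "-o", model_dir]
    if model is not None:
        cmd += ["-m", model]  # a preset name, or gptq-auto for other GPTQ models
    if use_prebuilt:
        cmd.append("-u")  # assumed to be a flag without a value
    return cmd

cmd = pipeline_command("/path/to/model", model="gptq-auto", use_prebuilt=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that the relative path to tools/run_pipeline.py resolves.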
An example output:
Running STEP.0: Compile kernels
Running command in /Users/user/jianyu/T-MAC/deploy:
python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m hf-bitnet-3b -r
Running STEP.1: Build T-MAC C++ CMakeFiles
Running command in /Users/user/jianyu/T-MAC/build:
cmake -DCMAKE_INSTALL_PREFIX=/Users/user/jianyu/T-MAC/install ..
Running STEP.2: Install T-MAC C++
Running command in /Users/user/jianyu/T-MAC/build:
cmake --build . --target install --config Release
Running STEP.3: Convert HF to GGUF
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp:
python convert-hf-to-gguf-t-mac.py /Users/user/Downloads/test_models/hf-bitnet-3B --outtype i2 --outfile /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf --kcfg /Users/user/jianyu/T-MAC/install/lib/kcfg.ini
Running STEP.4: Build llama.cpp CMakeFiles
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
cmake .. -DLLAMA_TMAC=ON -DCMAKE_PREFIX_PATH=/Users/user/jianyu/T-MAC/install/lib/cmake/t-mac -DCMAKE_BUILD_TYPE=Release -DLLAMA_LLAMAFILE_DEFAULT=OFF -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
Running STEP.5: Build llama.cpp
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
cmake --build . --target main --config Release
Running STEP.6: Run inference
Running command in /Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build:
/Users/user/jianyu/T-MAC/3rdparty/llama.cpp/build/bin/main -m /Users/user/Downloads/test_models/hf-bitnet-3B/ggml-model.i2.gguf -n 128 -t 4 -p Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. -b 1 -ngl 0 -c 2048
Check logs/2024-07-15-17-10-11.log for inference output
Please note that main is used here to demo token generation output. Use 3rdparty/llama.cpp/build/bin/llama-bench to benchmark performance. A benchmark script is also provided at tools/bench_e2e.py.
Check T-MAC v1.0.0 release plan for upcoming features.
LLM inference incurs significant computational cost. Low-bit quantization, a widely adopted technique, introduces the challenge of mixed-precision GEMM (mpGEMM), which is not directly supported by hardware and requires convert/dequant operations.
We propose the use of a lookup table (LUT) to support mpGEMM. Our method involves the following key techniques:
Our method exhibits several notable characteristics:
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{wei2024tmaccpurenaissancetable,
title={T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge},
author={Jianyu Wei and Shijie Cao and Ting Cao and Lingxiao Ma and Lei Wang and Yanyong Zhang and Mao Yang},
year={2024},
eprint={2407.00088},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2407.00088},
}