C++ implementation of ChatGLM-6B, ChatGLM2-6B, ChatGLM3 and GLM-4 for real-time chatting on your MacBook.
Highlights:
Support Matrix:
Preparation
Clone the ChatGLM.cpp repository into your local machine:
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:
git submodule update --init --recursive
Quantize Model
Install necessary packages for loading and quantizing Hugging Face models:
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
Use convert.py to transform ChatGLM-6B into the quantized GGML format. For example, to convert the original fp16 model to a q4_0 (quantized int4) GGML model, run:
python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o models/chatglm-ggml.bin
The original model (-i <model_name_or_path>) can be a Hugging Face model name or a local path to your pre-downloaded model. Currently supported models are:
THUDM/chatglm-6b, THUDM/chatglm-6b-int8, THUDM/chatglm-6b-int4
THUDM/chatglm2-6b, THUDM/chatglm2-6b-int4
THUDM/chatglm3-6b
THUDM/glm-4-9b-chat
THUDM/codegeex2-6b, THUDM/codegeex2-6b-int4
You are free to try any of the below quantization types by specifying -t <type>:
type | precision | symmetric |
---|---|---|
q4_0 | int4 | true |
q4_1 | int4 | false |
q5_0 | int5 | true |
q5_1 | int5 | false |
q8_0 | int8 | true |
f16 | half | |
f32 | float | |
For LoRA models, add the -l <lora_model_name_or_path> flag to merge your LoRA weights into the base model. For example, run python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o models/chatglm3-ggml-lora.bin -l shibing624/chatglm3-6b-csc-chinese-lora to merge public LoRA weights from Hugging Face.
For P-Tuning v2 models using the official finetuning script, additional weights are automatically detected by convert.py. If past_key_values is on the output weight list, the P-Tuning checkpoint is successfully converted.
Build & Run
Compile the project using CMake:
cmake -B build
cmake --build build -j --config Release
Now you may chat with the quantized ChatGLM-6B model by running:
./build/bin/main -m models/chatglm-ggml.bin -p 你好
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
To run the model in interactive mode, add the -i flag. For example:
./build/bin/main -m models/chatglm-ggml.bin -i
In interactive mode, your chat history will serve as the context for the next-round conversation.
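The idea of chat history serving as context can be sketched in a few lines of Python. This is a conceptual illustration, not the code behind the -i flag: the generate callable and the echo_model stand-in are hypothetical, and a real model call would take their place.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_round(history: List[Message], user_text: str,
               generate: Callable[[List[Message]], str]) -> str:
    """Append the user turn, generate from the full history, record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_model(history: List[Message]) -> str:
    # Hypothetical stand-in for a model: reports how many user turns it can see.
    turns = sum(1 for m in history if m["role"] == "user")
    return f"seen {turns} user turn(s)"


history: List[Message] = []
print(chat_round(history, "hello", echo_model))      # seen 1 user turn(s)
print(chat_round(history, "and again", echo_model))  # seen 2 user turn(s)
```

Because the whole list is passed on every round, each reply is conditioned on all earlier turns, which also means the prompt grows with the length of the conversation.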
Run ./build/bin/main -h to explore more options!
Try Other Models
A BLAS library can be integrated to further accelerate matrix multiplication. However, in some cases BLAS may degrade performance, so decide whether to enable it based on benchmark results for your own hardware and model.
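One way to compare a build with BLAS against a build without it is to time the same workload several times and compare medians. The helper below is a generic sketch using only the standard library. The function names are hypothetical, and what each callable runs (for example, a subprocess call to the main binary from two separate build folders) is left to you.

```python
import statistics
import time


def median_seconds(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock duration."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def faster_of(candidates, repeats=5):
    """Given {name: callable}, return (name, seconds) of the fastest candidate."""
    timings = {name: median_seconds(fn, repeats) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings[best]
```

The median is used rather than the mean so that a single slow run, such as a cold file cache on the first invocation, does not dominate the comparison.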
Accelerate Framework
Accelerate Framework is automatically enabled on macOS. To disable it, add the CMake flag -DGGML_NO_ACCELERATE=ON.
OpenBLAS
OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.
cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j
CUDA
CUDA accelerates model inference on NVIDIA GPUs. Add the CMake flag -DGGML_CUDA=ON to enable it.
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
By default, all kernels are compiled for all possible CUDA architectures, which takes some time. To run on a specific type of device, you may specify CMAKE_CUDA_ARCHITECTURES to speed up the nvcc compilation. For example:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80" # for A100
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="70;75" # compatible with both V100 and T4
To find out the CUDA architecture of your GPU device, see Your GPU Compute Capability.
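The architecture value is the compute capability with the dot removed, as the A100 (8.0), V100 (7.0), and T4 (7.5) examples above show. A small helper, hypothetical and not part of this repository, can build the flag from a list of compute capabilities:

```python
def cuda_architectures_flag(capabilities):
    """Turn compute capabilities such as "7.5" into a CMAKE_CUDA_ARCHITECTURES flag."""
    archs = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"{int(major)}{int(minor)}"
        if arch not in archs:           # keep order, drop duplicates
            archs.append(arch)
    return '-DCMAKE_CUDA_ARCHITECTURES="' + ";".join(archs) + '"'


print(cuda_architectures_flag(["8.0"]))         # -DCMAKE_CUDA_ARCHITECTURES="80"
print(cuda_architectures_flag(["7.0", "7.5"]))  # -DCMAKE_CUDA_ARCHITECTURES="70;75"
```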
Metal
MPS (Metal Performance Shaders) allows computation to run on the powerful Apple Silicon GPU. Add the CMake flag -DGGML_METAL=ON to enable it.
cmake -B build -DGGML_METAL=ON && cmake --build build -j
The Python binding provides high-level chat and stream_chat interfaces similar to the original Hugging Face ChatGLM(2)-6B.
Installation
Install from PyPI (recommended). This will trigger compilation on your platform.
pip install -U chatglm-cpp
To enable CUDA on NVIDIA GPU:
CMAKE_ARGS="-DGGML_CUDA=ON" pip install -U chatglm-cpp
To enable Metal on Apple silicon devices:
CMAKE_ARGS="-DGGML_METAL=ON" pip install -U chatglm-cpp
You may also install from source. Add the corresponding CMAKE_ARGS for acceleration.
# install from the latest source hosted on GitHub
pip install git+https://github.com/li-plus/chatglm.cpp.git@main
# or install from your local source after git cloning the repo
pip install .
Pre-built wheels for the CPU backend on Linux / macOS / Windows are published on release. For CUDA / Metal backends, please compile from source code or the source distribution.
Using Pre-converted GGML Models
Here is a simple demo that uses chatglm_cpp.Pipeline to load the GGML model and chat with it. First enter the examples folder (cd examples) and launch a Python interactive shell:
>>> import chatglm_cpp
>>>
>>> pipeline = chatglm_cpp.Pipeline("../models/chatglm-ggml.bin")
>>> pipeline.chat([chatglm_cpp.ChatMessage(role="user", content="你好")])
ChatMessage(role="assistant", content="你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。", tool_calls=[])
To chat in streaming mode, run the Python example below:
python3 cli_demo.py -m ../models/chatglm-ggml.bin -i
Launch a web demo to chat in your browser:
python3 web_demo.py -m ../models/chatglm-ggml.bin
For other models:
Converting Hugging Face LLMs at Runtime
Sometimes it might be inconvenient to convert and save the intermediate GGML models beforehand. Here is an option to directly load from the original Hugging Face model, quantize it into GGML models in a minute, and start serving. All you need is to replace the GGML model path with the Hugging Face model name or path.
>>> import chatglm_cpp
>>>
>>> pipeline = chatglm_cpp.Pipeline("THUDM/chatglm-6b", dtype="q4_0")
Loading checkpoint shards: 100%|██████████████████████████████████| 8/8 [00:10<00:00, 1.27s/it]
Processing model states: 100%|████████████████████████████████| 339/339 [00:23<00:00, 14.73it/s]
...
>>> pipeline.chat([chatglm_cpp.ChatMessage(role="user", content="你好")])
ChatMessage(role="assistant", content="你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。", tool_calls=[])
Likewise, replace the GGML model path with Hugging Face model in any example script, and it just works. For example:
python3 cli_demo.py -m THUDM/chatglm-6b -p 你好 -i
We support various kinds of API servers to integrate with popular frontends. Extra dependencies can be installed by:
pip install 'chatglm-cpp[api]'
Remember to add the corresponding CMAKE_ARGS to enable acceleration.
LangChain API
Start the API server for LangChain:
MODEL=./models/chatglm2-ggml.bin uvicorn chatglm_cpp.langchain_api:app --host 127.0.0.1 --port 8000
Test the API endpoint with curl:
curl http://127.0.0.1:8000 -H 'Content-Type: application/json' -d '{"prompt": "你好"}'
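The same request can be sent from Python with only the standard library. The sketch below mirrors the curl call above (same URL, header, and JSON body). The function names are hypothetical, and the response is returned as parsed JSON without assuming which fields the server includes.

```python
import json
import urllib.request


def build_prompt_request(endpoint_url, prompt):
    """Build a POST request carrying {"prompt": ...} as a JSON body."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_prompt_request("http://127.0.0.1:8000", "你好")
# With the server from the previous step running, call: send(request)
```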
Run with LangChain:
>>> from langchain.llms import ChatGLM
>>>
>>> llm = ChatGLM(endpoint_url="http://127.0.0.1:8000")
>>> llm.predict("你好")
'你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。'
For more options, please refer to examples/langchain_client.py and LangChain ChatGLM Integration.
OpenAI API
Start an API server compatible with OpenAI chat completions protocol:
MODEL=./models/chatglm3-ggml.bin uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000
Test your endpoint with curl:
curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages": [{"role": "user", "content": "你好"}]}'
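If you call the endpoint over raw HTTP rather than through a client library, the reply text has to be pulled out of the JSON body yourself. The sketch below assumes the response follows the standard OpenAI chat completions layout (choices, then message, then content). The helper names are hypothetical, and the sample body is hand-written for illustration, not captured from a running server.

```python
import json


def build_chat_payload(history, user_text):
    """Return a request body with the prior messages plus the new user turn."""
    messages = list(history) + [{"role": "user", "content": user_text}]
    return {"messages": messages}


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Hand-written sample body in the OpenAI chat completions shape.
sample = json.dumps(
    {"choices": [{"index": 0, "message": {"role": "assistant", "content": "hi"}}]}
)
print(build_chat_payload([], "你好"))
print(extract_reply(sample))  # hi
```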
Use the OpenAI client to chat with your model:
>>> from openai import OpenAI
>>>
>>> client = OpenAI(base_url="http://127.0.0.1:8000/v1")
>>> response = client.chat.completions.create(model="default-model", messages=[{"role": "user", "content": "你好"}])
>>> response.choices[0].message.content
'你好👋!我是人工智能助手 ChatGLM3-6B,很高兴见到你,欢迎问我任何问题。'
For stream response, check out the example client script:
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 python3 examples/openai_client.py --stream --prompt 你好
Tool calling is also supported:
OPENAI_BASE_URL=http://127.0.0.1:8000/v1 python3 examples/openai_client.py --tool_call --prompt 上海天气怎么样
With this API server as backend, ChatGLM.cpp models can be seamlessly integrated into any frontend that uses OpenAI-style API, including mckaywrigley/chatbot-ui, fuergaosi233/wechat-chatgpt, Yidadaa/ChatGPT-Next-Web, and more.
Option 1: Building Locally
Build the Docker image locally and start a container to run inference on CPU:
docker build . --network=host -t chatglm.cpp
# cpp demo
docker run -it --rm -v $PWD/models:/chatglm.cpp/models chatglm.cpp ./build/bin/main -m models/chatglm-ggml.bin -p "你好"
# python demo
docker run -it --rm -v $PWD/models:/chatglm.cpp/models chatglm.cpp python3 examples/cli_demo.py -m models/chatglm-ggml.bin -p "你好"
# langchain api server
docker run -it --rm -v $PWD/models:/chatglm.cpp/models -p 8000:8000 -e MODEL=models/chatglm-ggml.bin chatglm.cpp \
uvicorn chatglm_cpp.langchain_api:app --host 0.0.0.0 --port 8000
# openai api server
docker run -it --rm -v $PWD/models:/chatglm.cpp/models -p 8000:8000 -e MODEL=models/chatglm-ggml.bin chatglm.cpp \
uvicorn chatglm_cpp.openai_api:app --host 0.0.0.0 --port 8000
For CUDA support, make sure nvidia-docker is installed. Then run:
docker build . --network=host -t chatglm.cpp-cuda \
--build-arg BASE_IMAGE=nvidia/cuda:12.2.0-devel-ubuntu20.04 \
--build-arg CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80"
docker run -it --rm --gpus all -v $PWD/models:/chatglm.cpp/models chatglm.cpp-cuda \
./build/bin/main -m models/chatglm-ggml.bin -p "你好"
Option 2: Using Pre-built Image
The pre-built image for CPU inference is published on both Docker Hub and GitHub Container Registry (GHCR).
To pull from Docker Hub and run demo:
docker run -it --rm -v $PWD/models:/chatglm.cpp/models liplusx/chatglm.cpp:main \
./build/bin/main -m models/chatglm-ggml.bin -p "你好"
To pull from GHCR and run demo:
docker run -it --rm -v $PWD/models:/chatglm.cpp/models ghcr.io/li-plus/chatglm.cpp:main \
./build/bin/main -m models/chatglm-ggml.bin -p "你好"
The Python demo and API servers are also supported in the pre-built image. Use it in the same way as Option 1.
Environment:
ChatGLM-6B:
| | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16 |
---|---|---|---|---|---|---|
ms/token (CPU @ Platinum 8260) | 74 | 77 | 86 | 89 | 114 | 189 |
ms/token (CUDA @ V100 SXM2) | 8.1 | 8.7 | 9.4 | 9.5 | 12.0 | 19.1 |
ms/token (MPS @ M2 Ultra) | 11.5 | 12.3 | N/A | N/A | 16.1 | 24.4 |
file size | 3.3G | 3.7G | 4.0G | 4.4G | 6.2G | 12G |
mem usage | 4.0G | 4.4G | 4.7G | 5.1G | 6.9G | 13G |
ChatGLM2-6B / ChatGLM3-6B / CodeGeeX2:
| | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16 |
---|---|---|---|---|---|---|
ms/token (CPU @ Platinum 8260) | 64 | 71 | 79 | 83 | 106 | 189 |
ms/token (CUDA @ V100 SXM2) | 7.9 | 8.3 | 9.2 | 9.2 | 11.7 | 18.5 |
ms/token (MPS @ M2 Ultra) | 10.0 | 10.8 | N/A | N/A | 14.5 | 22.2 |
file size | 3.3G | 3.7G | 4.0G | 4.4G | 6.2G | 12G |
mem usage | 3.4G | 3.8G | 4.1G | 4.5G | 6.2G | 12G |
ChatGLM4-9B:
| | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16 |
---|---|---|---|---|---|---|
ms/token (CPU @ Platinum 8260) | 105 | 105 | 122 | 134 | 158 | 279 |
ms/token (CUDA @ V100 SXM2) | 12.1 | 12.5 | 13.8 | 13.9 | 17.7 | 27.7 |
ms/token (MPS @ M2 Ultra) | 14.4 | 15.3 | 19.6 | 20.1 | 20.7 | 32.4 |
file size | 5.0G | 5.5G | 6.1G | 6.6G | 9.4G | 18G |
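The ms/token figures in the tables above convert directly into throughput and expected generation time. The helpers below are plain arithmetic with hypothetical names, and the inputs are taken from the ChatGLM-6B Q4_0 column.

```python
def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds into tokens per second."""
    return 1000.0 / ms_per_token


def generation_seconds(ms_per_token, num_tokens):
    """Estimate the time to generate num_tokens at a fixed per-token latency."""
    return ms_per_token * num_tokens / 1000.0


print(round(tokens_per_second(74), 1))    # CPU, Q4_0: 13.5 tokens/s
print(round(tokens_per_second(8.1), 1))   # CUDA, Q4_0: 123.5 tokens/s
print(generation_seconds(74, 256))        # seconds for a 256-token reply on CPU
```

These estimates cover token generation only and ignore prompt processing time, which the tables do not report.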
We measure model quality by evaluating the perplexity over the WikiText-2 test dataset, following the strided sliding window strategy in https://huggingface.co/docs/transformers/perplexity. Lower perplexity usually indicates a better model.
Download and unzip the dataset from link. Measure the perplexity with a stride of 512 and max input length of 2048:
./build/bin/perplexity -m models/chatglm3-base-ggml.bin -f wikitext-2-raw/wiki.test.raw -s 512 -l 2048
| | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16 |
---|---|---|---|---|---|---|
ChatGLM3-6B-Base | 6.215 | 6.188 | 6.006 | 6.022 | 5.971 | 5.972 |
ChatGLM4-9B-Base | 6.834 | 6.780 | 6.645 | 6.624 | 6.576 | 6.577 |
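The strided sliding window strategy from the Hugging Face guide can be sketched as follows. Each window spans at most the max input length, windows advance by the stride, and only the tokens not scored by the previous window count toward the loss. This is a Python illustration of the windowing and the final exponentiation, not the code of the perplexity binary, and the function names are hypothetical.

```python
import math


def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, target_len) for each evaluation window."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only the tokens beyond the previous window's end are newly scored.
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


for window in sliding_windows(seq_len=3000, max_length=2048, stride=512):
    print(window)
# (0, 2048, 2048)
# (512, 2560, 512)
# (1024, 3000, 440)
```

Every token is scored exactly once (the target lengths sum to the sequence length), and after the first window each scored token has at least max_length minus stride tokens of preceding context.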
Unit Test & Benchmark
To run the unit tests, add the CMake flag -DCHATGLM_ENABLE_TESTING=ON, then recompile and run the tests (benchmarks included):
mkdir -p build && cd build
cmake .. -DCHATGLM_ENABLE_TESTING=ON && make -j
./bin/chatglm_test
For benchmark only:
./bin/chatglm_test --gtest_filter='Benchmark.*'
Lint
To format the code, run make lint inside the build folder. You should have clang-format, black and isort pre-installed.
Performance
To detect the performance bottleneck, add the CMake flag -DGGML_PERF=ON:
cmake .. -DGGML_PERF=ON && make -j
This will print timing for each graph operation when running the model.