
KsanaLLM

English | 中文

About

KsanaLLM is a high performance and easy-to-use engine for LLM inference and serving.

High Performance and Throughput:

Flexibility and Ease of Use:

KsanaLLM seamlessly supports many Hugging Face models, including the following verified models:

Supported Hardware

Usage

1. Create Docker container and runtime environment

1.1 For Nvidia GPU

# requires nvidia-docker; install the NVIDIA Container Toolkit from https://github.com/NVIDIA/nvidia-container-toolkit
sudo nvidia-docker run -itd --network host --privileged \
    nvcr.io/nvidia/pytorch:24.03-py3 bash
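# the container starts detached; the commands below are assumed to run inside it.
# attach a shell, e.g. (use the container id/name reported by `docker ps`):
sudo docker exec -it <container_id> bash
# requirements.txt is part of the KsanaLLM source tree cloned in step 2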
pip install -r requirements.txt
# git-lfs is needed to download Hugging Face models
apt update && apt install git-lfs -y

1.2 For Huawei Ascend NPU

Use the MindIE image from https://ascendhub.huawei.com/#/detail/mindie (version: 1.0.RC1-800I-A2-aarch64).

2. Clone source code

git clone --recurse-submodules https://github.com/pcg-mlp/KsanaLLM
export GIT_PROJECT_REPO_ROOT=`pwd`/KsanaLLM
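
If the repository was cloned without --recurse-submodules, the submodules can still be fetched afterwards:

cd ${GIT_PROJECT_REPO_ROOT}
git submodule update --init --recursive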

3. Compile

cd ${GIT_PROJECT_REPO_ROOT}
mkdir build && cd build

3.1 For Nvidia

# The SM (compute capability) of the A10 is 86; change it when using other GPUs.
# refer to: https://developer.nvidia.cn/cuda-gpus
cmake -DSM=86 -DWITH_TESTING=ON .. && make -j32
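
If you are unsure of the SM number, recent NVIDIA drivers can report the compute capability directly (assuming your nvidia-smi version supports the compute_cap query field); multiply by 10 to get the -DSM value, e.g. 8.6 becomes 86:

nvidia-smi --query-gpu=compute_cap --format=csv,noheader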

3.2 For Huawei Ascend NPU

cmake -DWITH_TESTING=ON -DWITH_CUDA=OFF -DWITH_ACL=ON .. && make -j32
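
If CMake cannot locate the Ascend toolchain, you may first need to source the CANN environment (the path below is the typical toolkit install location and may differ in your container):

source /usr/local/Ascend/ascend-toolkit/set_env.sh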

4. Run

cd ${GIT_PROJECT_REPO_ROOT}/src/ksana_llm/python
ln -s ${GIT_PROJECT_REPO_ROOT}/build/lib .

# download a Hugging Face model, for example:
git clone https://huggingface.co/NousResearch/Llama-2-7b-hf

# change the model_dir in ${GIT_PROJECT_REPO_ROOT}/examples/ksana_llm2-7b.yaml if needed

# set the environment variable `NLLM_LOG_LEVEL=DEBUG` before running to get more detailed logs
# the serving log is written to log/ksana_llm.log

# tensor_para_size in ${GIT_PROJECT_REPO_ROOT}/examples/ksana_llm2-7b.yaml must equal the number of visible GPUs/NPUs
export CUDA_VISIBLE_DEVICES=xx
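# for example, CUDA_VISIBLE_DEVICES=0,1 together with tensor_para_size: 2 in the yaml
# (an assumed configuration) runs the model with tensor parallelism across two GPUs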

# launch server
python serving_server.py \
    --config_file ${GIT_PROJECT_REPO_ROOT}/examples/ksana_llm2-7b.yaml \
    --port 8080

Inference test with a one-shot conversation

# open another session
cd ${GIT_PROJECT_REPO_ROOT}/examples/llama7b
python serving_generate_client.py --port 8080

Inference test with forward (single-round inference without generation sampling)

python serving_forward_client.py --port 8080

5. Distribute

cd ${GIT_PROJECT_REPO_ROOT}

# build a distributable wheel
python setup.py bdist_wheel
# install wheel
pip install dist/ksana_llm-0.1-*-linux_x86_64.whl

# verify the installation
pip show -f ksana_llm
python -c "import ksana_llm"

6. Optional

6.1 Model Weight Map

You can include an optional weight map JSON file for models that share the same structure as the Llama model but have different weight names.

For more detailed information, refer to the Optional Weight Map Guide.
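
As a purely hypothetical illustration of the idea (the actual schema and key names are defined in the Optional Weight Map Guide), such a file maps the checkpoint's own weight names onto the Llama weight names the engine expects:

{
    "transformer.wte.weight": "model.embed_tokens.weight",
    "transformer.h.0.attn.q_proj.weight": "model.layers.0.self_attn.q_proj.weight"
}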

6.2 Plugin

Custom plugins can perform special pre-processing and post-processing. To enable one, place ksana_plugin.py in the model directory (see the Example). A minimal sketch follows.
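
The sketch below is hypothetical only; the class and method names are assumptions, so follow the linked Example for the real interface. The general shape is a plugin class exposing pre- and post-processing hooks:

# ksana_plugin.py (hypothetical sketch, not the verified interface)
class KsanaPlugin:
    def init_plugin(self, **kwargs):
        # one-time setup, e.g. loading an auxiliary preprocessor model
        pass

    def preprocess(self, **kwargs):
        # transform the incoming request before inference
        return kwargs

    def postprocess(self, **kwargs):
        # transform the model output before it is returned to the client
        return kwargs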

7. Contact Us

WeChat Group