`ppl.llm.serving` is a part of the `PPL.LLM` system.

We recommend that users who are new to this project read the Overview of the system.

`ppl.llm.serving` is a serving framework based on `ppl.nn` for various Large Language Models (LLMs). This repository contains a server based on gRPC and inference support for LLaMA.

Here is a brief tutorial; refer to the LLaMA Guide for more details.
## Installing Prerequisites (on Debian or Ubuntu, for example)

```bash
apt-get install build-essential cmake git
```
## Cloning Source Code

```bash
git clone https://github.com/openppl-public/ppl.llm.serving.git
```
## Building from Source

```bash
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"
```

NCCL is required if multiple GPU devices are used.
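Since NCCL is only needed when multiple GPU devices are used, a single-GPU build can presumably drop it. The following is a minimal sketch, assuming the CMake switches shown above accept the usual `OFF` value; everything else is kept exactly as in the original command:

```bash
# Hypothetical single-GPU variant of the build command above, with NCCL disabled.
# All other options are copied unchanged from the command shown earlier.
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=OFF -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"
```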
## Exporting Models

Refer to ppl.pmx for details.
## Running Server

```bash
./ppl-build/ppl_llama_server /path/to/server/config.json
```

Server config examples can be found in `src/models/llama/conf`. You are expected to give the correct values before running the server:

- `model_dir`: path of the models exported by ppl.pmx.
- `model_param_path`: params of the models, i.e. `$model_dir/params.json`.
- `tokenizer_path`: tokenizer files for sentencepiece.
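For orientation, here is a minimal, hypothetical sketch of such a config showing only the three fields described above. All paths are placeholders, and the real examples in `src/models/llama/conf` contain additional fields that are not listed here:

```bash
# Hypothetical minimal server config: only the three fields described above,
# with placeholder paths. Check src/models/llama/conf for complete examples.
cat > /path/to/server/config.json <<'EOF'
{
    "model_dir": "/path/to/model/exported/by/ppl.pmx",
    "model_param_path": "/path/to/model/exported/by/ppl.pmx/params.json",
    "tokenizer_path": "/path/to/tokenizer.model"
}
EOF
```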
## Running Client

Send requests through gRPC to query the model:

```bash
./ppl-build/client_sample 127.0.0.1:23333
```

See `tools/client_sample.cc` for more details.
## Benchmarking

```bash
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf
```

See `tools/client_qps_measure.cc` for more details. `--request_rate` is the number of requests per second, and the value `inf` means sending all client requests with no interval.
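As an illustrative variant, the benchmark can presumably be throttled by passing a finite rate instead of `inf`; the value `4` below is an arbitrary assumption:

```bash
# Hypothetical throttled run of the benchmark command above:
# --request_rate=4 should issue roughly 4 requests per second instead of all at once.
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=4
```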
## Running Inference Offline

```bash
./ppl-build/offline_inference /path/to/server/config.json
```

See `tools/offline_inference.cc` for more details.
## License

This project is distributed under the Apache License, Version 2.0.