MuxServe is an efficient system for serving multiple LLMs with flexible spatial-temporal multiplexing. MuxServe colocates LLMs according to their popularity to multiplex memory resources, and disaggregates and flexibly colocates prefill and decoding phases, leveraging their distinct characteristics to multiplex computation resources.
In recent years, large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases such as chat, programming, and search. Efficiently serving multiple LLMs poses significant challenges because the models vary in both size and popularity.
MuxServe aims to serve multiple LLMs efficiently with flexible spatial-temporal multiplexing. The key insight is to colocate LLMs according to their popularity to multiplex memory resources, and to disaggregate and flexibly colocate prefill and decoding phases, leveraging their distinct characteristics to multiplex computation resources.
MuxServe uses vLLM as the default inference engine. Please follow the instructions below to install our modified MuxServe-vLLM from source:
conda create -n muxserve python=3.9
conda activate muxserve
git clone https://github.com/EfficientLLMSys/MuxServe-vLLM.git
cd MuxServe-vLLM
pip install -e . # This may take 5-10 minutes.
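# Then install MuxServe itself: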
cd ..  # return to the parent directory before cloning MuxServe
git clone https://github.com/EfficientLLMSys/MuxServe.git
cd MuxServe
pip install -e .
pip install -r requirements.txt
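A quick way to confirm the installation succeeded (assuming both packages expose the top-level modules muxserve and vllm) is an import check:

python -c "import muxserve, vllm; print('MuxServe environment OK')"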
We start with a simple example of serving multiple LLMs offline with MuxServe (examples/basic).
The config file is provided in examples/basic/model_config.yaml. You should change the model checkpoint path inside the file from /mnt/afs/share/LLMCKPTs/huggyllama/llama-30b to yourpath/to/llama-30b.
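For example, the following in-place substitution (with yourpath/to/llama-30b replaced by your actual checkpoint directory) makes the change without opening an editor:

sed -i 's|/mnt/afs/share/LLMCKPTs/huggyllama/llama-30b|yourpath/to/llama-30b|g' examples/basic/model_config.yaml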
We sample a workload from the ShareGPT_V3 dataset according to the workloads defined in examples/basic/models.yaml. The rate field specifies the arrival rate of each model in req/s. We can generate the workload file examples/basic/sharedgpt_n3_rate_12_5_3.json with the following command:
python muxserve/muxsched/workload_utils.py \
--dataset-source /yourpathto/ShareGPT_V3_unfiltered_cleaned_split.json \
--workload_info_from_yaml True \
--output-file examples/basic/sharedgpt_n3_rate_12_5_3.json \
--model-yaml examples/basic/models.yaml
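To sanity-check the generated workload (its exact schema is determined by workload_utils.py, which we do not assume here), pretty-print the first few lines of the JSON:

python -m json.tool examples/basic/sharedgpt_n3_rate_12_5_3.json | head -n 20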
MuxServe uses NVIDIA MPS (Multi-Process Service) to manage SM resources. We can start the MPS service with the following command:
sudo bash scripts/start_mps.sh examples/basic/mps
After starting the MPS service, the nvidia-log and nvidia-mps directories should appear in examples/basic/mps.
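A quick check that the service is up:

ls examples/basic/mps   # expect to see: nvidia-log  nvidia-mps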
Run MuxServe

Create the log directory and launch MuxServe with the sampled workload:

mkdir -p log/vllm_proc
python -m muxserve.launch examples/basic/model_config.yaml \
--nnodes=1 --node-rank=0 --master-addr=127.0.0.1 \
--nproc_per_node=4 \
--server-port 4145 --flexstore-port 50025 \
--workload-file examples/basic/sharedgpt_n3_rate_12_5_3.json \
2>&1 | tee log/muxserve_test.log
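While the server is running, you can follow its progress from the log file written by tee:

tail -f log/muxserve_test.log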
Run Temporal Multiplexing

To run the temporal multiplexing baseline, first close the MPS service (see Stop the NVIDIA MPS Service below), then run the same command as in Run MuxServe.
Run Spatial Partitioning

To run the spatial partitioning baseline, first close the MPS service (see Stop the NVIDIA MPS Service below), then launch one MuxServe instance per GPU partition. Note that each instance below uses its own CUDA_VISIBLE_DEVICES, --server-port, and --flexstore-port:
CUDA_VISIBLE_DEVICES=0 python -m muxserve.launch examples/basic/model_config_spatial_0.yaml \
--nnodes=1 --node-rank=0 --master-addr=127.0.0.1 \
--nproc_per_node=1 \
--server-port 4145 --flexstore-port 50025 \
--workload-file examples/basic/sharedgpt_n3_rate_12_5_3.json \
--split-by-model llm-0 \
2>&1 | tee log/muxserve_test_spatial_0.log & \
CUDA_VISIBLE_DEVICES=1 python -m muxserve.launch examples/basic/model_config_spatial_1.yaml \
--nnodes=1 --node-rank=0 --master-addr=127.0.0.1 \
--nproc_per_node=1 \
--server-port 4245 --flexstore-port 51025 \
--workload-file examples/basic/sharedgpt_n3_rate_12_5_3.json \
--split-by-model llm-1 \
2>&1 | tee log/muxserve_test_spatial_1.log & \
CUDA_VISIBLE_DEVICES=2,3 python -m muxserve.launch examples/basic/model_config_spatial_2.yaml \
--nnodes=1 --node-rank=0 --master-addr=127.0.0.1 \
--nproc_per_node=2 \
--server-port 4345 --flexstore-port 52025 \
--workload-file examples/basic/sharedgpt_n3_rate_12_5_3.json \
--split-by-model llm-2 \
2>&1 | tee log/muxserve_test_spatial_2.log
Stop the NVIDIA MPS Service
sudo bash scripts/stop_mps.sh examples/basic/mps
Citation

If you use MuxServe in your research, please cite:

@article{duan2024muxserve,
title={MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving},
author={Duan, Jiangfei and Lu, Runyu and Duanmu, Haojie and Li, Xiuhong and Zhang, Xingcheng and Lin, Dahua and Stoica, Ion and Zhang, Hao},
journal={arXiv preprint arXiv:2404.02015},
year={2024}
}