sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] Make multi-lora serving compatible with cuda graph and radix cache #1921

Open LIUKAI0815 opened 2 weeks ago

LIUKAI0815 commented 2 weeks ago

Checklist

Describe the bug

Traceback (most recent call last):
  File "/root/miniconda3/envs/sglang/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/sglang/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/root/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/root/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/server.py", line 436, in launch_server
    launch_engine(server_args=server_args)
  File "/root/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/server.py", line 349, in launch_engine
    server_args.check_server_args()
  File "/root/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/server_args.py", line 698, in check_server_args
    and (self.lora_paths is None or self.disable_radix_cache)
AssertionError: compatibility of lora and cuda graph and radix attention is in progress
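For context, the assertion that fails here is in ServerArgs.check_server_args (sglang/srt/server_args.py). The following is a minimal, hypothetical sketch of that check, reconstructed from the traceback and the error message: ServerArgsSketch and its exact field layout are illustrative, the disable_radix_cache condition is visible in the traceback, and the disable_cuda_graph condition is inferred from the error message rather than shown directly.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ServerArgsSketch:
    # Hypothetical stand-in for sglang.srt.server_args.ServerArgs,
    # keeping only the fields involved in the failing check.
    lora_paths: Optional[List[str]] = None
    disable_cuda_graph: bool = False   # assumed counterpart, inferred from the error message
    disable_radix_cache: bool = False

    def check_server_args(self):
        # Sketch of the compatibility check from the traceback: serving LoRA
        # adapters currently requires disabling both CUDA graph and the radix cache.
        assert (self.lora_paths is None or self.disable_cuda_graph) and (
            self.lora_paths is None or self.disable_radix_cache
        ), "compatibility of lora and cuda graph and radix attention is in progress"

# Fails, like the report above:
# ServerArgsSketch(lora_paths=["role=/path/to/adapter"]).check_server_args()
# Passes once both features are disabled:
# ServerArgsSketch(lora_paths=["role=/path/to/adapter"],
#                  disable_cuda_graph=True, disable_radix_cache=True).check_server_args()

In other words, as of 0.3.5, passing --lora-paths without also disabling CUDA graph and the radix cache makes the server abort during argument checking, before launch.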

Reproduction

export CUDA_VISIBLE_DEVICES=2
export VLLM_USE_MODELSCOPE=False
python -m sglang.launch_server \
    --model-path ./Qwen2_5-14B-Instruct-AWQ \
    --port 2015 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --quantization awq \
    --attention-backend flashinfer \
    --lora-paths role=/workspace/output/role/qwen/qwen2_5-14b-instruct-awq/v1-20241101-133149/checkpoint-1550

Environment

sglang 0.3.5

merrymercy commented 1 week ago

Try adding --disable-radix-cache and/or --disable-cuda-graph.
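For example, the reproduction command above would become the following; the paths are unchanged from the report and only the two disable flags are new:

export CUDA_VISIBLE_DEVICES=2
python -m sglang.launch_server \
    --model-path ./Qwen2_5-14B-Instruct-AWQ \
    --port 2015 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --quantization awq \
    --attention-backend flashinfer \
    --disable-cuda-graph \
    --disable-radix-cache \
    --lora-paths role=/workspace/output/role/qwen/qwen2_5-14b-instruct-awq/v1-20241101-133149/checkpoint-1550

This gives up CUDA graph execution and prefix caching, so it costs some performance, but it lets the server start with the LoRA adapter loaded until the compatibility work tracked in this issue lands.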

DhruvaBansal00 commented 2 days ago

@merrymercy I am interested in taking this issue up, along with supporting dynamic loading/unloading of LoRA adapters.