vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: multi lora request bug #7103

[Open] sukibean163 opened this issue 1 month ago

sukibean163 commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.31

Python version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H800
GPU 1: NVIDIA H800
GPU 2: NVIDIA H800
GPU 3: NVIDIA H800
GPU 4: NVIDIA H800
GPU 5: NVIDIA H800
GPU 6: NVIDIA H800
GPU 7: NVIDIA H800

Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          172
On-line CPU(s) list:             0-171
Thread(s) per core:              2
Core(s) per socket:              43
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8476C
Stepping:                        6
CPU MHz:                         2600.000
BogoMIPS:                        5200.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4 MiB
L1i cache:                       2.7 MiB
L2 cache:                        172 MiB
L3 cache:                        195 MiB
NUMA node0 CPU(s):               0-85
NUMA node1 CPU(s):               86-171
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq movdiri movdir64b fsrm arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.18.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.1
[pip3] torchmetrics==1.4.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torchmetrics              1.4.0                    pypi_0    pypi
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.42.4                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.5 8.0 8.6 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-85    0       N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-85    0       N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     0-85    0       N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     0-85    0       N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    86-171  1       N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    86-171  1       N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    86-171  1       N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     86-171  1       N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7

🐛 Describe the bug

Server

vllm serve qwen/Qwen2-7B-Instruct --port 8000 --served-model-name gpt-3.5-turbo --disable-log-stats --tensor-parallel-size 4 --gpu-memory-utilization 0.25 --enable-lora --lora-modules lora1=saves/qwen2_lora1/lora/sft lora2=saves/qwen2_lora2/lora/sft
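
If the server came up correctly, both adapters should appear as model names on the OpenAI-compatible model list (this assumes the standard /v1/models endpoint; the port matches the command above):

curl http://localhost:8000/v1/models

This should list gpt-3.5-turbo along with lora1 and lora2.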

Client (client.py)

from openai import OpenAI
import sys
import time

port = sys.argv[1] if len(sys.argv) > 1 else 8000
model = sys.argv[2] if len(sys.argv) > 2 else "gpt-3.5-turbo"

# The OpenAI client needs a full base URL, including the scheme
api_base = f'http://localhost:{port}/v1'
client = OpenAI(base_url=api_base, api_key="xxx")

while True:
    start = time.time()
    # Call chat.completions.create with streaming enabled
    response = client.chat.completions.create(
        model=model,
        messages=[
            # {"role": "system", "content": "You are a helpful assistant."},
            # Prompt: "Tell a piece of flash fiction, within 500 characters."
            {"role": "user", "content": "讲一个微小说,500字以内。"}
        ],
        stream=True  # enable the streaming interface
    )

    # Receive the stream chunk by chunk
    idx = 0
    for chunk in response:
        end = time.time()
        if idx == 0:
            # Time to first token
            print(f"time:{end-start}")
            print(chunk.choices[0].delta.content)
        start = time.time()
        idx += 1
    time.sleep(.1)

Steps to reproduce

1. Start the server.
2. Start two clients, one per adapter (a minimal driver script follows this list):
    * python client.py 8000 lora1
    * python client.py 8000 lora2
3. Stop one of the clients.
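
A minimal sketch of steps 2-3, assuming client.py above is in the current directory (the 30-second wait is arbitrary; any interval while both streams are active should do):

import subprocess
import time

# Step 2: start one client per adapter
p1 = subprocess.Popen(["python", "client.py", "8000", "lora1"])
p2 = subprocess.Popen(["python", "client.py", "8000", "lora2"])

time.sleep(30)  # let both clients stream for a while

# Step 3: stop one client; the server-side error appears after this
p1.terminate()
p2.wait()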

Bug log

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 228, in chat_completion_stream_generator
    async for res in result_generator:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 888, in _process_request
    raise e
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 884, in _process_request
    async for request_output in stream:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 228, in chat_completion_stream_generator
    async for res in result_generator:
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 873, in _process_request
    stream = await self.add_request(
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 676, in add_request
    self.start_background_loop()
  File "/home/jovyan/conda-env/envs/xqh_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 516, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

jeejeelee commented 1 month ago

It seems there are no errors related to LoRA. Could you use offline inference to check if there's an issue with LoRA?
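
Something along these lines should work as an offline check (adapter paths and flags taken from your serve command above; the prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA enabled, mirroring the serve flags above
llm = LLM(model="qwen/Qwen2-7B-Instruct",
          tensor_parallel_size=4,
          gpu_memory_utilization=0.25,
          enable_lora=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# One LoRARequest per adapter: (name, unique integer id, local adapter path)
adapters = [("lora1", 1, "saves/qwen2_lora1/lora/sft"),
            ("lora2", 2, "saves/qwen2_lora2/lora/sft")]

# Same prompt as client.py: "Tell a piece of flash fiction, within 500 characters."
prompt = "讲一个微小说,500字以内。"

for name, lora_id, path in adapters:
    outputs = llm.generate([prompt], sampling_params,
                           lora_request=LoRARequest(name, lora_id, path))
    print(f"[{name}] {outputs[0].outputs[0].text}")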