[Bug]: RuntimeError: Loading lora model failed

Your current environment

$ python collect_env.py
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             12
On-line CPU(s) list:                0-11
Thread(s) per core:                 2
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                           7
CPU MHz:                            2200.204
BogoMIPS:                           4400.40
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           6 MiB
L3 cache:                           38.5 MiB
NUMA node0 CPU(s):                  0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT Host state unknown
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] torch                     2.2.0                    pypi_0    pypi
[conda] torchvision               0.17.0                   pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-11            N/A             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I am trying to load a LoRA model with the vllm/vllm-openai Docker image, and the first request that uses it fails with the error below.

Docker command:

docker run --runtime nvidia --gpus all -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=<SNIP>" vllm/vllm-openai --model="mistralai/Mistral-7B-v0.1" --dtype auto --tensor-parallel-size 1  --gpu-memory-utilization 0.9 --max-model-len 4096 --enable-lora --lora-modules ai-lora="ai-maker-space/mistral7b_instruct_generation"

Docker logs:

$ docker run --runtime nvidia --gpus all -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=SNIP" vllm/vllm-openai --model="mistralai/Mistral-7B-v0.1" --dtype auto --tensor-parallel-size 1  --gpu-memory-utilization 0.9 --max-model-len 4096 --enable-lora --lora-modules ai-lora="ai-maker-space/mistral7b_instruct_generation"
INFO 04-03 03:46:22 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='ai-lora', local_path='ai-maker-space/mistral7b_instruct_generation')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mistral-7B-v0.1', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=True, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-03 03:46:23 llm_engine.py:87] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 04-03 03:46:27 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 04-03 03:47:14 llm_engine.py:357] # GPU blocks: 2796, # CPU blocks: 2048
INFO 04-03 03:47:16 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-03 03:47:16 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-03 03:47:23 model_runner.py:756] Graph capturing finished in 7 secs.
WARNING 04-03 03:47:23 serving_chat.py:306] No chat template provided. Chat API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

INFO:     10.52.12.179:53067 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-03 03:47:27 async_llm_engine.py:436] Received request cmpl-dff142a30da3480595415f23d9eba52b: prompt: '<s>[INST] What does following alerts says and how do you resolve this alert? 10.34.4.88 interface xe-0/0/0 is transitioning carrier states [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 733, 16289, 28793, 1824, 1235, 2296, 389, 9916, 2627, 304, 910, 511, 368, 11024, 456, 10977, 28804, 28705, 28740, 28734, 28723, 28770, 28781, 28723, 28781, 28723, 28783, 28783, 4971, 1318, 28706, 28733, 28734, 28748, 28734, 28748, 28734, 349, 8265, 288, 20320, 4605, 733, 28748, 16289, 28793], lora_request: LoRARequest(lora_name='ai-lora', lora_int_id=1, lora_local_path='ai-maker-space/mistral7b_instruct_generation').
INFO 04-03 03:47:27 async_llm_engine.py:133] Aborted request cmpl-dff142a30da3480595415f23d9eba52b.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f7a41870160>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f7a3775ad10>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f7a41870160>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f7a3775ad10>)>
Traceback (most recent call last):
  File "/workspace/vllm/lora/worker_manager.py", line 139, in _load_lora
    lora = self._lora_model_cls.from_local_checkpoint(
  File "/workspace/vllm/lora/models.py", line 214, in from_local_checkpoint
    raise ValueError(f"{lora_dir} doesn't contain tensors")
ValueError: ai-maker-space/mistral7b_instruct_generation doesn't contain tensors

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/workspace/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 393, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/workspace/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 574, in execute_model
    self.set_active_loras(lora_requests, lora_mapping)
  File "/workspace/vllm/worker/model_runner.py", line 660, in set_active_loras
    self.lora_manager.set_active_loras(lora_requests, lora_mapping)
  File "/workspace/vllm/lora/worker_manager.py", line 112, in set_active_loras
    self._apply_loras(lora_requests)
  File "/workspace/vllm/lora/worker_manager.py", line 224, in _apply_loras
    self.add_lora(lora)
  File "/workspace/vllm/lora/worker_manager.py", line 231, in add_lora
    lora = self._load_lora(lora_request)
  File "/workspace/vllm/lora/worker_manager.py", line 150, in _load_lora
    raise RuntimeError(
RuntimeError: Loading lora ai-maker-space/mistral7b_instruct_generation failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/workspace/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 260, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 237, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f7a0af8e8c0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 04-03 03:47:33 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
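
Triage note: the first traceback is raised in from_local_checkpoint, and the args line logs the adapter as local_path='ai-maker-space/mistral7b_instruct_generation', so the value given to --lora-modules appears to be treated as a directory inside the container rather than as a Hub repo ID. Below is a minimal pre-flight sketch, assuming the loader looks for PEFT-style weight files named adapter_model.safetensors or adapter_model.bin. Those file names and both helper names are assumptions for illustration and are not taken from the logs.

```python
from pathlib import Path
from typing import Optional

# Assumed PEFT-style weight file names; not confirmed by the logs above.
ASSUMED_WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def find_lora_weights(lora_dir: str) -> Optional[Path]:
    """Return the first assumed weight file found in lora_dir, else None."""
    root = Path(lora_dir)
    if not root.is_dir():
        return None
    for name in ASSUMED_WEIGHT_FILES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None


def download_adapter(repo_id: str, target_dir: str) -> str:
    """Fetch a Hub repo into target_dir (needs huggingface_hub and network)."""
    # Imported lazily so the local check above works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=target_dir)


if __name__ == "__main__":
    path = "ai-maker-space/mistral7b_instruct_generation"
    # Prints None when no weights are found at that local path.
    print(find_lora_weights(path))
```

If the check returns None, one option (untested here) is to download the adapter to a host directory, mount that directory into the container with -v, and pass the mounted path to --lora-modules instead of the repo ID.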

Application Code:

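from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://[domain]:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_completion = client.chat.completions.create(
    # NOTE: this single dict repeats the "role" and "content" keys, so Python
    # keeps only the last pair and just the user message is sent (this matches
    # the prompt shown in the server log above).
    messages=[{
        "role": "assistant",
        "content": "You are a networking expert.",
        "role": "user",
        "content": "What does following alerts says and how do you resolve this alert? 99.99.99.99 interface xe-0/0/0 is transitioning carrier states"
    }],
    model="ai-lora",  # the LoRA name registered via --lora-modules
    stream=True,
)

# Print the streamed pieces as they arrive.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content, end="")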

App logs:

$ python3 chat.py
NoneTraceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 209, in _receive_event
    event = self._h11_state.next_event()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/h11/_connection.py", line 469, in next_event
    event = self._extract_next_receive_event()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/h11/_connection.py", line 419, in _extract_next_receive_event
    event = self._reader.read_eof()  # type: ignore[attr-defined]
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/h11/_readers.py", line 204, in read_eof
    raise RemoteProtocolError(
h11._util.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
    yield
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 110, in __iter__
    for part in self._httpcore_stream:
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 347, in __iter__
    for part in self._stream:
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 337, in __iter__
    raise exc
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 329, in __iter__
    for chunk in self._connection._receive_response_body(**kwargs):
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 198, in _receive_response_body
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 208, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/aniket/Desktop/DemoS2LLM/chat.py", line 23, in <module>
    for chunk in chat_completion:
  File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 44, in __iter__
    for item in self._iterator:
  File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 56, in __stream__
    for sse in iterator:
  File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 48, in _iter_events
    yield from self._decoder.iter(self.response.iter_lines())
  File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 224, in iter
    for line in iterator:
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 857, in iter_lines
    for text in self.iter_text():
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 844, in iter_text
    for byte_content in self.iter_bytes():
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 823, in iter_bytes
    for raw_bytes in self.iter_raw():
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 881, in iter_raw
    for raw_stream_bytes in self.stream:
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_client.py", line 123, in __iter__
    for chunk in self._stream:
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 109, in __iter__
    with map_httpcore_exceptions():
  File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

vllm-project / vllm

[Bug]: RuntimeError: Loading lora model failed #3815
