Your current environment
$ python collect_env.py
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.204
BogoMIPS: 4400.40
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 6 MiB
L3 cache: 38.5 MiB
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 2.2.0 pypi_0 pypi
[conda] torchvision 0.17.0 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-11 N/A N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
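I am trying to serve a LoRA adapter with the vLLM OpenAI-compatible server. The server starts, but the first request that uses the adapter fails with "ValueError: ai-maker-space/mistral7b_instruct_generation doesn't contain tensors", and the engine then stops with AsyncEngineDeadError.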
Docker Cmd:
docker run --runtime nvidia --gpus all -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=<SNIP>" vllm/vllm-openai --model="mistralai/Mistral-7B-v0.1" --dtype auto --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 4096 --enable-lora --lora-modules ai-lora="ai-maker-space/mistral7b_instruct_generation"
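The server log below records the adapter as local_path='ai-maker-space/mistral7b_instruct_generation', and the failure comes from from_local_checkpoint. That suggests, as an assumption this report does not confirm, that the value after `ai-lora=` is read as a directory inside the container rather than as a Hugging Face repo ID. A minimal pre-flight sketch that checks whether a path looks like a saved adapter directory; check_lora_dir is a hypothetical helper, and the file names are the ones PEFT conventionally writes, not names confirmed from vLLM:

```python
from pathlib import Path

# Assumed file names: what PEFT conventionally writes when saving an adapter.
# That the server looks for exactly these names is an assumption.
CONFIG_FILE = "adapter_config.json"
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def check_lora_dir(path: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(path)
    if not root.is_dir():
        return [f"{path} is not a directory on this filesystem"]
    problems = []
    if not (root / CONFIG_FILE).is_file():
        problems.append(f"missing {CONFIG_FILE}")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing adapter weights (" + " or ".join(WEIGHT_FILES) + ")")
    return problems


if __name__ == "__main__":
    # Run inside the container, with the same value that follows "ai-lora=".
    for problem in check_lora_dir("ai-maker-space/mistral7b_instruct_generation"):
        print(problem)
```

If the check reports that the path is not a directory, one option is to download the adapter first (for example with huggingface_hub.snapshot_download), mount that directory into the container, and pass the in-container path after `ai-lora=`.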
Docker logs:
$ docker run --runtime nvidia --gpus all -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=SNIP" vllm/vllm-openai --model="mistralai/Mistral-7B-v0.1" --dtype auto --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 4096 --enable-lora --lora-modules ai-lora="ai-maker-space/mistral7b_instruct_generation"
INFO 04-03 03:46:22 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=[LoRA(name='ai-lora', local_path='ai-maker-space/mistral7b_instruct_generation')], chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mistral-7B-v0.1', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=True, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-03 03:46:23 llm_engine.py:87] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 04-03 03:46:27 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 04-03 03:47:14 llm_engine.py:357] # GPU blocks: 2796, # CPU blocks: 2048
INFO 04-03 03:47:16 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-03 03:47:16 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-03 03:47:23 model_runner.py:756] Graph capturing finished in 7 secs.
WARNING 04-03 03:47:23 serving_chat.py:306] No chat template provided. Chat API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
INFO: 10.52.12.179:53067 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-03 03:47:27 async_llm_engine.py:436] Received request cmpl-dff142a30da3480595415f23d9eba52b: prompt: '<s>[INST] What does following alerts says and how do you resolve this alert? 10.34.4.88 interface xe-0/0/0 is transitioning carrier states [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 733, 16289, 28793, 1824, 1235, 2296, 389, 9916, 2627, 304, 910, 511, 368, 11024, 456, 10977, 28804, 28705, 28740, 28734, 28723, 28770, 28781, 28723, 28781, 28723, 28783, 28783, 4971, 1318, 28706, 28733, 28734, 28748, 28734, 28748, 28734, 349, 8265, 288, 20320, 4605, 733, 28748, 16289, 28793], lora_request: LoRARequest(lora_name='ai-lora', lora_int_id=1, lora_local_path='ai-maker-space/mistral7b_instruct_generation').
INFO 04-03 03:47:27 async_llm_engine.py:133] Aborted request cmpl-dff142a30da3480595415f23d9eba52b.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f7a41870160>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f7a3775ad10>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f7a41870160>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f7a3775ad10>)>
Traceback (most recent call last):
File "/workspace/vllm/lora/worker_manager.py", line 139, in _load_lora
lora = self._lora_model_cls.from_local_checkpoint(
File "/workspace/vllm/lora/models.py", line 214, in from_local_checkpoint
raise ValueError(f"{lora_dir} doesn't contain tensors")
ValueError: ai-maker-space/mistral7b_instruct_generation doesn't contain tensors
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
task.result()
File "/workspace/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/workspace/vllm/engine/async_llm_engine.py", line 393, in engine_step
request_outputs = await self.engine.step_async()
File "/workspace/vllm/engine/async_llm_engine.py", line 189, in step_async
all_outputs = await self._run_workers_async(
File "/workspace/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/worker.py", line 223, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/model_runner.py", line 574, in execute_model
self.set_active_loras(lora_requests, lora_mapping)
File "/workspace/vllm/worker/model_runner.py", line 660, in set_active_loras
self.lora_manager.set_active_loras(lora_requests, lora_mapping)
File "/workspace/vllm/lora/worker_manager.py", line 112, in set_active_loras
self._apply_loras(lora_requests)
File "/workspace/vllm/lora/worker_manager.py", line 224, in _apply_loras
self.add_lora(lora)
File "/workspace/vllm/lora/worker_manager.py", line 231, in add_lora
lora = self._load_lora(lora_request)
File "/workspace/vllm/lora/worker_manager.py", line 150, in _load_lora
raise RuntimeError(
RuntimeError: Loading lora ai-maker-space/mistral7b_instruct_generation failed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
raise exc
File "/workspace/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 260, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 237, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f7a0af8e8c0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 04-03 03:47:33 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
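A note on reading the server log above: the tracebacks are joined by "The above exception was the direct cause of the following exception". With that wording Python prints the original error first, so the ValueError raised in from_local_checkpoint is the root cause and AsyncEngineDeadError is only the last wrapper. A minimal sketch of the same chaining with stand-in functions (load_adapter, apply_adapter, and root_cause are hypothetical names, not vLLM code):

```python
def load_adapter(path: str) -> None:
    # Stand-in for the loader: always fails with the message seen in the log.
    raise ValueError(f"{path} doesn't contain tensors")


def apply_adapter(path: str) -> None:
    try:
        load_adapter(path)
    except ValueError as exc:
        # "raise ... from exc" sets __cause__, which is what produces the
        # "direct cause" line between two tracebacks.
        raise RuntimeError(f"Loading lora {path} failed") from exc


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ links back to the first exception in the chain."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


try:
    apply_adapter("some/adapter")
except RuntimeError as err:
    first = root_cause(err)
    print(type(first).__name__, "-", first)  # ValueError - some/adapter doesn't contain tensors
```

The ASGI traceback and the ExceptionGroup that follow in the log come after the engine has already stopped, so they are likely consequences rather than separate problems.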
Application Code:
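from openai import OpenAI

# The server was started without an API key (api_key=None in its args),
# so a placeholder value is passed here.
openai_api_key = "EMPTY"
openai_api_base = "http://[domain]:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_completion = client.chat.completions.create(
    # Each message is its own dict; repeating "role"/"content" inside one
    # dict keeps only the last pair.
    # Assumes the served chat template accepts a system message.
    messages=[
        {"role": "system", "content": "You are a networking expert."},
        {
            "role": "user",
            "content": "What does following alerts says and how do you resolve this alert? 99.99.99.99 interface xe-0/0/0 is transitioning carrier states",
        },
    ],
    model="ai-lora",  # name registered with --lora-modules
    stream=True,
)

for chunk in chat_completion:
    # delta.content can be None (for example on the first chunk), so fall back to "".
    print(chunk.choices[0].delta.content or "", end="")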
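One detail worth checking in request code like this: a Python dict literal that repeats a key keeps only the last value for that key, so two role/content pairs written inside one pair of braces collapse into a single message. That would be consistent with the server log above, where the logged prompt holds only the user text. A small sketch with shortened, made-up message text:

```python
# One dict literal with repeated keys: later values silently overwrite earlier ones.
collapsed = [{
    "role": "assistant",
    "content": "You are a networking expert.",
    "role": "user",
    "content": "What does this alert mean?",
}]

# One dict per message keeps both entries.
separate = [
    {"role": "system", "content": "You are a networking expert."},
    {"role": "user", "content": "What does this alert mean?"},
]

print(len(collapsed), collapsed[0]["role"])          # 1 user
print(len(separate), [m["role"] for m in separate])  # 2 ['system', 'user']
```

This is unrelated to the LoRA loading failure itself; it only affects which messages reach the model.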
App logs:
$ python3 chat.py
NoneTraceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 209, in _receive_event
event = self._h11_state.next_event()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/h11/_connection.py", line 469, in next_event
event = self._extract_next_receive_event()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/h11/_connection.py", line 419, in _extract_next_receive_event
event = self._reader.read_eof() # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/h11/_readers.py", line 204, in read_eof
raise RemoteProtocolError(
h11._util.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
yield
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 110, in __iter__
for part in self._httpcore_stream:
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 347, in __iter__
for part in self._stream:
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 337, in __iter__
raise exc
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 329, in __iter__
for chunk in self._connection._receive_response_body(**kwargs):
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 198, in _receive_response_body
event = self._receive_event(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_sync/http11.py", line 208, in _receive_event
with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/homebrew/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/aniket/Desktop/DemoS2LLM/chat.py", line 23, in <module>
for chunk in chat_completion:
File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 44, in __iter__
for item in self._iterator:
File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 56, in __stream__
for sse in iterator:
File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 48, in _iter_events
yield from self._decoder.iter(self.response.iter_lines())
File "/opt/homebrew/lib/python3.11/site-packages/openai/_streaming.py", line 224, in iter
for line in iterator:
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 857, in iter_lines
for text in self.iter_text():
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 844, in iter_text
for byte_content in self.iter_bytes():
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 823, in iter_bytes
for raw_bytes in self.iter_raw():
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_models.py", line 881, in iter_raw
for raw_stream_bytes in self.stream:
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_client.py", line 123, in __iter__
for chunk in self._stream:
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 109, in __iter__
with map_httpcore_exceptions():
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/homebrew/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
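The "None" printed just before the first client traceback is most likely the content of a streamed chunk whose delta carried no text, printed as-is. The RemoteProtocolError that follows matches the server closing the stream once the engine died. A minimal sketch of collecting streamed text while skipping empty deltas; collect_text and fake_chunk are hypothetical helpers, and the stand-in objects only imitate the attribute path used in the application code:

```python
from types import SimpleNamespace


def collect_text(chunks) -> str:
    """Join streamed delta contents, treating a missing (None) content as empty."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        parts.append(content or "")
    return "".join(parts)


def fake_chunk(content):
    # Hypothetical stand-in shaped like a streamed chat completion chunk.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [fake_chunk(None), fake_chunk("Hello"), fake_chunk(", world")]
print(collect_text(stream))  # prints "Hello, world"
```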