vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: assert parts[0] == "base_model" AssertionError #4682

Closed Edisonwei54 closed 4 months ago

Edisonwei54 commented 5 months ago

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (conda-forge gcc 13.2.0-7) 13.2.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX A6000
Nvidia driver version: 535.171.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7F72 24-Core Processor
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          2
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        3200.0000
CPU min MHz:                        2500.0000
BogoMIPS:                           6400.17
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization:                     AMD-V
L1d cache:                          1.5 MiB (48 instances)
L1i cache:                          1.5 MiB (48 instances)
L2 cache:                           24 MiB (48 instances)
L3 cache:                           384 MiB (24 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
[conda] vllm-nccl-cu12            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23,48-71      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

```
python -m vllm.entrypoints.openai.api_server \
    --model /mnt/sda/edison/llama3/Meta-Llama-3-8B-Instruct \
    --enable-lora \
    --lora-modules test-lora=/mnt/sda/edison/llama3/checkpoint-441 \
    --gpu-memory-utilization 0.9 \
    --host 0.0.0.0 \
    --port 8008 \
    --tensor-parallel-size 1 \
    --enforce-eager
```

Edisonwei54 commented 5 months ago

INFO 05-08 11:38:43 async_llm_engine.py:529] Received request cmpl-cba3bae644234c86b209b60b0e93273b-0: prompt: '你好', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [57668, 53901], lora_request: LoRARequest(lora_name='test-lora', lora_int_id=1, lora_local_path='/mnt/sda/edison/llama3/checkpoint-441').
INFO 05-08 11:38:43 async_llm_engine.py:154] Aborted request cmpl-cba3bae644234c86b209b60b0e93273b-0.
INFO:     192.168.31.138:55550 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 150, in _load_lora
    lora = self._lora_model_cls.from_local_checkpoint(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/models.py", line 246, in from_local_checkpoint
    return cls.from_lora_tensors(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/models.py", line 150, in from_lora_tensors
    module_name, is_lora_a = parse_fine_tuned_lora_name(tensor_name)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/utils.py", line 89, in parse_fine_tuned_lora_name
    assert parts[0] == "base_model"
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 660, in generate
    async for request_output in stream:
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/envs/vllm/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 148, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/opt/conda/envs/vllm/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 249, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 790, in execute_model
    self.set_active_loras(lora_requests, lora_mapping)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 901, in set_active_loras
    self.lora_manager.set_active_loras(lora_requests, lora_mapping)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 113, in set_active_loras
    self._apply_loras(lora_requests)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 235, in _apply_loras
    self.add_lora(lora)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 243, in add_lora
    lora = self._load_lora(lora_request)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 162, in _load_lora
    raise RuntimeError(
RuntimeError: Loading lora /mnt/sda/edison/llama3/checkpoint-441 failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 114, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_completion.py", line 154, in create_completion
    async for i, res in result_generator:
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 240, in consumer
    raise e
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 233, in consumer
    raise item
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 217, in producer
    async for item in iterator:
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 650, in generate
    stream = await self.add_request(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
    self.start_background_loop()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Edisonwei54 commented 5 months ago

When I use:

```
curl http://0.0.0.0:8008/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "test-lora",
        "prompt": "Hello",
        "max_tokens": 128,
        "temperature": 0.7
    }' | jq
```

Edisonwei54 commented 5 months ago

@WoosukKwon @zhuohan123

DarkLight1337 commented 5 months ago

You need to use the --served-model-name argument to set the name of your model. Otherwise, you can only refer to it via the value passed to --model (in your example, that is /mnt/sda/edison/llama3/Meta-Llama-3-8B-Instruct).
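
For illustration only (the served name below is hypothetical, not from the report; host and port are taken from the launch command): if the server were additionally started with `--served-model-name llama3-8b`, the request's `model` field would have to be either that served name (for the base model) or one of the `--lora-modules` names such as `test-lora`. A minimal client-side sketch using the openai>=1.0 package:

```python
# Client-side sketch; "llama3-8b" is an assumed --served-model-name, chosen here
# for illustration. host/port match the report's launch command.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8008/v1", api_key="EMPTY")

resp = client.completions.create(
    model="test-lora",   # LoRA name from --lora-modules; use "llama3-8b" for the base model
    prompt="Hello",
    max_tokens=128,
    temperature=0.7,
)
print(resp.choices[0].text)
```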

kubernetes-bad commented 5 months ago

This has nothing to do with the model name, FYI. vLLM expects LoRA adapter tensors to be named like base_model.model.lm_head.lora_A.weight and base_model.model.lm_head.lora_B.weight, while some adapters only contain base_model.model.lm_head.weight, which fails that assert (vllm/lora/utils.py#L89).
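
A quick way to check whether an adapter will pass that parser is to list the tensor names in the checkpoint. A diagnostic sketch (it assumes the adapter was saved by PEFT as `adapter_model.safetensors`; adjust the filename if the checkpoint ships `adapter_model.bin` instead, and note that the authoritative naming rules live in vllm/lora/utils.py, so the check below is only an approximation):

```python
# List adapter tensor names and flag ones that vLLM's parse_fine_tuned_lora_name
# is unlikely to accept. The exact rules are in vllm/lora/utils.py; the
# startswith/contains checks here are a rough approximation of them.
from safetensors.torch import load_file

tensors = load_file("/mnt/sda/edison/llama3/checkpoint-441/adapter_model.safetensors")

for name in sorted(tensors):
    looks_ok = name.startswith("base_model.") and ("lora_A" in name or "lora_B" in name)
    print(("OK   " if looks_ok else "BAD  ") + name)
```

Names without a lora_A/lora_B component, such as the base_model.model.lm_head.weight case described above, would show up as BAD here.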