vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: vLLM serving Llama-3.2-3B process dies on API call (/chat/completions) #10098

Closed: jasonkim5672 closed this issue 6 days ago

jasonkim5672 commented 1 week ago

Your current environment

Running on WSL. CPU: 11th Gen Intel(R) Core(TM) i7-11600H @ 2.90GHz. RAM: 32GB.

Collecting environment information...
WARNING 11-07 13:10:53 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/home/jason/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.0 (default, Nov  7 2024, 09:05:24) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Nvidia driver version: 565.90
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   39 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       GenuineIntel
Model name:                      11th Gen Intel(R) Core(TM) i7-11600H @ 2.90GHz
CPU family:                      6
Model:                           141
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        1
BogoMIPS:                        5836.96
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       288 KiB (6 instances)
L1i cache:                       192 KiB (6 instances)
L2 cache:                        7.5 MiB (6 instances)
L3 cache:                        18 MiB (1 instance)
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

Hi, I'm stuck on a problem when making an API call (v0.6.3.post1).


I started the server as below:

 vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --chat-template examples/tool_chat_template_llama3.2_json.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --dtype auto \
    --host localhost \
    --port 8008 \
    --max-num-batched-tokens 2048 \
    --uvicorn-log-level debug \
    --max-model-len 2048 \
    --device cpu \
    --disable-sliding-window \
    --disable-log-requests \
    --load-format auto \
    --swap-space 2

The process starts successfully, and the /v1/models API works fine too. But when I call /v1/chat/completions, it throws an error and the process dies, as shown below (an example of the request I'm sending follows the log).

INFO:     Started server process [21026]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8008) (Press CTRL+C to quit)
INFO 11-07 10:49:47 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-07 10:49:57 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-07 10:50:07 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
ERROR 11-07 10:50:10 engine.py:158] TypeError("'NoneType' object is not subscriptable")
ERROR 11-07 10:50:10 engine.py:158] Traceback (most recent call last):
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 156, in start
ERROR 11-07 10:50:10 engine.py:158]     self.run_engine_loop()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 219, in run_engine_loop
ERROR 11-07 10:50:10 engine.py:158]     request_outputs = self.engine_step()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 237, in engine_step
ERROR 11-07 10:50:10 engine.py:158]     raise e
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 228, in engine_step
ERROR 11-07 10:50:10 engine.py:158]     return self.engine.step()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1389, in step
ERROR 11-07 10:50:10 engine.py:158]     outputs = self.model_executor.execute_model(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/executor/cpu_executor.py", line 229, in execute_model
ERROR 11-07 10:50:10 engine.py:158]     output = self.driver_method_invoker(self.driver_worker,
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/executor/cpu_executor.py", line 385, in _driver_method_invoker
ERROR 11-07 10:50:10 engine.py:158]     return getattr(driver, method)(*args, **kwargs)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 11-07 10:50:10 engine.py:158]     inputs = self.prepare_input(execute_model_req)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 11-07 10:50:10 engine.py:158]     return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 11-07 10:50:10 engine.py:158]     self.model_runner.prepare_model_input(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 489, in prepare_model_input
ERROR 11-07 10:50:10 engine.py:158]     model_input = self._prepare_model_input_tensors(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 477, in _prepare_model_input_tensors
ERROR 11-07 10:50:10 engine.py:158]     return builder.build()  # type: ignore
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 131, in build
ERROR 11-07 10:50:10 engine.py:158]     multi_modal_kwargs) = self._prepare_prompt(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 244, in _prepare_prompt
ERROR 11-07 10:50:10 engine.py:158]     block_number = block_table[i //
ERROR 11-07 10:50:10 engine.py:158] TypeError: 'NoneType' object is not subscriptable
INFO:     127.0.0.1:39774 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 624, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 598, in _process_request
    raise request_output
TypeError: 'NoneType' object is not subscriptable
ERROR 11-07 10:50:20 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 11-07 10:50:20 client.py:250] NoneType: None
^CINFO 11-07 10:55:32 launcher.py:57] Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.

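For reference, the request that triggers the crash looks roughly like this (the exact payload here is only an illustration; the endpoint and port match the serve command above):

 curl http://localhost:8008/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.2-3B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64
        }'
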
I suspect the chat template or the tool-call parser is one reason the value of block_table is None.

Is anyone else having the same problem, or am I doing something wrong?


jikunshang commented 1 week ago

Why did you set the device to cpu when you have a CUDA device on your machine?

jasonkim5672 commented 1 week ago

> Why did you set the device to cpu when you have a CUDA device on your machine?

@jikunshang I tried with --device cuda, but it throws a CUDA out-of-memory error (the GPU only has 4GB) and the process fails to start. That's why I'm trying cpu instead. Could that be the reason?
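
For completeness, before falling back to CPU I could also try shrinking the GPU footprint with flags like the ones below (just a sketch; a 3B model in 16-bit weights is already around 6 GB, so it may still not fit in 4 GB of VRAM without quantization):

 vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --device cuda \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --swap-space 2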

jikunshang commented 1 week ago

If you choose to use the CPU, you need to install the vLLM CPU build instead; the default binary doesn't support running on CPU.
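
A rough sketch of how the CPU backend is typically built from source around this release (exact requirement file names and steps may differ between versions; the official CPU installation docs are the authoritative reference):

 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 VLLM_TARGET_DEVICE=cpu python setup.py install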