vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: vLLM serving Llama-3.2-3B process dies on API call (/chat/completions) #10098

Closed: jasonkim5672 closed this issue 6 days ago

jasonkim5672 commented 1 week ago

Your current environment

Running on WSL. CPU: 11th Gen Intel(R) Core(TM) i7-11600H @ 2.90GHz. RAM: 32GB.

Collecting environment information...
WARNING 11-07 13:10:53 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/home/jason/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.0 (default, Nov  7 2024, 09:05:24) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Nvidia driver version: 565.90
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   39 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       GenuineIntel
Model name:                      11th Gen Intel(R) Core(TM) i7-11600H @ 2.90GHz
CPU family:                      6
Model:                           141
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        1
BogoMIPS:                        5836.96
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       288 KiB (6 instances)
L1i cache:                       192 KiB (6 instances)
L2 cache:                        7.5 MiB (6 instances)
L3 cache:                        18 MiB (1 instance)
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

Hi, I'm stuck on a problem when making an API call (v0.6.3.post1).


I started the server as below:

 vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --chat-template examples/tool_chat_template_llama3.2_json.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --dtype auto \
    --host localhost \
    --port 8008 \
    --max-num-batched-tokens 2048 \
    --uvicorn-log-level debug \
    --max-model-len 2048 \
    --device cpu \
    --disable-sliding-window \
    --disable-log-requests \
    --load-format auto \
    --swap-space 2

The process starts successfully, and the /v1/models API works fine too. But when I call /v1/chat/completions, it throws an error and the process dies, as shown below (an example of the request I'm sending follows the log).

INFO:     Started server process [21026]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8008) (Press CTRL+C to quit)
INFO 11-07 10:49:47 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-07 10:49:57 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-07 10:50:07 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
ERROR 11-07 10:50:10 engine.py:158] TypeError("'NoneType' object is not subscriptable")
ERROR 11-07 10:50:10 engine.py:158] Traceback (most recent call last):
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 156, in start
ERROR 11-07 10:50:10 engine.py:158]     self.run_engine_loop()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 219, in run_engine_loop
ERROR 11-07 10:50:10 engine.py:158]     request_outputs = self.engine_step()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 237, in engine_step
ERROR 11-07 10:50:10 engine.py:158]     raise e
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 228, in engine_step
ERROR 11-07 10:50:10 engine.py:158]     return self.engine.step()
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1389, in step
ERROR 11-07 10:50:10 engine.py:158]     outputs = self.model_executor.execute_model(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/executor/cpu_executor.py", line 229, in execute_model
ERROR 11-07 10:50:10 engine.py:158]     output = self.driver_method_invoker(self.driver_worker,
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/executor/cpu_executor.py", line 385, in _driver_method_invoker
ERROR 11-07 10:50:10 engine.py:158]     return getattr(driver, method)(*args, **kwargs)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 11-07 10:50:10 engine.py:158]     inputs = self.prepare_input(execute_model_req)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 11-07 10:50:10 engine.py:158]     return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 11-07 10:50:10 engine.py:158]     self.model_runner.prepare_model_input(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 489, in prepare_model_input
ERROR 11-07 10:50:10 engine.py:158]     model_input = self._prepare_model_input_tensors(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 477, in _prepare_model_input_tensors
ERROR 11-07 10:50:10 engine.py:158]     return builder.build()  # type: ignore
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 131, in build
ERROR 11-07 10:50:10 engine.py:158]     multi_modal_kwargs) = self._prepare_prompt(
ERROR 11-07 10:50:10 engine.py:158]   File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/worker/cpu_model_runner.py", line 244, in _prepare_prompt
ERROR 11-07 10:50:10 engine.py:158]     block_number = block_table[i //
ERROR 11-07 10:50:10 engine.py:158] TypeError: 'NoneType' object is not subscriptable
INFO:     127.0.0.1:39774 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 624, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/jason/.pyenv/versions/3.10.0/envs/vllm_venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 598, in _process_request
    raise request_output
TypeError: 'NoneType' object is not subscriptable
ERROR 11-07 10:50:20 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 11-07 10:50:20 client.py:250] NoneType: None
^CINFO 11-07 10:55:32 launcher.py:57] Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.

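For reference, the request that triggers the crash looks roughly like this (the exact payload here is only an illustration; the endpoint and port match the serve command above):

 curl http://localhost:8008/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.2-3B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64
        }'
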
I suspect the chat template or the tool-call parser is one reason the value of block_table is None.

Is anyone else having the same problem, or am I doing something wrong?


jikunshang commented 1 week ago

Why did you set the device to cpu when you have a CUDA device on your machine?

jasonkim5672 commented 1 week ago

> Why did you set the device to cpu when you have a CUDA device on your machine?

@jikunshang I tried with --device cuda, but it throws a CUDA out-of-memory error (the GPU only has 4GB) and the process fails to start. That's why I'm trying cpu instead. Could that be the reason?
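
For completeness, before falling back to CPU I could also try shrinking the GPU footprint with flags like the ones below (just a sketch; a 3B model in 16-bit weights is already around 6 GB, so it may still not fit in 4 GB of VRAM without quantization):

 vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --device cuda \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --swap-space 2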

jikunshang commented 1 week ago

If you choose to use the CPU, you need to install the vLLM CPU build instead; the default binary doesn't support running on CPU.
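
A rough sketch of how the CPU backend is typically built from source around this release (exact requirement file names and steps may differ between versions; the official CPU installation docs are the authoritative reference):

 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 VLLM_TARGET_DEVICE=cpu python setup.py install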