vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
26.02k stars 3.81k forks

[Bug]: Spec. decode fails for requests with n>1 or best_of>1 #6137

Open tdoublep opened 2 months ago

tdoublep commented 2 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8474C
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 48
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3800.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          4.5 MiB (96 instances)
L1i cache:                          3 MiB (96 instances)
L2 cache:                           192 MiB (96 instances)
L3 cache:                           195 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-47,96-143
NUMA node1 CPU(s):                  48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] transformers              4.41.2                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX PIX 0-47,96-143 0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS 0-47,96-143 0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    SYS SYS 0-47,96-143 0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    SYS SYS 0-47,96-143 0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS 48-95,144-191   1       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS 48-95,144-191   1       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS 48-95,144-191   1       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS 48-95,144-191   1       N/A
NIC0    PIX SYS SYS SYS SYS SYS SYS SYS  X  PIX             
NIC1    PIX SYS SYS SYS SYS SYS SYS SYS PIX  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

🐛 Describe the bug

If one sends a request with n>1 (or best_of>1) to a server with speculative decoding enabled, the request will fail with an unhelpful error message.

To reproduce, start an inference server with:

python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --speculative_model ibm-fms/llama-13b-accelerator \
    --use-v2-block-manager \
    --enforce-eager

and then send a request via:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = True
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=stream)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

The client will see:

Traceback (most recent call last):
  File "/home/user/vllm/send_request.py", line 27, in <module>
    for c in completion:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/openai/_streaming.py", line 46, in __iter__
    for item in self._iterator:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/openai/_streaming.py", line 58, in __stream__
    for sse in iterator:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/openai/_streaming.py", line 50, in _iter_events
    yield from self._decoder.iter_bytes(self.response.iter_bytes())
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/openai/_streaming.py", line 280, in iter_bytes
    for chunk in self._iter_chunks(iterator):
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/openai/_streaming.py", line 291, in _iter_chunks
    for chunk in iterator:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/httpx/_models.py", line 829, in iter_bytes
    for raw_bytes in self.iter_raw():
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/httpx/_models.py", line 883, in iter_raw
    for raw_stream_bytes in self.stream:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/httpx/_client.py", line 126, in __iter__
    for chunk in self._stream:
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/httpx/_transports/default.py", line 112, in __iter__
    with map_httpcore_exceptions():
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

and the error on the server side is:

    | Traceback (most recent call last):
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
    |     await func()
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/starlette/responses.py", line 250, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/user/vllm/vllm/entrypoints/openai/serving_completion.py", line 222, in completion_stream_generator
    |     async for prompt_idx, res in result_generator:
    |   File "/home/user/vllm/vllm/utils.py", line 319, in consumer
    |     raise e
    |   File "/home/user/vllm/vllm/utils.py", line 310, in consumer
    |     raise item
    |   File "/home/user/vllm/vllm/utils.py", line 294, in producer
    |     async for item in iterator:
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 746, in generate
    |     async for output in self._process_request(
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 859, in _process_request
    |     raise e
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 855, in _process_request
    |     async for request_output in stream:
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 90, in __anext__
    |     raise result
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
    |     return_value = task.result()
    |                    ^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 595, in run_engine_loop
    |     result = task.result()
    |              ^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 540, in engine_step
    |     request_outputs = await self.engine.step_async(virtual_engine)
    |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/engine/async_llm_engine.py", line 241, in step_async
    |     output = await self.model_executor.execute_model_async(
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/executor/gpu_executor.py", line 122, in execute_model_async
    |     output = await make_async(self.driver_worker.execute_model
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    |     result = self.fn(*self.args, **self.kwargs)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/spec_decode/spec_decode_worker.py", line 338, in execute_model
    |     return self._run_no_spec(execute_model_req,
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/contextlib.py", line 81, in inner
    |     return func(*args, **kwds)
    |            ^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/spec_decode/spec_decode_worker.py", line 386, in _run_no_spec
    |     sampler_output = self.scorer_worker.execute_model(execute_model_req)
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/worker/worker_base.py", line 271, in execute_model
    |     output = self.model_runner.execute_model(
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/worker/model_runner.py", line 1245, in execute_model
    |     output: SamplerOutput = self.model.sample(
    |                             ^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/model_executor/models/llama.py", line 416, in sample
    |     next_tokens = self.sampler(logits, sampling_metadata)
    |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/miniforge3/envs/dev-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/model_executor/layers/sampler.py", line 96, in forward
    |     sample_results, maybe_sampled_tokens_tensor = _sample(
    |                                                   ^^^^^^^^
    |   File "/home/user/vllm/vllm/model_executor/layers/sampler.py", line 658, in _sample
    |     return _sample_with_torch(
    |            ^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/vllm/vllm/model_executor/layers/sampler.py", line 528, in _sample_with_torch
    |     sampled_token_ids_tensor[
    | RuntimeError: shape mismatch: value tensor of shape [2] cannot be broadcast to indexing result of shape [1, 1]
    +------------------------------------
Hongtao-Xu commented 2 months ago

I encountered the same bug with n=3.