vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: out must have shape (total_q, num_heads, head_size_og) #5499

Open zhihui96 opened 1 month ago

zhihui96 commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe

Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7452 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2350.0000
CPU min MHz: 1500.0000
BogoMIPS: 4699.91
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE 32-63,96-127 1 N/A
GPU1 NODE X NODE 32-63,96-127 1 N/A
NIC0 NODE NODE X

Legend:

  X = Self
  SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX = Connection traversing at most a single PCIe bridge
  NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

🐛 Describe the bug

I ran vLLM in Kubernetes with the following YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.0
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
          env:
          - name: VLLM_PORT
             value: "9000"
          volumeMounts:
          - name: shm
            mountPath: /dev/shm
          resources:
            limits:
              nvidia.com/gpu: 2
          command: ["python3"]
          args: ["-m", "vllm.entrypoints.api_server", "--model", "/path/to/llama2-70b", "--tensor-parallel-size", "2", "--enable-prefix-caching"]
      nodeSelector:
        node_type: A_GPU
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 12Gi

Then I conducted throughput testing against it; the error occurs sporadically, and I cannot reproduce it consistently. A sketch of the kind of client I used is below, followed by the error.
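For reference, here is a minimal, hypothetical stand-in for my benchmark client (not the actual script) that produces the same kind of concurrent, shared-prefix load against the /generate endpoint seen in the log. It uses only the Python standard library; the URL, prompt, and concurrency numbers are illustrative:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/generate"  # adjust host/port for your Service
PREFIX = "You are a helpful assistant. " * 20  # shared prefix, exercises --enable-prefix-caching

def one_request(i: int) -> int:
    # POST a prompt with a long shared prefix; the api_server's /generate
    # takes "prompt" plus sampling parameters such as "max_tokens".
    payload = json.dumps({
        "prompt": PREFIX + f"Question {i}: say something.",
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Keep many overlapping requests in flight so batches mix prefill and decode.
with ThreadPoolExecutor(max_workers=32) as pool:
    statuses = list(pool.map(one_request, range(256)))
print(f"{statuses.count(200)}/{len(statuses)} requests succeeded")

Under this kind of load, the engine sporadically dies with: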

ERROR 06-13 09:55:29 async_llm_engine.py:52] Engine background task failed
ERROR 06-13 09:55:29 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 529, in run_engine_loop
ERROR 06-13 09:55:29 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return fut.result()
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 503, in engine_step
ERROR 06-13 09:55:29 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-13 09:55:29 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return await self._driver_execute_model_async(execute_model_req)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return await self.driver_exec_model(execute_model_req)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-13 09:55:29 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
ERROR 06-13 09:55:29 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
ERROR 06-13 09:55:29 async_llm_engine.py:52]     hidden_states = model_executable(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     hidden_states, residual = layer(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 227, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     hidden_states = self.self_attn(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 161, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 338, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     flash_attn_varlen_func(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return FlashAttnVarlenFunc.apply(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
ERROR 06-13 09:55:29 async_llm_engine.py:52]     return super().apply(*args, **kwargs)  # type: ignore[misc]
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
ERROR 06-13 09:55:29 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
ERROR 06-13 09:55:29 async_llm_engine.py:52]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
ERROR 06-13 09:55:29 async_llm_engine.py:52] RuntimeError: out must have shape (total_q, num_heads, head_size_og)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: out must have shape (total_q, num_heads, head_size_og), Traceback (most recent call last):
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 309, in _execute_model_non_driver
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     self.model_runner.execute_model(None, self.gpu_cache)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     hidden_states = model_executable(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 227, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 161, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 338, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     flash_attn_varlen_func(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return FlashAttnVarlenFunc.apply(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226]     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226] RuntimeError: out must have shape (total_q, num_heads, head_size_og)
(VllmWorkerProcess pid=7288) ERROR 06-13 09:55:29 multiproc_worker_utils.py:226] 
Exception in callback functools.partial(<function _log_task_completion at 0x7efcb98b7f40>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7efd4736ae60>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7efcb98b7f40>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7efd4736ae60>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 529, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 503, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 227, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 161, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 338, in forward
    flash_attn_varlen_func(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: out must have shape (total_q, num_heads, head_size_og)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 06-13 09:55:29 async_llm_engine.py:167] Aborted request 4453d57709b9400b828a1501ac0039dd.
INFO:     127.0.0.6:60827 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/api_server.py", line 68, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 670, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 777, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 773, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 529, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 503, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 227, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 161, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 338, in forward
    flash_attn_varlen_func(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: out must have shape (total_q, num_heads, head_size_og)
guliujian commented 1 month ago

I also get the same bug on 4 A10 cards with Qwen2-72B-Instruct-GPTQ-Int4, running with --gpu-memory-utilization=0.9 --enable-prefix-caching.

simon-mo commented 1 month ago

We pushed a hotfix to the published image; please re-pull and redeploy. For other installation methods, you can work around it by setting VLLM_ATTENTION_BACKEND=XFORMERS (example below). This is also fixed on the main branch, and we will push out a patch release soon.
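For the Kubernetes Deployment above, that workaround is one extra entry in the container's env list; a sketch showing only the env section:

env:
- name: VLLM_PORT
  value: "9000"
# Workaround until the fixed image is pulled: force the xformers attention backend.
- name: VLLM_ATTENTION_BACKEND
  value: "XFORMERS"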

zhihui96 commented 1 month ago

That works for me, thanks.

sfc-gh-zhwang commented 2 weeks ago

@simon-mo can you link the PR for the hotfix? Is it https://github.com/vllm-project/vllm/pull/5476?