vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TypeError: 'NoneType' object is not callable when start Gemma2-27b-it #6445

Open · candowu opened this issue 2 months ago

candowu commented 2 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.17

Python version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    2
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7K62 48-Core Processor
Stepping:              0
CPU MHz:               2600.000
CPU max MHz:           2600.0000
CPU min MHz:           1500.0000
BogoMIPS:              5190.52
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-47,96-143
NUMA node1 CPU(s):     48-95,144-191
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate ssbd rsb_ctxsw ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu118torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2+cu118 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  NIC0  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0   X    NV12  NV12  NV12  PXB   0-47,96-143    0              N/A
GPU1  NV12   X    NV12  NV12  SYS   48-95,144-191  1              N/A
GPU2  NV12  NV12   X    NV12  SYS   48-95,144-191  1              N/A
GPU3  NV12  NV12  NV12   X    SYS   48-95,144-191  1              N/A
NIC0  PXB   SYS   SYS   SYS    X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

🐛 Describe the bug

(VllmWorkerProcess pid=120786) INFO 07-15 21:21:13 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120786) WARNING 07-15 21:21:13 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:13 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120787) WARNING 07-15 21:21:13 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-15 21:21:13 selector.py:79] Using Flashinfer backend.
WARNING 07-15 21:21:13 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:13 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120788) WARNING 07-15 21:21:13 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:18 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-15 21:21:18 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-15 21:21:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:18 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:18 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:21 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/candowu/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-15 21:21:21 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/candowu/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:21 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/candowu/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:21 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/candowu/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-15 21:21:21 selector.py:79] Using Flashinfer backend.
WARNING 07-15 21:21:21 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:21 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:21 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120788) WARNING 07-15 21:21:21 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120786) WARNING 07-15 21:21:21 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:21 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=120787) WARNING 07-15 21:21:21 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
(VllmWorkerProcess pid=120788) INFO 07-15 21:21:30 model_runner.py:255] Loading model weights took 12.8146 GB
INFO 07-15 21:21:30 model_runner.py:255] Loading model weights took 12.8146 GB
(VllmWorkerProcess pid=120786) INFO 07-15 21:21:30 model_runner.py:255] Loading model weights took 12.8146 GB
(VllmWorkerProcess pid=120787) INFO 07-15 21:21:30 model_runner.py:255] Loading model weights took 12.8146 GB
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'NoneType' object is not callable, Traceback (most recent call last):
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 874, in profile_run
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in execute_model
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 336, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 277, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]                               ^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 221, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]                     ^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 162, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/attention/layer.py", line 94, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]   File "/workspace/poetry-cache-dir/virtualenvs/vllm-v0-5-1-E2tKv0jo-py3.11/lib/python3.11/site-packages/vllm/attention/backends/flashinfer.py", line 260, in forward
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]     output = flash_attn_varlen_func(
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=120787) ERROR 07-15 21:21:34 multiproc_worker_utils.py:226] TypeError: 'NoneType' object is not callable
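
For context on the traceback: the failing call is flash_attn_varlen_func inside vLLM's FlashInfer backend. In this vLLM release that symbol appears to be imported from the optional vllm-flash-attn package behind a try/except, so if the package is missing or built against a different torch, the name falls back to None and calling it raises exactly this TypeError. A minimal check of which of the relevant packages the serving environment can actually import (a sketch; the module names are taken from the versions discussed below):

import importlib.util

# Report whether the packages involved in this traceback resolve in the
# environment that launches vLLM (not necessarily the one pip targeted).
for pkg in ("vllm", "flashinfer", "vllm_flash_attn"):
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg:16s} -> {spec.origin if spec else 'NOT importable'}")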
mgoin commented 2 months ago

Hi @candowu, there are conflicting versions of PyTorch/Triton between your conda and pip environments; maybe that is the issue?

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu118torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2+cu118 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
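
A quick way to confirm which of those installs Python actually picks up at runtime (a small sketch; it assumes torch and triton are importable in the active environment and only prints versions and install locations):

import torch
import triton

# Print the version and on-disk location of the build that Python resolves;
# a cu118 path here, while pip reports the cu121 torch 2.3.0 wheel, would
# point to the conda/pip mix-up noted above.
print("torch :", torch.__version__, "| CUDA", torch.version.cuda, "|", torch.__file__)
print("triton:", triton.__version__, "|", triton.__file__)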
ArlanCooper commented 2 months ago

Same issue, +1.

duguwanglong commented 2 months ago

Same issue with Gemma2-7b-it, +1.

weiminw commented 2 months ago

same issue, +1

mgoin commented 2 months ago

Make sure you have an up-to-date vLLM, vllm-flash-attn, and flashinfer installed:

> pip list | grep "vllm\|flash"
flashinfer                        0.0.9+cu121torch2.3
vllm                              0.5.2
vllm-flash-attn                   2.5.9.post1

The model loads and infers fine for me with this configuration:

> VLLM_ATTENTION_BACKEND=FLASHINFER python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm import LLM
>>> model = LLM("google/gemma-2-27b-it")
WARNING 07-16 16:19:50 utils.py:558] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 07-16 16:19:50 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='google/gemma-2-27b-it', speculative_config=None, tokenizer='google/gemma-2-27b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=google/gemma-2-27b-it, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-16 16:19:51 selector.py:79] Using Flashinfer backend.
INFO 07-16 16:19:53 selector.py:79] Using Flashinfer backend.
INFO 07-16 16:19:54 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-16 16:20:02 model_runner.py:266] Loading model weights took 50.8043 GB
INFO 07-16 16:20:03 gpu_executor.py:86] # GPU blocks: 3072, # CPU blocks: 712
INFO 07-16 16:20:05 model_runner.py:1007] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-16 16:20:05 model_runner.py:1011] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-16 16:20:13 model_runner.py:1208] Graph capturing finished in 8 secs.
>>> model.generate("Hello!")
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.10it/s, est. speed input: 6.30 toks/s, output: 33.62 toks/s]
[RequestOutput(request_id=0, prompt='Hello!', prompt_token_ids=[2, 4521, 235341], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' I am so excited to share this recipe with you. It is so simple yet', token_ids=(590, 1144, 712, 12826, 577, 4638, 736, 14758, 675, 692, 235265, 1165, 603, 712, 3890, 3599), cumulative_logprob=-19.540222738403827, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721146833.3287368, last_token_time=1721146833.3287368, first_scheduled_time=1721146833.344057, first_token_time=1721146833.4124975, time_in_queue=0.015320301055908203, finished_time=1721146833.8196554), lora_request=None)]
HelloCard commented 1 month ago

Same issue with ModelCloud/gemma-2-27b-it-gptq-4bit and export VLLM_ATTENTION_BACKEND=FLASHINFER (some old errlog). I will try to install flashinfer...

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3

(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# pip list | grep "vllm\|flash"
flashinfer                        0.1.1+cu121torch2.3
vllm                              0.5.2
vllm-flash-attn                   2.5.9.post1

(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/gemma-2-27b-it-gptq-4bit --quantization gptq --max-model-len 4096 --tensor-parallel-size 2 --max-num-seqs=1 --gpu-memory-utilization 0.85 --dtype=half
INFO 07-22 22:39:35 api_server.py:212] vLLM API server version 0.5.2
INFO 07-22 22:39:35 api_server.py:213] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization='gptq', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 07-22 22:39:35 config.py:1378] Casting torch.bfloat16 to torch.float16.
WARNING 07-22 22:39:35 utils.py:558] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-22 22:39:35 config.py:241] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-22 22:39:35 config.py:695] Defaulting to use mp for distributed inference
INFO 07-22 22:39:35 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', speculative_config=None, tokenizer='/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/e/Code/models/gemma-2-27b-it-gptq-4bit, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-22 22:39:36 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=6989) WARNING 07-22 22:39:36 utils.py:558] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
WARNING 07-22 22:39:36 utils.py:558] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:36 selector.py:79] Using Flashinfer backend.
INFO 07-22 22:39:36 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:36 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:36 utils.py:737] Found nccl from library libnccl.so.2
INFO 07-22 22:39:36 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:36 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-22 22:39:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:37 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 07-22 22:39:37 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 07-22 22:39:37 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=6989) INFO 07-22 22:39:37 selector.py:79] Using Flashinfer backend.
(VllmWorkerProcess pid=6989) INFO 07-22 22:43:37 model_runner.py:266] Loading model weights took 7.6332 GB
INFO 07-22 22:43:38 model_runner.py:266] Loading model weights took 7.6332 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 520, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 336, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 277, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 221, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:                     ^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 162, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/attention/layer.py", line 96, in forward
[rank0]:     return self.impl.forward(query,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm/attention/backends/flashinfer.py", line 266, in forward
[rank0]:     output = flash_attn_varlen_func(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
[rank0]:     return FlashAttnVarlenFunc.apply(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
[rank0]:     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
[rank0]:                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
[rank0]:     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
[rank0]:                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: FlashAttention only supports Ampere GPUs or newer., Traceback (most recent call last):
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923, in profile_run
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341, in execute_model
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 336, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 277, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]                               ^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 221, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]                     ^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 162, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/attention/layer.py", line 96, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return self.impl.forward(query,
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm/attention/backends/flashinfer.py", line 266, in forward
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     output = flash_attn_varlen_func(
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return FlashAttnVarlenFunc.apply(
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=6989) ERROR 07-22 22:43:39 multiproc_worker_utils.py:226]   File "/root/miniconda3/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

"FlashAttention only supports Ampere GPUs or newer." No! I just want to run gemma-2-27b-it-gptq-4bit, why is this so difficult?
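
This "Ampere GPUs or newer" error comes from the vllm-flash-attn kernels rather than from the GPTQ checkpoint itself. A quick way to see what the GPUs report (a sketch, run in the same environment; compute capability 8.0 corresponds to Ampere):

import torch

# List each visible GPU with its compute capability; the vllm-flash-attn
# varlen kernels used by the FlashInfer prefill path need sm_80 (Ampere) or newer.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    tag = "Ampere or newer" if major >= 8 else "pre-Ampere: this path will fail"
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor}, {tag})")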

dogeeelin commented 1 month ago

Same issue, +1.

kzos commented 1 month ago

Installing flashinfer and exporting its path worked.
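
For reference, the working setup shown in the maintainer's comment above can also be driven from a short script like this (a sketch: it assumes flashinfer and vllm-flash-attn are installed and the GPUs are Ampere-class; the model name and tensor_parallel_size are illustrative):

import os

# Select the FlashInfer attention backend before vLLM is imported, mirroring
# the VLLM_ATTENTION_BACKEND=FLASHINFER invocation earlier in this thread.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM  # noqa: E402

llm = LLM("google/gemma-2-27b-it", tensor_parallel_size=4)
print(llm.generate("Hello!"))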