vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Unable to use speculative decoding (KeyError: 40) #7907

ccdv-ai closed this issue 2 weeks ago

ccdv-ai commented 3 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 08-27 11:01:10 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40
GPU 1: NVIDIA L40
GPU 2: NVIDIA L40
GPU 3: NVIDIA L40

Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9124 16-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
Stepping: 1
BogoMIPS: 5991.11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 32 MiB (32 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63

Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.1+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.0.3
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.3
[pip3] triton==3.0.0
[pip3] zmq==0.0.0
[conda] blas 1.0 mkl
[conda] flashinfer 0.1.1+cu121torch2.3 pypi_0 pypi
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py39h5eee18b_1
[conda] mkl_fft 1.3.8 py39h5eee18b_0
[conda] mkl_random 1.2.4 py39hdb19cb5_0
[conda] numpy 1.26.4 py39h5f9d8c6_0
[conda] numpy-base 1.26.4 py39hb5e798b_0
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.555.43 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pynvml 11.5.3 pypi_0 pypi
[conda] pyzmq 26.0.3 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.43.3 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
[conda] zmq 0.0.0 pypi_0 pypi

ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   0-15,32-47    0              N/A
GPU1  NODE  X     SYS   SYS   0-15,32-47    0              N/A
GPU2  SYS   SYS   X     NODE  16-31,48-63   1              N/A
GPU3  SYS   SYS   NODE  X     16-31,48-63   1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I'm unable to use speculative decoding, regardless of the method (ngram or a draft model). Downgrading to the previous vLLM version does not fix the issue.

python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model models/Meta-Llama-3.1-70B-Instruct-FP8 \
    --dtype "auto" \
    --port 8000 \
    --seed 123 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --tensor-parallel-size 4 \
    --max-num-seqs 32 \
    --use-v2-block-manager \
    --max-log-len 20 \
    --served-model-name llama \
    --speculative_model "[ngram]" \
    --num_speculative_tokens 5 \
    --ngram_prompt_lookup_max 8 \
    --ngram_prompt_lookup_min 1
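
For reference, the same configuration can also be exercised through the offline `LLM` entry point. The following is a minimal sketch (not from the original report): it reuses the checkpoint path and the ngram settings from the command above, and the prompt is hypothetical.

```python
from vllm import LLM, SamplingParams

# Minimal offline sketch mirroring the server flags above; the prompt is
# hypothetical and the path assumes the same local FP8 checkpoint.
llm = LLM(
    model="models/Meta-Llama-3.1-70B-Instruct-FP8",
    tensor_parallel_size=4,
    max_model_len=32768,
    gpu_memory_utilization=0.92,
    max_num_seqs=32,
    use_v2_block_manager=True,
    seed=123,
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=8,
    ngram_prompt_lookup_min=1,
)

outputs = llm.generate(
    ["Explain speculative decoding in two sentences."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```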

The engine starts up without errors:

INFO 08-27 01:18:22 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/181452bd-e5ed-43fd-9b35-da2bf817442d for RPC Path.
INFO 08-27 01:18:22 api_server.py:161] Started engine process with PID 267269
WARNING 08-27 01:18:24 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
INFO 08-27 01:18:27 config.py:813] Defaulting to use mp for distributed inference
INFO 08-27 01:18:27 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='/home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8', speculative_config=SpeculativeConfig(draft_model='[ngram]', num_spec_tokens=5), tokenizer='/home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=123, served_model_name=llama, use_v2_block_manager=True, enable_prefix_caching=False)
WARNING 08-27 01:18:27 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 01:18:27 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 08-27 01:18:27 spec_decode_worker.py:161] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
INFO 08-27 01:18:27 spec_decode_worker.py:175] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:27 spec_decode_worker.py:161] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:27 spec_decode_worker.py:161] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:27 spec_decode_worker.py:175] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:27 spec_decode_worker.py:175] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:27 spec_decode_worker.py:161] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:27 spec_decode_worker.py:175] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:28 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:29 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 01:18:29 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:29 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:29 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 01:18:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=267340) WARNING 08-27 01:18:29 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-27 01:18:29 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=267338) WARNING 08-27 01:18:29 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=267339) WARNING 08-27 01:18:29 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-27 01:18:29 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa9d96da850>, local_subscribe_port=41297, remote_subscribe_port=None)
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:29 model_runner.py:879] Starting to load model /home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8...
INFO 08-27 01:18:29 model_runner.py:879] Starting to load model /home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8...
WARNING 08-27 01:18:29 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=267340) WARNING 08-27 01:18:29 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:29 model_runner.py:879] Starting to load model /home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8...
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:29 model_runner.py:879] Starting to load model /home/user/codes/models/Meta-Llama-3.1-70B-Instruct-FP8...
(VllmWorkerProcess pid=267338) WARNING 08-27 01:18:29 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=267339) WARNING 08-27 01:18:29 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.

Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:07,  1.88it/s]

Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:01<00:07,  1.80it/s]

Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:01<00:06,  1.83it/s]

Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:02<00:06,  1.82it/s]

Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:02<00:05,  1.87it/s]

Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:03<00:04,  1.93it/s]

Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:03<00:03,  2.14it/s]

Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:03<00:03,  2.17it/s]

Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:04<00:02,  2.09it/s]

Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:05<00:02,  2.00it/s]

Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:05<00:01,  2.32it/s]

Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:05<00:01,  2.28it/s]

Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:06<00:00,  2.18it/s]

Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:06<00:00,  2.05it/s]

Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:07<00:00,  2.00it/s]

Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:07<00:00,  2.03it/s]

(VllmWorkerProcess pid=267338) INFO 08-27 01:18:37 model_runner.py:890] Loading model weights took 16.9520 GB
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:37 model_runner.py:890] Loading model weights took 16.9520 GB
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:37 model_runner.py:890] Loading model weights took 16.9520 GB
INFO 08-27 01:18:37 model_runner.py:890] Loading model weights took 16.9520 GB
INFO 08-27 01:18:43 distributed_gpu_executor.py:56] # GPU blocks: 17015, # CPU blocks: 3276
INFO 08-27 01:18:46 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-27 01:18:46 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:46 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:46 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:47 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:47 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:47 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:47 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=267338) INFO 08-27 01:18:51 model_runner.py:1300] Graph capturing finished in 4 secs.
(VllmWorkerProcess pid=267340) INFO 08-27 01:18:51 model_runner.py:1300] Graph capturing finished in 4 secs.
(VllmWorkerProcess pid=267339) INFO 08-27 01:18:51 model_runner.py:1300] Graph capturing finished in 4 secs.
INFO 08-27 01:18:51 model_runner.py:1300] Graph capturing finished in 4 secs.
INFO 08-27 01:18:51 api_server.py:209] vLLM to use /tmp/tmp2wkzg5n8 as PROMETHEUS_MULTIPROC_DIR
WARNING 08-27 01:18:51 serving_embedding.py:188] embedding_mode is False. Embedding API will not work.
INFO 08-27 01:18:51 launcher.py:20] Available routes are:
INFO 08-27 01:18:51 launcher.py:28] Route: /openapi.json, Methods: HEAD, GET
INFO 08-27 01:18:51 launcher.py:28] Route: /docs, Methods: HEAD, GET
INFO 08-27 01:18:51 launcher.py:28] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-27 01:18:51 launcher.py:28] Route: /redoc, Methods: HEAD, GET
INFO 08-27 01:18:51 launcher.py:28] Route: /health, Methods: GET
INFO 08-27 01:18:51 launcher.py:28] Route: /tokenize, Methods: POST
INFO 08-27 01:18:51 launcher.py:28] Route: /detokenize, Methods: POST
INFO 08-27 01:18:51 launcher.py:28] Route: /v1/models, Methods: GET
INFO 08-27 01:18:51 launcher.py:28] Route: /version, Methods: GET
INFO 08-27 01:18:51 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 08-27 01:18:51 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 08-27 01:18:51 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 08-27 01:18:51 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [267198]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
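
The failure is triggered by the first request that reaches the server. Below is a minimal sketch of such a request against the `/v1/chat/completions` route listed in the log above; the prompt is hypothetical, and the model name `llama` matches `--served-model-name`.

```python
import requests

# Hypothetical request; any completion request against the running server
# triggers the failure. "llama" matches the --served-model-name flag above.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "Write a short haiku about GPUs."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.status_code, resp.json())
```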

The following error appears after a request is received (the same failure occurs both with [ngram] and with a draft model):

ERROR 08-27 01:19:33 async_llm_engine.py:65] Engine background task failed
ERROR 08-27 01:19:33 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return_value = task.result()
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
ERROR 08-27 01:19:33 async_llm_engine.py:65]     result = task.result()
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
ERROR 08-27 01:19:33 async_llm_engine.py:65]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 337, in step_async
ERROR 08-27 01:19:33 async_llm_engine.py:65]     output = await self.model_executor.execute_model_async(
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return await self._driver_execute_model_async(execute_model_req)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 224, in _driver_execute_model_async
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return await self.driver_exec_model(execute_model_req)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/concurrent/futures/thread.py", line 58, in run
ERROR 08-27 01:19:33 async_llm_engine.py:65]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return func(*args, **kwargs)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 404, in execute_model
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return self._run_speculative_decoding_step(execute_model_req,
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/contextlib.py", line 79, in inner
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return func(*args, **kwds)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 583, in _run_speculative_decoding_step
ERROR 08-27 01:19:33 async_llm_engine.py:65]     proposal_scores = self.scorer.score_proposals(
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/contextlib.py", line 79, in inner
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return func(*args, **kwds)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/batch_expansion.py", line 85, in score_proposals
ERROR 08-27 01:19:33 async_llm_engine.py:65]     target_sampler_output = self._scorer_worker.execute_model(
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
ERROR 08-27 01:19:33 async_llm_engine.py:65]     output = self.model_runner.execute_model(
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-27 01:19:33 async_llm_engine.py:65]     return func(*args, **kwargs)
ERROR 08-27 01:19:33 async_llm_engine.py:65]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1399, in execute_model
ERROR 08-27 01:19:33 async_llm_engine.py:65]     model_executable = self.graph_runners[virtual_engine][
ERROR 08-27 01:19:33 async_llm_engine.py:65] KeyError: 40
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: 40, Traceback (most recent call last):
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 411, in start_worker_execution_loop
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     while self._run_non_driver_rank():
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 548, in _run_non_driver_rank
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     self.scorer_worker.execute_model()
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1399, in execute_model
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     model_executable = self.graph_runners[virtual_engine][
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] KeyError: 40
(VllmWorkerProcess pid=267338) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] 
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: 40, Traceback (most recent call last):
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 411, in start_worker_execution_loop
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     while self._run_non_driver_rank():
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/spec_decode/spec_decode_worker.py", line 548, in _run_non_driver_rank
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     self.scorer_worker.execute_model()
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]   File "/opt/anaconda/envs/transformers/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1399, in execute_model
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226]     model_executable = self.graph_runners[virtual_engine][
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] KeyError: 40
(VllmWorkerProcess pid=267340) ERROR 08-27 01:19:33 multiproc_worker_utils.py:226] 
Exception in callback functools.partial(<function _log_task_completion at 0x7fa9f6f11310>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa9dab04fa0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fa9f6f11310>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa9dab04fa0>>)>
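
The traceback ends in the CUDA graph lookup in `model_runner.py` (`self.graph_runners[virtual_engine][...]`), which is keyed by the padded batch size of the captured graphs. The sketch below is illustrative only, not vLLM's actual code, and assumes the likely mechanism: speculative decoding's batch expansion pads the scorer batch to a size (here 40) that was never captured, so the dictionary lookup raises `KeyError`.

```python
# Illustrative sketch of the suspected failure mode; not vLLM's implementation.
# CUDA graphs are captured for a fixed set of padded batch sizes, and looking
# up a size that was never captured raises KeyError, as in the traceback.

# Hypothetical capture list, bounded by --max-num-seqs 32 from the launch flags.
captured_batch_sizes = [1, 2, 4, 8, 16, 24, 32]
graph_runners = {size: f"<captured graph for bs={size}>" for size in captured_batch_sizes}


def lookup_graph(padded_batch_size: int):
    """Mimics the self.graph_runners[virtual_engine][batch_size] access."""
    try:
        return graph_runners[padded_batch_size]
    except KeyError:
        # Batch expansion during speculative scoring can pad the batch beyond
        # the largest captured size; falling back to eager execution here
        # would avoid the crash.
        return None


print(lookup_graph(32))  # captured size -> returns a graph runner
print(lookup_graph(40))  # uncaptured size -> None (the "KeyError: 40" case)
```

Under that reading, launching with `--enforce-eager` (the flag referenced in the startup log above) would sidestep the captured-graph path entirely and may serve as an interim workaround, at some cost in decode speed.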


youkaichao commented 2 weeks ago

Should be solved by https://github.com/vllm-project/vllm/pull/7894.