vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Speculative decoding]: `AttributeError: 'NoneType' object has no attribute 'numel'` when exceeding draft context length #5342

Open zhangxy1234 opened 3 months ago

zhangxy1234 commented 3 months ago

Your current environment

vllm-0.4.3

πŸ› Describe the bug

When I use speculative mode and prompt_length + output_length > 2048, the error occurs.

When I use speculative mode with the following setup, this error is produced:

Base model: llama2-70B
Speculative (draft) model: llama-1.1B

```python
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype=torch.float16,
    use_v2_block_manager=True,
)
```

The prompt is 2040 tokens and the output is 50 tokens; the error occurs once prompt_length + output_length > 2048.
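
For reference, a minimal end-to-end sketch of the same setup (a sketch only: `base_path`, `draft_path`, and `long_prompt` are placeholders, and the keyword arguments simply mirror the `EngineArgs` above, passed through vLLM's offline `LLM` entrypoint as of 0.4.3):

```python
from vllm import LLM, SamplingParams

# Placeholders -- substitute real checkpoints and a real ~2040-token prompt.
base_path = "/path/to/llama2-70b"
draft_path = "/path/to/llama-1.1b-draft"
long_prompt = "..."  # roughly 2040 tokens once tokenized

llm = LLM(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype="float16",
    use_v2_block_manager=True,
)

# Generating 50 tokens pushes prompt + output past the draft model's
# 2048-token context, which is where the AttributeError shows up.
sampling_params = SamplingParams(max_tokens=50, ignore_eos=True)
outputs = llm.generate([long_prompt], sampling_params)
```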

Error:

```
(RayWorkerWrapper pid=142935) 07d92dd514d4:142935:143254 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_3:1 Got async event : GID table change
(RayWorkerWrapper pid=142935)
(RayWorkerWrapper pid=142935) 07d92dd514d4:142935:143254 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_3:1 Got async event : GID table change
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker_base.py", line 140, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 297, in start_worker_execution_loop [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     while self._run_non_driver_rank(): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 366, in _run_non_driver_rank [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self.proposer_worker.execute_model() [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 230, in execute_model [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self._execute_model_non_driver() [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 303, in _execute_model_non_driver [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 217, in cache_swap [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     if blocks_to_swap_in.numel() > 0: [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] AttributeError: 'NoneType' object has no attribute 'numel' [repeated 2x across cluster]
```

```
07d92dd514d4:135382:143236 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : GID table change
```

zhangxy1234 commented 3 months ago

@cadedaniel

cadedaniel commented 3 months ago

Hi @zhangxy1234. Can you confirm something for me -- what is the max context length supported by your draft model?

zhangxy1234 commented 3 months ago

> Hi @zhangxy1234. Can you confirm something for me -- what is the max context length supported by your draft model?

The draft model's max context length is 2048 and the base model's is 4096.
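
One way to double-check those limits (a sketch only: the paths are placeholders, and it assumes both checkpoints expose their context window as `max_position_embeddings`, as Llama-family configs do):

```python
from transformers import AutoConfig

base_cfg = AutoConfig.from_pretrained("/path/to/llama2-70b")   # placeholder path
draft_cfg = AutoConfig.from_pretrained("/path/to/llama-1.1b")  # placeholder path

print("base max context: ", base_cfg.max_position_embeddings)   # expected: 4096
print("draft max context:", draft_cfg.max_position_embeddings)  # expected: 2048
```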

With tp = 1, generation stops at 2048, but with tp > 1 it does not stop at 2048 and instead raises this error:

```
2024-06-15 10:58:41.179 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_seq_groups': 1, 'blocks_to_swap_in': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_swap_out': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_copy': tensor([], device='cuda:1', size=(0, 2), dtype=torch.int64)}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.180 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in tensor([], size=(0, 2), dtype=torch.int64)
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_lookahead_slots': 5, 'disable_all_speculation': False}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in None
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | ERROR | vllm.worker.worker_base:execute_method:148 - Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=7582) Traceback (most recent call last):
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 289, in <module>
(RayWorkerWrapper pid=7582)     worker.main_loop()
(RayWorkerWrapper pid=7582)     β”‚      β”” <function Worker.main_loop at 0x7fd2a5357040>
(RayWorkerWrapper pid=7582)     β”” <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/worker.py", line 876, in main_loop
(RayWorkerWrapper pid=7582)     self.core_worker.run_task_loop()
(RayWorkerWrapper pid=7582)     β”‚    β”‚           β”” <method 'run_task_loop' of 'ray._raylet.CoreWorker' objects>
(RayWorkerWrapper pid=7582)     β”‚    β”” <ray._raylet.CoreWorker object at 0x7fd2a42f5220>
(RayWorkerWrapper pid=7582)     β”” <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/function_manager.py", line 691, in actor_method_executor
(RayWorkerWrapper pid=7582)     return method(__ray_actor, *args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚                    β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚                    β”” ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582)            β”” <function WorkerWrapperBase.execute_method at 0x7fd2045aaa60>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=7582)     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=7582)            β”‚      β”‚     β”‚       β”” {}
(RayWorkerWrapper pid=7582)            β”‚      β”‚     β”” ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582)            β”‚      β”” <vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2045ab760>
(RayWorkerWrapper pid=7582)            β”” <function WorkerWrapperBase.execute_method at 0x7fd204720820>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) > File "vllm-main/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=7582)     return executor(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚        β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚        β”” ()
(RayWorkerWrapper pid=7582)            β”” <bound method SpecDecodeWorker.start_worker_execution_loop of <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at...
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582)     return func(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚     β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚     β”” (<vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>,)
(RayWorkerWrapper pid=7582)            β”” <function SpecDecodeWorker.start_worker_execution_loop at 0x7fa4083899d0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 300, in start_worker_execution_loop
(RayWorkerWrapper pid=7582)     while self._run_non_driver_rank():
(RayWorkerWrapper pid=7582)           β”‚    β”” <function SpecDecodeWorker._run_non_driver_rank at 0x7fa408389d30>
(RayWorkerWrapper pid=7582)           β”” <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 369, in _run_non_driver_rank
(RayWorkerWrapper pid=7582)     self.proposer_worker.execute_model()
(RayWorkerWrapper pid=7582)     β”‚    β”‚               β”” <function Worker.execute_model at 0x7fa4083863a0>
(RayWorkerWrapper pid=7582)     β”‚    β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582)     return func(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚     β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚     β”” (<vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>,)
(RayWorkerWrapper pid=7582)            β”” <function Worker.execute_model at 0x7fa408386310>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 236, in execute_model
(RayWorkerWrapper pid=7582)     self._execute_model_non_driver()
(RayWorkerWrapper pid=7582)     β”‚    β”” <function Worker._execute_model_non_driver at 0x7fa408386550>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 311, in _execute_model_non_driver
(RayWorkerWrapper pid=7582)     self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”‚                  β”‚                   β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”‚                  β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”” <function Worker.cache_swap at 0x7fa408386280>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 223, in cache_swap
(RayWorkerWrapper pid=7582)     if blocks_to_swap_in.numel() > 0:
(RayWorkerWrapper pid=7582)        β”” None
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) AttributeError: 'NoneType' object has no attribute 'numel'
```
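
For illustration, the crash site is the `.numel()` check that the non-driver rank reaches in `cache_swap` after the broadcast payload stops carrying the swap tensors (per the log above, the data dict shrinks to `{'num_lookahead_slots': ..., 'disable_all_speculation': ...}` once the draft context is exceeded). Below is a minimal sketch of a defensive guard around that check -- an illustration of the failure mode only, not the actual upstream fix:

```python
from typing import Optional

import torch


def cache_swap_guarded(
    blocks_to_swap_in: Optional[torch.Tensor],
    blocks_to_swap_out: Optional[torch.Tensor],
    blocks_to_copy: Optional[torch.Tensor],
) -> None:
    """Sketch of a None-tolerant variant of the check that fails above.

    In the failing run the non-driver rank receives None for all three
    tensors, so `blocks_to_swap_in.numel()` raises AttributeError. Treating
    None like an empty tensor avoids the crash (hypothetical guard only).
    """
    if blocks_to_swap_in is not None and blocks_to_swap_in.numel() > 0:
        pass  # would swap blocks into the GPU KV cache here
    if blocks_to_swap_out is not None and blocks_to_swap_out.numel() > 0:
        pass  # would swap blocks out to CPU here
    if blocks_to_copy is not None and blocks_to_copy.numel() > 0:
        pass  # would copy blocks within the cache here
```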

@cadedaniel

njhill commented 2 months ago

@zhangxy1234 could you confirm whether you still encounter this error with the latest version of vLLM (0.5.3.post1)?