vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Speculative decoding]: `AttributeError: 'NoneType' object has no attribute 'numel'` when exceeding draft context length #5342

Open zhangxy1234 opened 3 months ago

zhangxy1234 commented 3 months ago

Your current environment

vllm-0.4.3

πŸ› Describe the bug

When I use speculative mode and prompt_length + output_length > 2048, the error occurs.

When I use speculative mode with the following setup, this error is produced:

Base model: llama2-70B
Speculative (draft) model: llama-1.1B

```python
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype=torch.float16,
    use_v2_block_manager=True,
)
```

The prompt is 2040 tokens and the output is 50 tokens; the error occurs once prompt_length + output_length > 2048.
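
For reference, a minimal end-to-end sketch of the same setup (a sketch only: `base_path`, `draft_path`, and `long_prompt` are placeholders, and the keyword arguments simply mirror the `EngineArgs` above, passed through vLLM's offline `LLM` entrypoint as of 0.4.3):

```python
from vllm import LLM, SamplingParams

# Placeholders -- substitute real checkpoints and a real ~2040-token prompt.
base_path = "/path/to/llama2-70b"
draft_path = "/path/to/llama-1.1b-draft"
long_prompt = "..."  # roughly 2040 tokens once tokenized

llm = LLM(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype="float16",
    use_v2_block_manager=True,
)

# Generating 50 tokens pushes prompt + output past the draft model's
# 2048-token context, which is where the AttributeError shows up.
sampling_params = SamplingParams(max_tokens=50, ignore_eos=True)
outputs = llm.generate([long_prompt], sampling_params)
```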

Error:

```
(RayWorkerWrapper pid=142935) 07d92dd514d4:142935:143254 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_3:1 Got async event : GID table change
(RayWorkerWrapper pid=142935)
(RayWorkerWrapper pid=142935) 07d92dd514d4:142935:143254 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_3:1 Got async event : GID table change
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker_base.py", line 140, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 297, in start_worker_execution_loop [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     while self._run_non_driver_rank(): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 366, in _run_non_driver_rank [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self.proposer_worker.execute_model() [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 230, in execute_model [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self._execute_model_non_driver() [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 303, in _execute_model_non_driver [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]   File "/vllm-main/vllm/worker/worker.py", line 217, in cache_swap [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148]     if blocks_to_swap_in.numel() > 0: [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] AttributeError: 'NoneType' object has no attribute 'numel' [repeated 2x across cluster]
```

```
07d92dd514d4:135382:143236 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : GID table change
```

zhangxy1234 commented 3 months ago

@cadedaniel

cadedaniel commented 3 months ago

Hi @zhangxy1234. Can you confirm something for me -- what is the max context length supported by your draft model?

zhangxy1234 commented 3 months ago

> Hi @zhangxy1234. Can you confirm something for me -- what is the max context length supported by your draft model?

The draft model's max context length is 2048 and the base model's is 4096.
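
One way to double-check those limits (a sketch only: the paths are placeholders, and it assumes both checkpoints expose their context window as `max_position_embeddings`, as Llama-family configs do):

```python
from transformers import AutoConfig

base_cfg = AutoConfig.from_pretrained("/path/to/llama2-70b")   # placeholder path
draft_cfg = AutoConfig.from_pretrained("/path/to/llama-1.1b")  # placeholder path

print("base max context: ", base_cfg.max_position_embeddings)   # expected: 4096
print("draft max context:", draft_cfg.max_position_embeddings)  # expected: 2048
```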

With tp = 1, generation stops at 2048, but with tp > 1 it does not stop at 2048 and instead raises this error:

```
2024-06-15 10:58:41.179 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_seq_groups': 1, 'blocks_to_swap_in': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_swap_out': tensor([], size=(0, 2), dtype=torch.int64), 'blocks_to_copy': tensor([], device='cuda:1', size=(0, 2), dtype=torch.int64)}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.180 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in tensor([], size=(0, 2), dtype=torch.int64)
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | CRITICAL | vllm.worker.worker:_execute_model_non_driver:303 - data {'num_lookahead_slots': 5, 'disable_all_speculation': False}
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | INFO | vllm.worker.worker:cache_swap:220 - cache_swap blocks_to_swap_in None
(RayWorkerWrapper pid=7582) 2024-06-15 10:58:41.193 | ERROR | vllm.worker.worker_base:execute_method:148 - Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=7582) Traceback (most recent call last):
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 289, in <module>
(RayWorkerWrapper pid=7582)     worker.main_loop()
(RayWorkerWrapper pid=7582)     β”‚      β”” <function Worker.main_loop at 0x7fd2a5357040>
(RayWorkerWrapper pid=7582)     β”” <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/worker.py", line 876, in main_loop
(RayWorkerWrapper pid=7582)     self.core_worker.run_task_loop()
(RayWorkerWrapper pid=7582)     β”‚    β”‚           β”” <method 'run_task_loop' of 'ray._raylet.CoreWorker' objects>
(RayWorkerWrapper pid=7582)     β”‚    β”” <ray._raylet.CoreWorker object at 0x7fd2a42f5220>
(RayWorkerWrapper pid=7582)     β”” <ray._private.worker.Worker object at 0x7fd2a5350670>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/_private/function_manager.py", line 691, in actor_method_executor
(RayWorkerWrapper pid=7582)     return method(__ray_actor, *args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚                    β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚                    β”” ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582)            β”” <function WorkerWrapperBase.execute_method at 0x7fd2045aaa60>
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=7582)     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=7582)            β”‚      β”‚     β”‚       β”” {}
(RayWorkerWrapper pid=7582)            β”‚      β”‚     β”” ('start_worker_execution_loop',)
(RayWorkerWrapper pid=7582)            β”‚      β”” <vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2045ab760>
(RayWorkerWrapper pid=7582)            β”” <function WorkerWrapperBase.execute_method at 0x7fd204720820>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) > File "vllm-main/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=7582)     return executor(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚        β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚        β”” ()
(RayWorkerWrapper pid=7582)            β”” <bound method SpecDecodeWorker.start_worker_execution_loop of <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at...
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582)     return func(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚     β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚     β”” (<vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>,)
(RayWorkerWrapper pid=7582)            β”” <function SpecDecodeWorker.start_worker_execution_loop at 0x7fa4083899d0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 300, in start_worker_execution_loop
(RayWorkerWrapper pid=7582)     while self._run_non_driver_rank():
(RayWorkerWrapper pid=7582)           β”‚    β”” <function SpecDecodeWorker._run_non_driver_rank at 0x7fa408389d30>
(RayWorkerWrapper pid=7582)           β”” <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/spec_decode/spec_decode_worker.py", line 369, in _run_non_driver_rank
(RayWorkerWrapper pid=7582)     self.proposer_worker.execute_model()
(RayWorkerWrapper pid=7582)     β”‚    β”‚               β”” <function Worker.execute_model at 0x7fa4083863a0>
(RayWorkerWrapper pid=7582)     β”‚    β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.spec_decode_worker.SpecDecodeWorker object at 0x7fa40838c6a0>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582)     return func(*args, **kwargs)
(RayWorkerWrapper pid=7582)            β”‚     β”‚      β”” {}
(RayWorkerWrapper pid=7582)            β”‚     β”” (<vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>,)
(RayWorkerWrapper pid=7582)            β”” <function Worker.execute_model at 0x7fa408386310>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 236, in execute_model
(RayWorkerWrapper pid=7582)     self._execute_model_non_driver()
(RayWorkerWrapper pid=7582)     β”‚    β”” <function Worker._execute_model_non_driver at 0x7fa408386550>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 311, in _execute_model_non_driver
(RayWorkerWrapper pid=7582)     self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”‚                  β”‚                   β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”‚                  β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”‚          β”” None
(RayWorkerWrapper pid=7582)     β”‚    β”” <function Worker.cache_swap at 0x7fa408386280>
(RayWorkerWrapper pid=7582)     β”” <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)   File "vllm-main/vllm/worker/worker.py", line 223, in cache_swap
(RayWorkerWrapper pid=7582)     if blocks_to_swap_in.numel() > 0:
(RayWorkerWrapper pid=7582)        β”” None
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) AttributeError: 'NoneType' object has no attribute 'numel'
```
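
For illustration, the crash site is the `.numel()` check that the non-driver rank reaches in `cache_swap` after the broadcast payload stops carrying the swap tensors (per the log above, the data dict shrinks to `{'num_lookahead_slots': ..., 'disable_all_speculation': ...}` once the draft context is exceeded). Below is a minimal sketch of a defensive guard around that check -- an illustration of the failure mode only, not the actual upstream fix:

```python
from typing import Optional

import torch


def cache_swap_guarded(
    blocks_to_swap_in: Optional[torch.Tensor],
    blocks_to_swap_out: Optional[torch.Tensor],
    blocks_to_copy: Optional[torch.Tensor],
) -> None:
    """Sketch of a None-tolerant variant of the check that fails above.

    In the failing run the non-driver rank receives None for all three
    tensors, so `blocks_to_swap_in.numel()` raises AttributeError. Treating
    None like an empty tensor avoids the crash (hypothetical guard only).
    """
    if blocks_to_swap_in is not None and blocks_to_swap_in.numel() > 0:
        pass  # would swap blocks into the GPU KV cache here
    if blocks_to_swap_out is not None and blocks_to_swap_out.numel() > 0:
        pass  # would swap blocks out to CPU here
    if blocks_to_copy is not None and blocks_to_copy.numel() > 0:
        pass  # would copy blocks within the cache here
```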

@cadedaniel

njhill commented 2 months ago

@zhangxy1234 could you confirm whether you still encounter this error with the latest version of vLLM (0.5.3.post1)?