vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: When debugging with vLLM, a CUDA error occurs. #7817

Open kinglion811 opened 3 weeks ago

kinglion811 commented 3 weeks ago

vLLM version: latest

I added some logging in vllm/model_executor/models/llama.py because I want to print the attention output, like this:

[screenshot of the added logging code in llama.py]
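The screenshot is not preserved here. As an illustration only, the kind of edit described would look roughly like the sketch below; the class layout, attribute names, and method signature are assumptions about vllm/model_executor/models/llama.py and may not match the installed vLLM version or the actual code in the screenshot.

```python
# Hypothetical sketch (not the code from the screenshot): a debug print added
# around the attention call in LlamaAttention.forward.
class LlamaAttention(nn.Module):
    def forward(self, positions, hidden_states, kv_cache, attn_metadata):
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
        q, k = self.rotary_emb(positions, q, k)
        attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
        print("attention output:", attn_output)  # debug print added for inspection
        output, _ = self.o_proj(attn_output)
        return output
```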

When I start the LLM server, the error is:

```
[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/bin/vllm", line 8, in <module>
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 148, in main
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 28, in serve
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 467, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 548, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 220, in initialize_cache
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 236, in _warm_up_model
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1173, in capture_model
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1410, in capture
[rank0]:     with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 184, in __exit__
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 82, in capture_end
[rank0]: RuntimeError: CUDA error: operation failed due to a previous error during capture
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

How would you like to use vllm

I want to debug vllm

kinglion811 commented 3 weeks ago

@youkaichao @WoosukKwon

jeejeelee commented 3 weeks ago

print cannot be used during CUDA graph capture. You can try setting enforce_eager=True.
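For context on why this fails: printing a GPU tensor forces a device-to-host copy and a stream synchronization, and CUDA forbids synchronizing a stream while a graph is being captured, which is what surfaces as the "operation failed due to a previous error during capture" error above. A minimal standalone sketch of the same failure, independent of vLLM (requires a CUDA GPU):

```python
import torch

x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * 2
    # Printing a CUDA tensor copies it to the host and synchronizes the stream,
    # which is not allowed during capture, so this raises a CUDA RuntimeError.
    print(y)
```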

kinglion811 commented 3 weeks ago

@jeejeelee It works when I set --enforce-eager.
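For anyone hitting the same thing: the --enforce-eager server flag corresponds to enforce_eager=True in the Python API. It disables CUDA graph capture, so debug prints inside model code run in normal eager mode, at the cost of some decoding throughput. A small usage sketch (the model name is only a placeholder):

```python
from vllm import LLM, SamplingParams

# enforce_eager=True disables CUDA graph capture, so prints added inside the
# model code execute normally while debugging.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enforce_eager=True)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```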