I added some logging in /vllm/model_executor/models/llama.py because I want to print the attention output, like this:
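Roughly, the logging I added inside the attention forward pass looks like the following (a minimal stand-in sketch; the class body and names are illustrative, not vLLM's actual code):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llama_debug")

class LlamaAttention:
    """Stand-in for vLLM's LlamaAttention; only the logging line is the point."""

    def forward(self, hidden_states):
        attn_output = hidden_states  # placeholder for the real attention computation
        # The kind of debug logging I added in llama.py:
        logger.info("attn_output=%r", attn_output)
        return attn_output
```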
When I start the LLM server, the error is:
rank0: During handling of the above exception, another exception occurred:
rank0: Traceback (most recent call last):
rank0: File "/usr/local/bin/vllm", line 8, in <module>
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 148, in main
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 28, in serve
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
rank0: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 467, in from_engine_args
rank0: engine = cls(
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
rank0: self.engine = self._init_engine(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 548, in _init_engine
rank0: return engine_class(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
rank0: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in initialize_cache
rank0: self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 220, in initialize_cache
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 236, in _warm_up_model
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1173, in capture_model
rank0: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1410, in capture
rank0: with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 184, in __exit__
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py", line 82, in capture_end
rank0: RuntimeError: CUDA error: operation failed due to a previous error during capture
rank0: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
vLLM version: latest
How would you like to use vllm
I want to debug vLLM.
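For context, the traceback shows the failure happens inside `capture_model` while capturing CUDA graphs, so one way to debug the added logging without graph capture is to start the server in eager mode using vLLM's `--enforce-eager` engine argument (a sketch; the model name is illustrative):

```shell
# Start the OpenAI-compatible server with CUDA graph capture disabled,
# so Python-side prints/logging added in llama.py run in eager mode.
vllm serve meta-llama/Llama-2-7b-hf --enforce-eager
```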