Open · rdaiello opened this issue 4 months ago
same +1
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
This seems similar to #6192, but I confirmed that the proper version of FlashInfer is installed.
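For reference, a quick way to double-check which FlashInfer build the serving environment actually imports (a minimal sketch; it assumes the wheel is importable as `flashinfer` and exposes `__version__`, which may differ between releases):

```bash
# Sketch: confirm which FlashInfer build this environment imports.
# Assumes the package is named "flashinfer" and exposes __version__;
# adjust if your wheel reports its version differently.
python -c "import flashinfer, torch; print('flashinfer:', flashinfer.__version__); print('torch:', torch.__version__, 'cuda:', torch.version.cuda)"
pip show flashinfer | head -n 2
```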
Command
```bash
export VLLM_ATTENTION_BACKEND=FLASHINFER
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-27b-it \
    --tensor-parallel-size 2
```
The model fails to load with the following error:
```
(VllmWorkerProcess pid=4091308) ERROR 07-12 15:20:49 multiproc_worker_utils.py:226]
rank0: Traceback (most recent call last):
rank0:   File "/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rank0:     return _run_code(code, main_globals, None,
rank0:   File "/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/runpy.py", line 86, in _run_code
rank0:     exec(code, run_globals)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
rank0:     engine = AsyncLLMEngine.from_engine_args(
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
rank0:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
rank0:     driver_worker_output = driver_worker_method(*args, **kwargs)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1339, in capture
rank0:     with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/cuda/graphs.py", line 184, in __exit__
rank0:   File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/cuda/graphs.py", line 82, in capture_end
rank0: RuntimeError: CUDA error: out of memory
rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank0: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ERROR 07-12 15:20:50 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 4091308 died, exit code: -15
INFO 07-12 15:20:50 multiproc_worker_utils.py:123] Killing local vLLM worker processes
rank0:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
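Note that the OOM is raised inside `capture_model`, i.e. during CUDA graph capture after the KV cache has already been allocated. A workaround sketch to narrow this down, assuming the standard vLLM server flags `--enforce-eager` and `--gpu-memory-utilization` are available in this build (this only isolates the OOM, it is not a confirmed fix for the FlashInfer path):

```bash
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Workaround sketch (assumption, not a confirmed fix): skip CUDA graph capture,
# where the OOM is raised, and leave more free memory during warm-up.
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-27b-it \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --gpu-memory-utilization 0.85
```

If the model loads with `--enforce-eager`, that would suggest the OOM is specific to graph-capture memory rather than to weight or KV-cache sizing.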