vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm backend with triton server is not working #6697

Open nirmesh opened 1 month ago

nirmesh commented 1 month ago

Your current environment

The output of `curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'`

πŸ› Describe the bug

I used the command below to start the container:

docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:24.06-vllm-python-py3 tritonserver --model-repository ./model_repository

However, when I try the following curl request:

$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'

this is the log I am seeing. Here is the official doc I am following: https://github.com/triton-inference-server/vllm_backend

2024-07-23 19:39:35 I0723 14:09:35.070084 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
2024-07-23 19:39:35 I0723 14:09:35.070413 1 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
2024-07-23 19:39:35 I0723 14:09:35.119975 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
2024-07-23 19:39:36 W0723 14:09:36.049714 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:39:37 W0723 14:09:37.055894 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:39:38 W0723 14:09:38.056658 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] Engine background task failed

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] Traceback (most recent call last):

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] task.result()

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] has_requests_in_progress = await asyncio.wait_for(

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return fut.result()

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] request_outputs = await self.engine.step_async()

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = await self.model_executor.execute_model_async(

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 101, in execute_model_async

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = await make_async(self.driver_worker.execute_model

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] result = self.fn(*self.args, **self.kwargs)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return func(*args, **kwargs)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 302, in execute_model

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = self.model_runner.execute_model(seq_group_metadata_list,

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return func(*args, **kwargs)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 320, in execute_model

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] ) = self.prepare_input_tensors(seq_group_metadata_list)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 270, in prepare_input_tensors

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] ) = self._prepare_prompt(seq_group_metadata_list)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 158, in _prepare_prompt

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] attn_metadata = self.attn_backend.make_metadata(

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return FlashAttentionMetadata(*args, **kwargs)

2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'
2024-07-23 19:51:51 ERROR:asyncio:Exception in callback _raise_exception_on_finish(error_callback=>)(<Task finishe...'is_prompt'")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32
2024-07-23 19:51:51 handle: <Handle _raise_exception_on_finish(error_callback=>)(<Task finishe...'is_prompt'")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32>
2024-07-23 19:51:51 Traceback (most recent call last):
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
2024-07-23 19:51:51 task.result()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
2024-07-23 19:51:51 has_requests_in_progress = await asyncio.wait_for(
2024-07-23 19:51:51 File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-07-23 19:51:51 return fut.result()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
2024-07-23 19:51:51 request_outputs = await self.engine.step_async()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
2024-07-23 19:51:51 output = await self.model_executor.execute_model_async(
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 101, in execute_model_async
2024-07-23 19:51:51 output = await make_async(self.driver_worker.execute_model
2024-07-23 19:51:51 File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-07-23 19:51:51 result = self.fn(*self.args, **self.kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 return func(*args, **kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 302, in execute_model
2024-07-23 19:51:51 output = self.model_runner.execute_model(seq_group_metadata_list,
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 return func(*args, **kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 320, in execute_model
2024-07-23 19:51:51 ) = self.prepare_input_tensors(seq_group_metadata_list)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 270, in prepare_input_tensors
2024-07-23 19:51:51 ) = self._prepare_prompt(seq_group_metadata_list)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 158, in _prepare_prompt
2024-07-23 19:51:51 attn_metadata = self.attn_backend.make_metadata(
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
2024-07-23 19:51:51 return FlashAttentionMetadata(*args, **kwargs)
2024-07-23 19:51:51 TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'
2024-07-23 19:51:51
2024-07-23 19:51:51 The above exception was the direct cause of the following exception:
2024-07-23 19:51:51
2024-07-23 19:51:51 Traceback (most recent call last):
2024-07-23 19:51:51 File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-23 19:51:51 self._context.run(self._callback, *self._args)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
2024-07-23 19:51:51 raise AsyncEngineDeadError(
2024-07-23 19:51:51 vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2024-07-23 19:51:51 I0723 14:21:51.620381 1 model.py:406] "[vllm] Error generating stream: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'"

ywang96 commented 1 month ago

IMO this should be a question for the NVIDIA Triton team - it seems to me that they didn't update the version properly, given the error:

FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'
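
One way to confirm this from inside the 24.06 container is to inspect the installed vLLM build directly. A rough sketch (the exact field list will depend on the vLLM version that ships in the image):

```python
# Sketch: run inside the tritonserver:24.06-vllm-python-py3 container to see
# which vLLM build is installed and what FlashAttentionMetadata actually accepts.
# If 'is_prompt' is not among the accepted parameters, the worker code and the
# attention backend in this install are out of sync, matching the traceback above.
import inspect

import vllm
from vllm.attention.backends.flash_attn import FlashAttentionMetadata

print("vllm version:", vllm.__version__)
print("FlashAttentionMetadata parameters:",
      list(inspect.signature(FlashAttentionMetadata.__init__).parameters))
```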