The output of `curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'` is the engine error shown in the log below.
🐛 Describe the bug
I used the command below to start the container:
```
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:24.06-vllm-python-py3 tritonserver --model-repository ./model_repository
```
However, when I send the following curl request:

```
$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```
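In case it helps reproduce, the same request can be issued from Python using only the standard library. This is a minimal sketch, assuming the server is reachable on localhost:8000 and the model directory is named `vllm_model` as in the curl command above; the `build_payload` and `generate` helper names are my own, not part of any Triton client API.

```python
import json
from urllib import request

# Endpoint mirrors the curl command above; "vllm_model" is assumed to be
# the model directory name in the model repository.
URL = "http://localhost:8000/v2/models/vllm_model/generate"


def build_payload(text_input: str, stream: bool = False,
                  temperature: float = 0.0) -> bytes:
    """Build the JSON body for Triton's /v2/.../generate endpoint."""
    body = {
        "text_input": text_input,
        "parameters": {"stream": stream, "temperature": temperature},
    }
    return json.dumps(body).encode("utf-8")


def generate(text_input: str) -> dict:
    """POST the request; requires a running Triton server on localhost:8000."""
    req = request.Request(
        URL,
        data=build_payload(text_input),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    print(generate("What is Triton Inference Server?"))
```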
This is the log I am seeing. Here is the official doc I am following: https://github.com/triton-inference-server/vllm_backend
2024-07-23 19:39:35 I0723 14:09:35.070084 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
2024-07-23 19:39:35 I0723 14:09:35.070413 1 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
2024-07-23 19:39:35 I0723 14:09:35.119975 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
2024-07-23 19:39:36 W0723 14:09:36.049714 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:39:37 W0723 14:09:37.055894 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:39:38 W0723 14:09:38.056658 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] Engine background task failed
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] Traceback (most recent call last):
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] task.result()
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] has_requests_in_progress = await asyncio.wait_for(
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return fut.result()
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] request_outputs = await self.engine.step_async()
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = await self.model_executor.execute_model_async(
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 101, in execute_model_async
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = await make_async(self.driver_worker.execute_model
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] result = self.fn(*self.args, **self.kwargs)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return func(*args, **kwargs)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 302, in execute_model
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] output = self.model_runner.execute_model(seq_group_metadata_list,
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return func(*args, **kwargs)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 320, in execute_model
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] ) = self.prepare_input_tensors(seq_group_metadata_list)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 270, in prepare_input_tensors
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] ) = self._prepare_prompt(seq_group_metadata_list)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 158, in _prepare_prompt
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] attn_metadata = self.attn_backend.make_metadata(
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] return FlashAttentionMetadata(*args, **kwargs)
2024-07-23 19:51:51 ERROR 07-23 14:21:51 async_llm_engine.py:45] TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'
2024-07-23 19:51:51 ERROR:asyncio:Exception in callback _raise_exception_on_finish(error_callback=>)(<Task finishe...'is_prompt'")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32
2024-07-23 19:51:51 handle: <Handle _raise_exception_on_finish(error_callback=>)(<Task finishe...'is_prompt'")>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32>
2024-07-23 19:51:51 Traceback (most recent call last):
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
2024-07-23 19:51:51 task.result()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
2024-07-23 19:51:51 has_requests_in_progress = await asyncio.wait_for(
2024-07-23 19:51:51 File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-07-23 19:51:51 return fut.result()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
2024-07-23 19:51:51 request_outputs = await self.engine.step_async()
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
2024-07-23 19:51:51 output = await self.model_executor.execute_model_async(
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 101, in execute_model_async
2024-07-23 19:51:51 output = await make_async(self.driver_worker.execute_model
2024-07-23 19:51:51 File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-07-23 19:51:51 result = self.fn(*self.args, **self.kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 return func(*args, **kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 302, in execute_model
2024-07-23 19:51:51 output = self.model_runner.execute_model(seq_group_metadata_list,
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-07-23 19:51:51 return func(*args, **kwargs)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 320, in execute_model
2024-07-23 19:51:51 ) = self.prepare_input_tensors(seq_group_metadata_list)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 270, in prepare_input_tensors
2024-07-23 19:51:51 ) = self._prepare_prompt(seq_group_metadata_list)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 158, in _prepare_prompt
2024-07-23 19:51:51 attn_metadata = self.attn_backend.make_metadata(
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
2024-07-23 19:51:51 return FlashAttentionMetadata(*args, **kwargs)
2024-07-23 19:51:51 TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'
2024-07-23 19:51:51
2024-07-23 19:51:51 The above exception was the direct cause of the following exception:
2024-07-23 19:51:51
2024-07-23 19:51:51 Traceback (most recent call last):
2024-07-23 19:51:51 File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-23 19:51:51 self._context.run(self._callback, *self._args)
2024-07-23 19:51:51 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
2024-07-23 19:51:51 raise AsyncEngineDeadError(
2024-07-23 19:51:51 vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2024-07-23 19:51:51 I0723 14:21:51.620381 1 model.py:406] "[vllm] Error generating stream: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'"