vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Llama 3.1 8b-instruct on V-100 (Volta): model loads, but completion request fails #7844

Closed · ergleb78 closed this issue 2 months ago

ergleb78 commented 2 months ago

🐛 Describe the bug

I'm trying to run Llama 3.1 8B-Instruct on a V100 (Volta). The model loads fine with the following settings:

    command: >
      --host 0.0.0.0
      --dtype=half
      --tensor-parallel-size 8
      --enforce-eager
      --num-scheduler-steps 8
      --gpu_memory_utilization 0.95
      --enable-chunked-prefill=false
      --max-model-len=4096
      --trust-remote-code
      --served-model-name Llama-3.1-8b
      --model /models/meta-llama/Meta-Llama-3.1-8B-Instruct
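
For context, a minimal chat completion request of roughly this shape is enough to trigger the failure (an illustrative sketch, not the exact client used in the report; it assumes vLLM's OpenAI-compatible endpoint on the default port 8000 and the served model name from the settings above):

    # Hypothetical reproduction sketch against the OpenAI-compatible API that
    # vLLM serves; adjust host/port to match your deployment.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Llama-3.1-8b",  # matches --served-model-name above
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)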

Any chat completion request crashes:

llama-8b-1  |
llama-8b-1  | During handling of the above exception, another exception occurred:
llama-8b-1  |
llama-8b-1  | Traceback (most recent call last):
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
llama-8b-1  |     result = await app(  # type: ignore[func-returns-value]
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
llama-8b-1  |     return await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
llama-8b-1  |     await super().__call__(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
llama-8b-1  |     await self.middleware_stack(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
llama-8b-1  |     await self.app(scope, receive, _send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
llama-8b-1  |     with collapse_excgroups():
llama-8b-1  |   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
llama-8b-1  |     self.gen.throw(typ, value, traceback)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 89, in collapse_excgroups
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 191, in __call__
llama-8b-1  |     response = await self.dispatch_func(request, call_next)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 361, in authentication
llama-8b-1  |     return await call_next(request)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 165, in call_next
llama-8b-1  |     raise app_exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 151, in coro
llama-8b-1  |     await self.app(scope, receive_or_disconnect, send_no_error)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
llama-8b-1  |     await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
llama-8b-1  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
llama-8b-1  |     await app(scope, receive, sender)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
llama-8b-1  |     await self.middleware_stack(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
llama-8b-1  |     await route.handle(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
llama-8b-1  |     await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
llama-8b-1  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
llama-8b-1  |     await app(scope, receive, sender)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
llama-8b-1  |     response = await f(request)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
llama-8b-1  |     raw_response = await run_endpoint_function(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
llama-8b-1  |     return await dependant.call(**values)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 271, in create_chat_completion
llama-8b-1  |     generator = await openai_serving_chat.create_chat_completion(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 188, in create_chat_completion
llama-8b-1  |     return await self.chat_completion_full_generator(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 438, in chat_completion_full_generator
llama-8b-1  |     async for res in result_generator:
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 430, in iterate_with_cancellation
llama-8b-1  |     item = await awaits[0]
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 416, in generate
llama-8b-1  |     raise request_output
llama-8b-1  | AssertionError
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: , Traceback (most recent call last):
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = self.execute_model(execute_model_req=None)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 322, in execute_model
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 271, in execute_model
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     model_input = self._advance_step(
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 361, in _advance_step
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     assert isinstance(attn_metadata, FlashAttentionMetadata)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226] AssertionError
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]
llama-8b-1  | INFO 08-25 00:36:27 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


Huarong commented 2 months ago

V100 does not support FlashAttention.
From the error message, it looks like --num-scheduler-steps requires FlashAttention. Try removing --num-scheduler-steps 8.
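
A launch configuration along these lines should avoid the multi-step scheduling path entirely (a sketch based on the settings in the original report, with only the --num-scheduler-steps flag removed; everything else is unchanged):

    command: >
      --host 0.0.0.0
      --dtype=half
      --tensor-parallel-size 8
      --enforce-eager
      --gpu_memory_utilization 0.95
      --enable-chunked-prefill=false
      --max-model-len=4096
      --trust-remote-code
      --served-model-name Llama-3.1-8b
      --model /models/meta-llama/Meta-Llama-3.1-8B-Instruct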

ergleb78 commented 2 months ago

V100 does not support FlashAttention. From the error message, it looks like --num-scheduler-steps requires FlashAttention. Try removing --num-scheduler-steps 8.

@Huarong thanks a lot! That worked perfectly. I could not figure out where the FlashAttention requirement was coming from. Appreciate your help!
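
As a quick sanity check, the GPU's compute capability explains why FlashAttention was never in play: vLLM's FlashAttention backend needs compute capability 8.0 (Ampere) or newer, while a V100 reports 7.0, so a different attention backend is selected and the multi-step runner's isinstance(attn_metadata, FlashAttentionMetadata) assertion fails. A minimal sketch to verify the capability locally, assuming PyTorch with CUDA is installed:

    import torch

    # A V100 (Volta) reports compute capability (7, 0); the FlashAttention
    # backend used by vLLM requires (8, 0) or newer.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    if (major, minor) < (8, 0):
        print("FlashAttention backend unavailable; vLLM falls back to another attention backend.")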