vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Llama 3.1 8b-instruct on V-100 (Volta): model loads, but completion request fails #7844

Closed · ergleb78 closed this issue 2 months ago

ergleb78 commented 2 months ago

🐛 Describe the bug

I'm trying to run Llama 3.1 8B-Instruct on a V100 (Volta). The model loads fine with the following settings:

    command: >
      --host 0.0.0.0
      --dtype=half
      --tensor-parallel-size 8
      --enforce-eager
      --num-scheduler-steps 8
      --gpu_memory_utilization 0.95
      --enable-chunked-prefill=false
      --max-model-len=4096
      --trust-remote-code
      --served-model-name Llama-3.1-8b
      --model /models/meta-llama/Meta-Llama-3.1-8B-Instruct
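
For context, a minimal chat completion request of roughly this shape is enough to trigger the failure (an illustrative sketch, not the exact client used in the report; it assumes vLLM's OpenAI-compatible endpoint on the default port 8000 and the served model name from the settings above):

    # Hypothetical reproduction sketch against the OpenAI-compatible API that
    # vLLM serves; adjust host/port to match your deployment.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Llama-3.1-8b",  # matches --served-model-name above
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)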

Any chat completion request crashes:

llama-8b-1  |
llama-8b-1  | During handling of the above exception, another exception occurred:
llama-8b-1  |
llama-8b-1  | Traceback (most recent call last):
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
llama-8b-1  |     result = await app(  # type: ignore[func-returns-value]
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
llama-8b-1  |     return await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
llama-8b-1  |     await super().__call__(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
llama-8b-1  |     await self.middleware_stack(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
llama-8b-1  |     await self.app(scope, receive, _send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
llama-8b-1  |     with collapse_excgroups():
llama-8b-1  |   File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
llama-8b-1  |     self.gen.throw(typ, value, traceback)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 89, in collapse_excgroups
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 191, in __call__
llama-8b-1  |     response = await self.dispatch_func(request, call_next)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 361, in authentication
llama-8b-1  |     return await call_next(request)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 165, in call_next
llama-8b-1  |     raise app_exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 151, in coro
llama-8b-1  |     await self.app(scope, receive_or_disconnect, send_no_error)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
llama-8b-1  |     await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
llama-8b-1  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
llama-8b-1  |     await app(scope, receive, sender)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
llama-8b-1  |     await self.middleware_stack(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
llama-8b-1  |     await route.handle(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
llama-8b-1  |     await self.app(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
llama-8b-1  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
llama-8b-1  |     raise exc
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
llama-8b-1  |     await app(scope, receive, sender)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
llama-8b-1  |     response = await f(request)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
llama-8b-1  |     raw_response = await run_endpoint_function(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
llama-8b-1  |     return await dependant.call(**values)
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 271, in create_chat_completion
llama-8b-1  |     generator = await openai_serving_chat.create_chat_completion(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 188, in create_chat_completion
llama-8b-1  |     return await self.chat_completion_full_generator(
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 438, in chat_completion_full_generator
llama-8b-1  |     async for res in result_generator:
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 430, in iterate_with_cancellation
llama-8b-1  |     item = await awaits[0]
llama-8b-1  |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 416, in generate
llama-8b-1  |     raise request_output
llama-8b-1  | AssertionError
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: , Traceback (most recent call last):
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = self.execute_model(execute_model_req=None)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 322, in execute_model
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     output = self.model_runner.execute_model(
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 271, in execute_model
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     model_input = self._advance_step(
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 361, in _advance_step
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]     assert isinstance(attn_metadata, FlashAttentionMetadata)
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226] AssertionError
llama-8b-1  | (VllmWorkerProcess pid=213) ERROR 08-25 00:36:13 multiproc_worker_utils.py:226]
llama-8b-1  | INFO 08-25 00:36:27 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


Huarong commented 2 months ago

V100 does not support FlashAttention.
From the error message, it looks like --num-scheduler-steps requires FlashAttention. Try removing --num-scheduler-steps 8.
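
A launch configuration along these lines should avoid the multi-step scheduling path entirely (a sketch based on the settings in the original report, with only the --num-scheduler-steps flag removed; everything else is unchanged):

    command: >
      --host 0.0.0.0
      --dtype=half
      --tensor-parallel-size 8
      --enforce-eager
      --gpu_memory_utilization 0.95
      --enable-chunked-prefill=false
      --max-model-len=4096
      --trust-remote-code
      --served-model-name Llama-3.1-8b
      --model /models/meta-llama/Meta-Llama-3.1-8B-Instruct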

ergleb78 commented 2 months ago

V100 does not support FlashAttention. From the error message, it looks like --num-scheduler-steps requires FlashAttention. Try removing --num-scheduler-steps 8.

@Huarong thanks a lot! That worked perfectly. I could not figure out where the FlashAttention requirement was coming from. Appreciate your help!
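
As a quick sanity check, the GPU's compute capability explains why FlashAttention was never in play: vLLM's FlashAttention backend needs compute capability 8.0 (Ampere) or newer, while a V100 reports 7.0, so a different attention backend is selected and the multi-step runner's isinstance(attn_metadata, FlashAttentionMetadata) assertion fails. A minimal sketch to verify the capability locally, assuming PyTorch with CUDA is installed:

    import torch

    # A V100 (Volta) reports compute capability (7, 0); the FlashAttention
    # backend used by vLLM requires (8, 0) or newer.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    if (major, minor) < (8, 0):
        print("FlashAttention backend unavailable; vLLM falls back to another attention backend.")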