cyc00518 opened this issue 3 months ago
Such high TTFT is likely due to the request rate being too high. Basically, if the request rate is too high, most of the requests will sit in the request buffer for a very long time before vLLM processes them, causing very high TTFT.
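A toy single-server queue (purely illustrative numbers, not vLLM's actual scheduler) shows how quickly the buffer wait blows up once the arrival rate approaches or exceeds what the engine can drain; the serving capacity of 13 req/s below is a made-up assumption:

```python
# Toy queue simulation (illustrative only, not vLLM internals):
# requests arrive at `arrival_rate` req/s, the engine drains `service_rate` req/s.
# TTFT ~= time spent waiting in the buffer + prefill time.
import random

def mean_wait(arrival_rate: float, service_rate: float, n: int = 5000) -> float:
    random.seed(0)
    t, free_at, waits = 0.0, 0.0, []
    for _ in range(n):
        t += random.expovariate(arrival_rate)   # Poisson arrivals
        start = max(t, free_at)                 # wait until the engine is free
        waits.append(start - t)
        free_at = start + 1.0 / service_rate    # deterministic service time
    return sum(waits) / len(waits)

for rate in (5, 10, 12, 15):
    # 13 req/s is a hypothetical serving capacity, not a measured number
    print(f"arrival {rate:>2} req/s -> mean queue wait {mean_wait(rate, 13):.2f} s")
```

Once the arrival rate exceeds the drain rate, the mean wait grows with the length of the run rather than settling, which is exactly the "requests pile up in the buffer" behavior described above.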
@KuntaiDu
Thank you for your response. I hope you don't mind explaining how I can adjust the enable_prefix_caching and max_num_batched_tokens parameters to optimize performance.
In my case, I need to serve a company of approximately 40,000 to 50,000 people, so I set the request_rate to 12 to simulate peak traffic. However, in my tests, enable_prefix_caching and max_num_batched_tokens did not reduce TTFT; in fact, it became slower.
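As a rough sanity check on the 12 req/s figure, a back-of-envelope estimate under purely hypothetical assumptions (e.g. about 5% of staff active at peak, each sending a request every few minutes) lands in the same ballpark:

```python
# Back-of-envelope only; the fraction and interval are made-up assumptions,
# not measured traffic.
employees = 50_000
peak_active_fraction = 0.05          # assume ~5% of staff chatting at peak
seconds_between_requests = 180       # assume one request every ~3 minutes each

peak_rate = employees * peak_active_fraction / seconds_between_requests
print(f"~{peak_rate:.1f} requests/second at peak")   # ~13.9 req/s
```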
Then I noticed in the documentation at https://docs.vllm.ai/en/latest/models/performance.html that max_num_batched_tokens needs to be used together with enable_chunked_prefill. So I set that up, but it resulted in errors:
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable-chunked-prefill --enable_prefix_caching
python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 12 --tokenizer /usr/src/app/model/Qwen2-72B-Instruct-GPTQ-Int4 --base-url http://0.0.0.0:40107 --num_prompts=1000
python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 5 --tokenizer /usr/src/app/model/Qwen2-72B-Instruct-GPTQ-Int4 --base-url http://0.0.0.0:40107 --num_prompts=1000
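For reference, I believe the same server flags map onto vLLM's offline Python API roughly as below; this is only a sketch under the assumption that the EngineArgs field names mirror the CLI flags, and I have not verified it on this exact model/version.

```python
# Sketch: the server settings from the command above, expressed via the
# offline LLM API (field names assumed to mirror the CLI flags; untested).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",
    quantization="marlin",
    dtype="auto",
    max_model_len=8000,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```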
With the commands above, everything was functioning normally at first, but after around 400-500 requests the following error occurred:
INFO: 127.0.0.1:37558 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa4240f6d40
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
Same problem.
Same problem here. Have you solved it? Does Qwen2 not support prefix caching?
I ran some tests to find better parameters for speeding things up, and there hasn't been a significant change in TTFT (Time To First Token). Is my TTFT reasonable? I feel it might be a bit too slow...
Here's the test environment:
2× H100, vLLM 0.5.3.post1
Test script:
python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 12 --num_prompts=1000
1st test set:
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000
2nd test set (added enable_prefix_caching):
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable_prefix_caching
3rd test set (added enable_prefix_caching and max-num-batched-tokens):
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable_prefix_caching --max-num-batched-tokens 8000
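To compare the runs I only look at the TTFT summary lines, but a small hypothetical helper like the one below (fed with per-request TTFTs in milliseconds; the sample values are made up, not my results) makes it easy to line up mean/median/p99 across the three configurations:

```python
# Hypothetical helper: summarize per-request TTFT samples (in ms) per run.
import statistics

def summarize(name: str, ttfts_ms: list[float]) -> None:
    ttfts = sorted(ttfts_ms)
    p99 = ttfts[min(len(ttfts) - 1, int(0.99 * len(ttfts)))]
    print(f"{name}: mean={statistics.mean(ttfts):.0f}ms "
          f"median={statistics.median(ttfts):.0f}ms p99={p99:.0f}ms")

# Made-up numbers purely to show the output format:
summarize("baseline", [850, 920, 1300, 4100])
summarize("prefix-caching", [900, 950, 1250, 4300])
summarize("prefix-caching + max-num-batched-tokens", [880, 940, 1280, 4200])
```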