vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Slow TTFT(?) for Qwen2-72B-GPTQ-Int4 on H100 *2 #6781

Open cyc00518 opened 3 months ago

cyc00518 commented 3 months ago

I ran some tests to find better parameters for speeding things up, but none of them produced a significant change in TTFT (Time To First Token). Do these TTFT numbers look right? They feel a bit too slow to me...

Here's the test environment: 2× H100, vLLM 0.5.3.post1

Test script:

python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 12 --num_prompts=1000
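
(As a sanity check independent of benchmark_serving.py, a single streaming request gives a rough unloaded TTFT baseline. This is a minimal sketch, assuming the server's OpenAI-compatible endpoint is reachable at http://localhost:8000; the URL, port, and prompt are illustrative, so adjust them to your deployment.)

# Rough single-request TTFT check against the OpenAI-compatible server.
# URL/port, model name, and prompt below are assumptions, not from this thread.
import time, requests

url = "http://localhost:8000/v1/completions"
payload = {
    "model": "Qwen/Qwen2-72B-Instruct-GPTQ-Int4",
    "prompt": "Hello, my name is",
    "max_tokens": 32,
    "stream": True,
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # first SSE chunk roughly marks the first generated token
            print(f"Unloaded TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
            break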

1st test set:

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  172.44    
Total input tokens:                      217393    
Total generated tokens:                  193576    
Request throughput (req/s):              5.80      
Input token throughput (tok/s):          1260.72   
Output token throughput (tok/s):         1122.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          18799.19  
Median TTFT (ms):                        12578.35  
P99 TTFT (ms):                           54378.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          202.99    
Median TPOT (ms):                        206.15    
P99 TPOT (ms):                           351.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           280.82    
Median ITL (ms):                         145.93    
P99 ITL (ms):                            787.73    
==================================================

2nd test set (added enable_prefix_caching):

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable_prefix_caching

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  176.09    
Total input tokens:                      217393    
Total generated tokens:                  192400    
Request throughput (req/s):              5.68      
Input token throughput (tok/s):          1234.57   
Output token throughput (tok/s):         1092.64   
---------------Time to First Token----------------
Mean TTFT (ms):                          19519.29  
Median TTFT (ms):                        13625.79  
P99 TTFT (ms):                           57180.20  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          206.76    
Median TPOT (ms):                        210.99    
P99 TPOT (ms):                           374.38    
---------------Inter-token Latency----------------
Mean ITL (ms):                           290.25    
Median ITL (ms):                         150.11    
P99 ITL (ms):                            775.13    
==================================================

3rd test set (added enable_prefix_caching and max-num-batched-tokens):

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable_prefix_caching --max-num-batched-tokens 8000

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  176.37    
Total input tokens:                      217393    
Total generated tokens:                  193801    
Request throughput (req/s):              5.67      
Input token throughput (tok/s):          1232.56   
Output token throughput (tok/s):         1098.80   
---------------Time to First Token----------------
Mean TTFT (ms):                          19816.48  
Median TTFT (ms):                        13446.06  
P99 TTFT (ms):                           57615.44  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          206.11    
Median TPOT (ms):                        209.24    
P99 TPOT (ms):                           384.18    
---------------Inter-token Latency----------------
Mean ITL (ms):                           290.55    
Median ITL (ms):                         149.69    
P99 ITL (ms):                            805.39    
==================================================
KuntaiDu commented 3 months ago

Such high TTFT is likely due to the request rate being too high. If requests arrive faster than vLLM can process them, most of them sit in the waiting queue for a very long time before they are scheduled, which drives TTFT way up.
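
A rough back-of-the-envelope check with the numbers from the first run illustrates this (a sketch, not vLLM code): the benchmark offers 12 req/s but the engine sustains only about 5.8 req/s, so the backlog, and with it TTFT, keeps growing throughout the run.

# Numbers taken from the first benchmark result above.
offered_rate = 12.0    # --request-rate (req/s)
throughput   = 5.80    # measured request throughput (req/s)
num_prompts  = 1000

send_time  = num_prompts / offered_rate   # ~83 s to submit every request
drain_time = num_prompts / throughput     # ~172 s for the engine to finish them
print(f"Queued work left when the last request is sent: ~{drain_time - send_time:.0f} s")
# Tens of seconds of backlog, consistent with the very high mean/P99 TTFT above.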

cyc00518 commented 3 months ago

@KuntaiDu Thank you for your response. Would you mind explaining how I can tune the enable_prefix_caching and max_num_batched_tokens parameters to optimize performance?

In my case, I need to serve a company of approximately 40,000 to 50,000 people, so I set request_rate to 12 to simulate peak traffic. However, in my tests enable_prefix_caching and max_num_batched_tokens did not reduce TTFT; if anything, it got slightly slower.

Then I noticed in the documentation at https://docs.vllm.ai/en/latest/models/performance.html that max_num_batched_tokens should be used together with enable_chunked_prefill. I set that up, but it resulted in errors:

Server script (max-model-len, enable-chunked-prefill, enable_prefix_caching)

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --disable-log-requests --dtype auto --quantization marlin --max-model-len 8000 --enable-chunked-prefill --enable_prefix_caching

Test script 1 (request-rate=12, as before)

python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 12 --tokenizer /usr/src/app/model/Qwen2-72B-Instruct-GPTQ-Int4 --base-url http://0.0.0.0:40107 --num_prompts=1000

Test script 2 (reduced request-rate=5)

python benchmark_serving.py --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name sharegpt --backend vllm --model Qwen2-72B-Int4 --request-rate 5 --tokenizer /usr/src/app/model/Qwen2-72B-Instruct-GPTQ-Int4 --base-url http://0.0.0.0:40107 --num_prompts=1000

Error logs

Initially everything worked normally, but after around 400-500 requests the following error occurred:

INFO:     127.0.0.1:37558 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa4240f6d40

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
double-vin commented 3 months ago

same problem

cipolee commented 2 months ago

Same problem. Have you solved it? Does Qwen2 not support prefix caching?