Open jerin-scalers-ai opened 3 months ago
I need to review the logic, but this may be due to preemptions, which cause the prefill to be recomputed; in that case we may be counting the TTFT twice.
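The double-counting hypothesis can be illustrated with a toy histogram (this is an illustration of the suspected behavior, not vLLM's actual code): if TTFT is observed once per prefill attempt and a preempted request's prefill is recomputed, both the histogram's sum and count are inflated.

```python
# Toy stand-in for a Prometheus histogram: tracks (sum, count),
# observed once per prefill attempt.
class ToyHistogram:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float):
        self.sum += value
        self.count += 1

ttft = ToyHistogram()

# Request A completes its prefill once: TTFT observed a single time.
ttft.observe(1.2)

# Request B is preempted and its prefill is recomputed; if TTFT is
# observed on each prefill, the same request is counted twice.
ttft.observe(0.9)   # first (discarded) prefill
ttft.observe(2.1)   # recomputed prefill after preemption

print(ttft.count)   # 3 observations for only 2 completed requests
```

This would explain a TTFT count that exceeds the number of finished requests whenever preemption is frequent.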
Another observation: the TTFT count matches the total number of completed requests in some cases, especially when I ran smaller models with vLLM on the same hardware.
@robertgshaw2-neuralmagic can we rely on these numbers (time_to_first_token_seconds) from the production metrics endpoint for performance analysis?
The vLLM metrics endpoint is reporting anomalous values for the Time To First Token (TTFT) sum, particularly when handling high concurrent requests. This appears to be a metrics collection or reporting bug.
When scaling up concurrent requests, the time_to_first_token_seconds_sum_total shows unexpected behavior:
In the table below, at 8192 concurrent requests, the TTFT sum is clearly incorrect.
| model_name | model_precision | TP | concurrent_requests | requests_per_second | time_taken_for_tests | tokensinfo_completion_tokens_total | time_to_first_token_seconds_sum_total | time_to_first_token_seconds_count_total | time_to_first_token_seconds_average | Throughput (tokens/second) |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 256 | 29.31 | 43.664 | 161040 | 1806.257423 | 1585 | 1.139594588 | 3688.16 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 512 | 31.78 | 80.558 | 324285 | 19657.84633 | 3540 | 5.553063936 | 4025.48 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 2048 | 34 | 301.157 | 1295917 | 476298.9308 | 14446 | 32.97099064 | 4303.13 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 4096 | 34.64 | 591.169 | 2592957 | 2014378.623 | 28815 | 69.90729215 | 4386.15 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 8192 | 34.64 | 1182.616 | 5187284 | 8.287261784 ← anomalous value | 57698 | 0.0001436316993 | 4386.28 |
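For reference, the average column above is just sum / count over the scraped histogram. A minimal sketch of that computation from a raw /metrics scrape (the `parse_metric` helper is hypothetical; the metric names match the table, and the sample text uses the 256-concurrency row):

```python
# Minimal sketch: compute average TTFT from a raw Prometheus text scrape.
def parse_metric(metrics_text: str, name: str) -> float:
    """Sum the values of all samples whose name matches `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

def average_ttft(metrics_text: str) -> float:
    s = parse_metric(metrics_text, "vllm:time_to_first_token_seconds_sum")
    c = parse_metric(metrics_text, "vllm:time_to_first_token_seconds_count")
    return s / c if c else float("nan")

# Values from the 256-concurrent-requests row of the table above.
sample = """vllm:time_to_first_token_seconds_sum{model_name="m"} 1806.257423
vllm:time_to_first_token_seconds_count{model_name="m"} 1585.0"""

print(round(average_ttft(sample), 4))  # 1.1396, matching the table
```

Applying the same division to the 8192-concurrency row yields the implausible 0.00014 s average, which is why the sum counter (not the count) looks corrupted there.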
Describe the bug
The vLLM metrics endpoint is showing a discrepancy between 'time_to_first_token_seconds_count' and the total number of successfully completed requests. To my understanding, the TTFT count should match the total number of requests processed.
Example:
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 108.0
The TTFT count (8060) is higher than the total number of requests (3810 + 108 = 3918).
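The mismatch above can be checked automatically against a scrape. A sketch, assuming the metric names shown in the example (the `parse_total` helper is hypothetical, and the sample below uses a shortened model label):

```python
# Cross-check: TTFT observation count vs. number of finished requests,
# from a raw Prometheus text scrape.
def parse_total(metrics_text: str, name: str) -> float:
    """Sum the values of all samples whose name matches `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

# Values from the example metrics above (model label shortened to "m").
sample = """vllm:time_to_first_token_seconds_count{model_name="m"} 8060.0
vllm:request_success_total{finished_reason="stop",model_name="m"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="m"} 108.0"""

ttft_count = parse_total(sample, "vllm:time_to_first_token_seconds_count")
finished = parse_total(sample, "vllm:request_success_total")
print(ttft_count - finished)  # 4142.0 extra TTFT observations
```

Note that request_success_total only covers 'stop' and 'length' finish reasons; aborted or still-running requests could account for part of the gap, but not a gap of this size.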
I have attached the entire metrics endpoint output for reference.