Open jerin-scalers-ai opened 3 months ago
I need to review the logic, but this may be due to preemptions, which cause the prefill to be recomputed; in that case we may be counting the TTFT twice.
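The double-counting hypothesis can be illustrated with a toy histogram (this is an illustration of the suspected behavior, not vLLM's actual code): if TTFT is observed once per prefill attempt and a preempted request's prefill is recomputed, both the histogram's sum and count are inflated.

```python
# Toy stand-in for a Prometheus histogram: tracks (sum, count),
# observed once per prefill attempt.
class ToyHistogram:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float):
        self.sum += value
        self.count += 1

ttft = ToyHistogram()

# Request A completes its prefill once: TTFT observed a single time.
ttft.observe(1.2)

# Request B is preempted and its prefill is recomputed; if TTFT is
# observed on each prefill, the same request is counted twice.
ttft.observe(0.9)   # first (discarded) prefill
ttft.observe(2.1)   # recomputed prefill after preemption

print(ttft.count)   # 3 observations for only 2 completed requests
```

This would explain a TTFT count that exceeds the number of finished requests whenever preemption is frequent.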
Another observation: the TTFT count matches the total number of completed requests in some cases, especially when I ran smaller models with vLLM on the same hardware.
@robertgshaw2-neuralmagic can we rely on these numbers (time_to_first_token_seconds) from the production metrics endpoint for performance analysis?
The vLLM metrics endpoint is reporting anomalous values for the Time To First Token (TTFT) sum, particularly when handling high concurrent requests. This appears to be a metrics collection or reporting bug.
When scaling up concurrent requests, the time_to_first_token_seconds_sum_total shows unexpected behavior:
In the table below, at 8192 concurrent requests, the TTFT sum is clearly incorrect.
| model_name | model_precision | TP | concurrent_requests | requests_per_second | time_taken_for_tests | tokensinfo_completion_tokens_total | time_to_first_token_seconds_sum_total | time_to_first_token_seconds_count_total | time_to_first_token_seconds_average | Throughput (tokens/second) |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 256 | 29.31 | 43.664 | 161040 | 1806.257423 | 1585 | 1.139594588 | 3688.16 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 512 | 31.78 | 80.558 | 324285 | 19657.84633 | 3540 | 5.553063936 | 4025.48 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 2048 | 34 | 301.157 | 1295917 | 476298.9308 | 14446 | 32.97099064 | 4303.13 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 4096 | 34.64 | 591.169 | 2592957 | 2014378.623 | 28815 | 69.90729215 | 4386.15 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 8192 | 34.64 | 1182.616 | 5187284 | 8.287261784 ← anomalous value | 57698 | 0.0001436316993 | 4386.28 |
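For reference, the average column above is just sum / count over the scraped histogram. A minimal sketch of that computation from a raw /metrics scrape (the `parse_metric` helper is hypothetical; the metric names match the table, and the sample text uses the 256-concurrency row):

```python
# Minimal sketch: compute average TTFT from a raw Prometheus text scrape.
def parse_metric(metrics_text: str, name: str) -> float:
    """Sum the values of all samples whose name matches `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

def average_ttft(metrics_text: str) -> float:
    s = parse_metric(metrics_text, "vllm:time_to_first_token_seconds_sum")
    c = parse_metric(metrics_text, "vllm:time_to_first_token_seconds_count")
    return s / c if c else float("nan")

# Values from the 256-concurrent-requests row of the table above.
sample = """vllm:time_to_first_token_seconds_sum{model_name="m"} 1806.257423
vllm:time_to_first_token_seconds_count{model_name="m"} 1585.0"""

print(round(average_ttft(sample), 4))  # 1.1396, matching the table
```

Applying the same division to the 8192-concurrency row yields the implausible 0.00014 s average, which is why the sum counter (not the count) looks corrupted there.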
Describe the bug
The vLLM metrics endpoint is showing a discrepancy between 'time_to_first_token_seconds_count' and the total number of successfully completed requests. To my understanding, the TTFT count should match the total number of requests processed.
Example:
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 108.0
The TTFT count (8060) is higher than the total number of requests (3810 + 108 = 3918).
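The mismatch above can be checked automatically against a scrape. A sketch, assuming the metric names shown in the example (the `parse_total` helper is hypothetical, and the sample below uses a shortened model label):

```python
# Cross-check: TTFT observation count vs. number of finished requests,
# from a raw Prometheus text scrape.
def parse_total(metrics_text: str, name: str) -> float:
    """Sum the values of all samples whose name matches `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

# Values from the example metrics above (model label shortened to "m").
sample = """vllm:time_to_first_token_seconds_count{model_name="m"} 8060.0
vllm:request_success_total{finished_reason="stop",model_name="m"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="m"} 108.0"""

ttft_count = parse_total(sample, "vllm:time_to_first_token_seconds_count")
finished = parse_total(sample, "vllm:request_success_total")
print(ttft_count - finished)  # 4142.0 extra TTFT observations
```

Note that request_success_total only covers 'stop' and 'length' finish reasons; aborted or still-running requests could account for part of the gap, but not a gap of this size.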
I have attached the entire metrics endpoint output for reference.