vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Mismatch in TTFT count and number of successful requests completed #8115

Open · jerin-scalers-ai opened this issue 3 months ago

jerin-scalers-ai commented 3 months ago

Your current environment

šŸ› Describe the bug

The vLLM metrics endpoint shows a discrepancy between 'time_to_first_token_seconds_count' and the total number of successfully completed requests. To my understanding, the TTFT observation count should match the total number of requests processed.

Example:

vllm:time_to_first_token_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 108.0

The TTFT count (8060) is higher than the total number of requests (3810 + 108 = 3918).
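For completeness, this is roughly how the two figures were compared: a minimal sketch, assuming the server exposes /metrics at http://localhost:8000 (adjust the URL for your deployment) and that the prometheus_client package is installed.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Scrape the metrics endpoint (URL is an assumption for this sketch).
text = requests.get("http://localhost:8000/metrics").text

ttft_count = 0.0
success_total = 0.0
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name == "vllm:time_to_first_token_seconds_count":
            ttft_count += sample.value
        elif sample.name == "vllm:request_success_total":
            # Summed over finished_reason labels ("stop", "length", ...)
            success_total += sample.value

# Expectation: one TTFT observation per finished request.
print(f"TTFT observations:   {ttft_count}")
print(f"Successful requests: {success_total}")
print(f"Difference:          {ttft_count - success_total}")
```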

The entire metrics endpoint output is included below for reference.

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 170852.0
python_gc_objects_collected_total{generation="1"} 1.0068423e+07
python_gc_objects_collected_total{generation="2"} 7.253821e+06
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 57712.0
python_gc_collections_total{generation="1"} 5245.0
python_gc_collections_total{generation="2"} 216.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.39083964416e+011
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.135692288e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72536837655e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1555.14
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 2208.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",num_cpu_blocks="1638",num_gpu_blocks="1558",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 169.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 1876.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.9897304236200257
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3973.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 483954.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 814577.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="0.75",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="7.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8060.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.5871782302856445
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 806517.0
vllm:time_per_output_token_seconds_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 438.11926794052124
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:e2e_request_latency_seconds_bucket{le="2.5",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:e2e_request_latency_seconds_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:e2e_request_latency_seconds_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="15.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="20.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="30.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 69.0
vllm:e2e_request_latency_seconds_bucket{le="40.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 142.0
vllm:e2e_request_latency_seconds_bucket{le="50.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 186.0
vllm:e2e_request_latency_seconds_bucket{le="60.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 244.0
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:e2e_request_latency_seconds_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:e2e_request_latency_seconds_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 1.1619259887051582e+06
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_prompt_tokens_bucket{le="2.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_prompt_tokens_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_prompt_tokens_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_prompt_tokens_bucket{le="20.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_prompt_tokens_bucket{le="50.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3905.0
vllm:request_prompt_tokens_bucket{le="100.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_bucket{le="200.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_bucket{le="500.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_bucket{le="1000.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_bucket{le="2000.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_prompt_tokens_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 155419.0
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_generation_tokens_bucket{le="2.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_generation_tokens_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_generation_tokens_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_generation_tokens_bucket{le="20.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 0.0
vllm:request_generation_tokens_bucket{le="50.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 8.0
vllm:request_generation_tokens_bucket{le="100.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 10.0
vllm:request_generation_tokens_bucket{le="200.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 1785.0
vllm:request_generation_tokens_bucket{le="500.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_generation_tokens_bucket{le="1000.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_generation_tokens_bucket{le="2000.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_generation_tokens_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_generation_tokens_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_generation_tokens_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 797555.0
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_bucket{le="2.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_bucket{le="20.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_best_of_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_bucket{le="1.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_bucket{le="2.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_bucket{le="5.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_bucket{le="10.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_bucket{le="20.0",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_bucket{le="+Inf",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_count{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
vllm:request_params_n_sum{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3918.0
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 3810.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 108.0
# HELP vllm:spec_decode_draft_acceptance_rate Speulative token acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
# HELP vllm:spec_decode_efficiency Speculative decoding system efficiency.
# TYPE vllm:spec_decode_efficiency gauge
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
# HELP vllm:spec_decode_num_emitted_tokens_total Number of emitted tokens.
# TYPE vllm:spec_decode_num_emitted_tokens_total counter
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 986.1923544067989
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="meta-llama/Meta-Llama-3-70B-Instruct"} 853.1425125755665

Before submitting a new issue...

robertgshaw2-neuralmagic commented 3 months ago

I need to review the logic, but this may be due to preemptions, which cause the prefill to be recomputed; we may be counting it twice.
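For what it's worth, the dump above reports vllm:num_preemptions_total as 3973, which is the same order of magnitude as the gap of 8060 - 3918 = 4142, so the numbers are at least consistent with that hypothesis. Below is a toy sketch of the suspected double count, assuming a recompute-style preemption re-runs the prefill; this is an illustration only, not vLLM's actual logging code.

```python
import time

class Request:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.arrival_time = time.monotonic()

ttft_observations: list[float] = []

def on_first_token(request: Request) -> None:
    # Without a "TTFT already recorded" flag on the request, every prefill
    # completion produces a new histogram observation.
    ttft_observations.append(time.monotonic() - request.arrival_time)

req = Request("req-0")
on_first_token(req)   # first token produced normally
# ... preemption with recompute: the prefill is executed again ...
on_first_token(req)   # second observation for the same request

print(len(ttft_observations))  # 2 observations, but only 1 finished request
```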

jerin-scalers-ai commented 3 months ago

Another observation: the TTFT count matches the total number of completed requests in some cases, especially when running smaller models with vLLM on the same hardware.

jerin-scalers-ai commented 1 month ago

@robertgshaw2-neuralmagic can we rely on these numbers (time_to_first_token_seconds) from the production metrics endpoint for performance analysis?

jerin-scalers-ai commented 5 days ago

The vLLM metrics endpoint reports anomalous values for the Time To First Token (TTFT) sum, particularly under a high number of concurrent requests. This appears to be a metrics collection or reporting bug.

Observed Behavior

When scaling up concurrent requests, the time_to_first_token_seconds_sum_total shows unexpected behavior:

In the table below, for 8192 concurrent requests, the TTFT sum is clearly incorrect.

| model_name | model_precision | TP | concurrent_requests | requests_per_second | time_taken_for_tests | tokensinfo_completion_tokens_total | time_to_first_token_seconds_sum_total | time_to_first_token_seconds_count_total | time_to_first_token_seconds_average | Throughput (tokens/second) |
|---|---|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 256 | 29.31 | 43.664 | 161040 | 1806.257423 | 1585 | 1.139594588 | 3688.16 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 512 | 31.78 | 80.558 | 324285 | 19657.84633 | 3540 | 5.553063936 | 4025.48 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 2048 | 34 | 301.157 | 1295917 | 476298.9308 | 14446 | 32.97099064 | 4303.13 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 4096 | 34.64 | 591.169 | 2592957 | 2014378.623 | 28815 | 69.90729215 | 4386.15 |
| meta-llama/Llama-3.1-70B-Instruct | float16 | 2 | 8192 | 34.64 | 1182.616 | 5187284 | 8.287261784 (anomalous value) | 57698 | 0.0001436316993 | 4386.28 |
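The average column is simply sum / count scraped at the end of each run; a minimal sketch recomputing it from the table values shows how the 8192-concurrency row breaks the otherwise monotonic trend.

```python
# Values copied from the table above:
# (concurrent_requests, ttft_sum_total, ttft_count_total)
rows = [
    (256,   1806.257423,  1585),
    (512,   19657.84633,  3540),
    (2048,  476298.9308,  14446),
    (4096,  2014378.623,  28815),
    (8192,  8.287261784,  57698),  # anomalous sum
]

for concurrency, ttft_sum, ttft_count in rows:
    # Average TTFT per request; drops from ~70 s to ~0.00014 s at 8192.
    print(f"{concurrency:>5} concurrent: avg TTFT = {ttft_sum / ttft_count:.6f} s")
```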