vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Missing prometheus metrics in `0.3.0` #2850

Closed: SamComber closed this issue 2 months ago

SamComber commented 7 months ago

First of all, thanks for the great open source library!

The docs promise a few additional metrics that I'm not seeing in vLLM 0.3.0. Have these been removed? E.g. if you hit /metrics on the OpenAI API server for a deployed model, you'll see no vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, or vllm:e2e_request_latency_seconds.

[screenshot: /metrics output missing the vllm:* latency metrics]
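
For reference, this is roughly how I'm checking (a minimal sketch; it assumes the server is reachable at http://localhost:8000):

# Fetch /metrics and report whether the documented vllm:* metrics are exposed.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    text = resp.read().decode()

expected = [
    "vllm:time_to_first_token_seconds",
    "vllm:time_per_output_token_seconds",
    "vllm:e2e_request_latency_seconds",
]
for name in expected:
    present = any(line.startswith(name) for line in text.splitlines())
    print(f"{name}: {'present' if present else 'MISSING'}")
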
SamComber commented 7 months ago

Just realised the image I'm pulling for the deployment uses vllm/engine/metrics.py from v0.3.0, not the tip of main.

Would it be possible to push another image version to Docker Hub with the updates?

https://hub.docker.com/r/vllm/vllm-openai/tags

robertgshaw2-neuralmagic commented 7 months ago

I think a new release will be pushed soon -> https://github.com/vllm-project/vllm/issues/2859

grandiose-pizza commented 6 months ago

> Just realised the image I'm pulling for the deployment uses vllm/engine/metrics.py from v0.3.0, not the tip of main. Would it be possible to push another image version to Docker Hub with the updates?

Hi,

@SamComber I want to use the metrics, but I see something completely different. I have exposed an API using api_server.py.

When I hit http://localhost:8075/metrics/, I get the following instead of the values described in the Metrics class. How do I see those metrics?

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
hmellor commented 6 months ago

@grandiose-pizza did you start your server with --disable-log-stats? That will prevent the Prometheus metrics from being updated.
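
The same flag exists on the engine arguments, so a programmatic launch must also leave it at its default for the vllm:* metrics to update. A minimal sketch (the model name is just a placeholder):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",   # placeholder model
    disable_log_stats=False,     # True would freeze the vllm:* Prometheus metrics
)
engine = AsyncLLMEngine.from_engine_args(engine_args)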

grandiose-pizza commented 6 months ago

@hmellor, no, it is set to False at startup:

INFO worker.py:1752 -- Started a local Ray instance.
ens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=True, disable_log_requests=False, max_log_len=None)

Do I need to add anything to this line? https://github.com/vllm-project/vllm/blob/563c1d7ec56aa0f9fdc28720f3517bf9297f5476/vllm/entrypoints/openai/api_server.py#L57

hmellor commented 6 months ago

Also, it's worth noting that what you're seeing is different because the original screenshot was taken before we switched from aioprometheus (third party) to prometheus_client (first party).
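
Roughly, the endpoint is now served by prometheus_client's ASGI app mounted on the FastAPI server. A simplified sketch of that pattern (not the exact api_server.py code):

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()
# prometheus_client renders the exposition text; FastAPI just mounts it.
app.mount("/metrics", make_asgi_app())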

grandiose-pizza commented 6 months ago

Could you please share what output is expected when using prometheus_client instead?

Is it different from the comment above? https://github.com/vllm-project/vllm/issues/2850#issuecomment-2028747162

hmellor commented 6 months ago

Changing Prometheus client packages only changes the non-vllm:... metrics, which is what you observed.

The vllm:... metrics should be unchanged.
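
For illustration, the vllm:* metrics follow the standard prometheus_client pattern, so their names and labels did not change with the client swap. A simplified sketch (metric names taken from the docs, not the exact metrics.py code):

from prometheus_client import Counter, Gauge

labelnames = ["model_name"]
counter_prompt_tokens = Counter(
    "vllm:prompt_tokens_total",
    "Number of prefill tokens processed.",
    labelnames,
)
gauge_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running on GPU.",
    labelnames,
)

# Updating works the same regardless of which client library backs the metric.
counter_prompt_tokens.labels(model_name="my-model").inc(128)
gauge_requests_running.labels(model_name="my-model").set(2)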

grandiose-pizza commented 6 months ago

It is quite strange. I'm trying to figure out how to obtain the stats like the ones here: https://github.com/vllm-project/vllm/blob/563c1d7ec56aa0f9fdc28720f3517bf9297f5476/vllm/engine/metrics.py#L20
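
One way to pull those stats programmatically is to parse the exposition text with prometheus_client's own parser. A sketch (assuming the server from this thread at http://localhost:8075):

import urllib.request
from prometheus_client.parser import text_string_to_metric_families

with urllib.request.urlopen("http://localhost:8075/metrics") as resp:
    text = resp.read().decode()

for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm:"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)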

yabea commented 6 months ago

> [quotes grandiose-pizza's earlier comment and full /metrics output]

I have encountered the same issue. If you have resolved it, please let me know. Thank you.

kalpesh22-21 commented 3 months ago

> [quotes grandiose-pizza's earlier comment and full /metrics output]

I am facing the same issue.

leokster commented 1 month ago

Is there any update or workaround for this issue?

pseudotensor commented 1 month ago

Seeing the same thing: only basic stats in /metrics, no usage, and Prometheus is not being populated.

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 156170.0
python_gc_objects_collected_total{generation="1"} 180292.0
python_gc_objects_collected_total{generation="2"} 114521.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 2102.0
python_gc_collections_total{generation="1"} 191.0
python_gc_collections_total{generation="2"} 10.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.4693138432e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.168400384e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72430453209e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 59.7
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

I think it may be broken in 0.5.4.

On the SAME host system, also running 0.5.4, just with a different model, I get more:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 3.74064e+07
python_gc_objects_collected_total{generation="1"} 3.649437e+06
python_gc_objects_collected_total{generation="2"} 157913.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 63451.0
python_gc_collections_total{generation="1"} 5766.0
python_gc_collections_total{generation="2"} 105.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.34454657024e+011
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.4426368e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72178452291e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 19123.94
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 79.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.95",num_cpu_blocks="1638",num_gpu_blocks="16334",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5344734e+07
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954848.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="7.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1.7057619094848633
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14.214749813079834
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14153.0
vllm:e2e_request_latency_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16216.0
vllm:e2e_request_latency_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17117.0
vllm:e2e_request_latency_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17400.0
vllm:e2e_request_latency_seconds_bucket{le="15.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17473.0
vllm:e2e_request_latency_seconds_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17484.0
vllm:e2e_request_latency_seconds_bucket{le="30.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="40.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="60.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15472.243278980255
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2.0
vllm:request_prompt_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 766.0
vllm:request_prompt_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 976.0
vllm:request_prompt_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1829.0
vllm:request_prompt_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1954.0
vllm:request_prompt_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 3540.0
vllm:request_prompt_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 4995.0
vllm:request_prompt_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5979.0
vllm:request_prompt_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 8386.0
vllm:request_prompt_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 12431.0
vllm:request_prompt_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14606.0
vllm:request_prompt_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17136.0
vllm:request_prompt_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17466.0
vllm:request_prompt_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5331484e+07
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 18.0
vllm:request_generation_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 45.0
vllm:request_generation_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 66.0
vllm:request_generation_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2394.0
vllm:request_generation_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5039.0
vllm:request_generation_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15113.0
vllm:request_generation_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16031.0
vllm:request_generation_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16491.0
vllm:request_generation_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17257.0
vllm:request_generation_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17436.0
vllm:request_generation_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17488.0
vllm:request_generation_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954675.0
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 965.0
vllm:request_success_total{finished_reason="stop",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16530.0
# HELP vllm:spec_decode_draft_acceptance_rate Speulative token acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
# HELP vllm:spec_decode_efficiency Speculative decoding system efficiency.
# TYPE vllm:spec_decode_efficiency gauge
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
# HELP vllm:spec_decode_num_emitted_tokens_total Number of emitted tokens.
# TYPE vllm:spec_decode_num_emitted_tokens_total counter
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0

Is it possible that some models do not support those other metrics?
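
A quick way to compare what the two servers expose (a sketch; both ports are hypothetical placeholders for the deployments above):

import urllib.request

def vllm_metric_families(base_url):
    # Collect the names of all vllm:* metric families exposed at base_url.
    with urllib.request.urlopen(base_url + "/metrics") as resp:
        text = resp.read().decode()
    return {
        line.split()[2]  # "# TYPE <name> <type>" -> <name>
        for line in text.splitlines()
        if line.startswith("# TYPE vllm:")
    }

working = vllm_metric_families("http://localhost:5000")  # server with vllm:* metrics
broken = vllm_metric_families("http://localhost:5001")   # server without them
print("missing on the broken server:", sorted(working - broken))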

pseudotensor commented 1 month ago

@hmellor Why was this issue closed as not planned? It seems clearly a bug in a useful feature.

robertgshaw2-neuralmagic commented 1 month ago

hmellor commented 1 month ago

@pseudotensor Annoyingly, "not planned" can mean many things (why we can't specify which thing, I don't know), but this was closed as stale originally.

[screenshot: GitHub close event showing the issue was originally closed as stale]
pseudotensor commented 1 month ago

No problem, it's all working in main. Thanks!