Open: simon-mo opened this issue 3 months ago

🚀 The feature, motivation and pitch

Great feedback from one of our users:
> Oh, some useful things to track in observability metrics are usage/performance of LoRA adapters, automatic prefix caching, chunked prefill, spec decode acceptance rate, etc.
Happy to shepherd this @simon-mo
@nunjunj will work on this
@simon-mo the Sarathi fork also has an extensive metric logging framework, if that is of interest - https://microsoft-research.wandb.io/msri-ai-infrastructure/llm-simulator-v2/reports/Sarathi-Benchmark-Suite-Demo--VmlldzoyNDMx?accessToken=d81jj8r843ntfhjle51uac1y57jvm80urmizil5rxt9jcafqnd1eib5swevpfejx
Great list!
Let me know if you want us to create a PR for this
This looks awesome! Does it only work with wandb for visualization?
+1! If it works with our stats logger, this is so good!
We might not be able to reuse the existing stats logger. The Sarathi logging framework has the following properties:
- It is designed for tracking several different kinds of metrics (CDFs, histograms, time series, bar charts, running averages, etc.) for large experiments. Based on the type of the plot, we use different data structures: a datasketch for CDFs, subsampled arrays for time series, and so on (see the sketch after this list). Without these optimizations, metric logging was becoming a massive overhead for us.
- It operates at different granularity levels: request, batch, and kernel. Some metrics can be obtained on the scheduler side, while others require worker-side data collection, so our metric store is designed to collate metrics from different workers into a unified view. We also support collation across replicas for a cluster-wide view.
- The output layer is designed to support multiple backends. Right now we support CSV, Plotly, and wandb; in the future, the plan is to extend this to a streaming metrics service like Prometheus/InfluxDB.
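To make the datasketch point above concrete, here is a minimal sketch of bounded-memory quantile tracking using the Apache DataSketches Python bindings. The library choice and the `k=200` parameter are my assumptions for illustration; Sarathi's actual data structures may differ.

```python
# A minimal sketch of low-overhead CDF/quantile tracking using a KLL
# sketch (pip install datasketches). Memory stays bounded no matter how
# many samples arrive, unlike storing every raw latency in an array.
from datasketches import kll_floats_sketch

latency_sketch = kll_floats_sketch(200)  # k=200 trades accuracy for memory

for latency_s in (0.012, 0.018, 0.151, 0.044):  # e.g. per-token latencies
    latency_sketch.update(latency_s)

# Approximate quantile queries without keeping the raw samples around.
p50, p99 = latency_sketch.get_quantiles([0.5, 0.99])
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```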
PROPOSED PLAN
For production monitoring, we effectively need to track two things:
A) Server-level metrics: global metrics that track the state and performance of the LLMEngine class. These are typically exposed as gauges or counters in Prometheus.
B) Request-level metrics: metrics that track the timing and flow of an individual SequenceGroup. These are typically exposed as histograms in Prometheus, and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
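To make the two categories concrete, here is a minimal sketch of how they might be declared, assuming the standard prometheus_client package. The metric names come from the lists below; the label set, buckets, and example updates are illustrative assumptions, not vLLM's actual wiring.

```python
# A minimal sketch of both metric categories, assuming prometheus_client.
from prometheus_client import Counter, Gauge, Histogram

LABELS = ["model_name"]

# A) Server-level: gauges/counters describing global LLMEngine state.
num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running.",
    labelnames=LABELS)
generation_tokens = Counter(
    "vllm:generation_tokens_total",
    "Total number of generation tokens produced.",
    labelnames=LABELS)

# B) Request-level: histograms of per-request timings (typical SLOs).
# Bucket boundaries below are illustrative assumptions.
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds",
    "Time to first token, in seconds.",
    labelnames=LABELS,
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])

# Example updates:
num_requests_running.labels(model_name="llama-2-7b").set(3)
generation_tokens.labels(model_name="llama-2-7b").inc(128)
time_to_first_token.labels(model_name="llama-2-7b").observe(0.42)
```

A Gauge can move up and down with engine state, a Counter only increases, and a Histogram buckets observations so quantiles can be derived at query time.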
We currently track the following Server-level metrics:
vllm:num_requests_running
vllm:num_requests_swapped
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc
vllm:cpu_cache_usage_perc
vllm:generation_tokens_total
We currently track the following Request-level metrics:
vllm:time_to_first_token_seconds
vllm:time_per_output_token_seconds
vllm:e2e_request_latency_seconds
As you can see, we are missing some basic tracking information. We need to start by "catching up" on the lacking metrics. We can start by matching IBM's TGI fork across its high-level categories (with latencies tracked via a Timer object). I will fill in the details over the course of the week.
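For reference, one plausible shape for such a Timer helper, sketched as a hypothetical context manager feeding a Prometheus histogram. This is not vLLM's or TGI's actual class; the histogram name comes from the request-level list above, but the wiring is an assumption.

```python
# A hypothetical Timer helper (not vLLM's actual class): times a code
# block and records the elapsed seconds into a Prometheus histogram.
import time
from contextlib import contextmanager

from prometheus_client import Histogram

# Illustrative histogram; label set and default buckets are assumptions.
request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",
    "End-to-end request latency in seconds.",
    labelnames=["model_name"])

@contextmanager
def timer(histogram: Histogram, **labels):
    """Time the enclosed block and observe the duration on exit."""
    start = time.monotonic()
    try:
        yield
    finally:
        histogram.labels(**labels).observe(time.monotonic() - start)

# Usage:
# with timer(request_latency, model_name="llama-2-7b"):
#     handle_request()
```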
(cc @dsikka @horheynm for visibility; we have some ongoing work that can be combined with this initiative)
Once this is done, we can expand the metrics to support "advanced" vLLM features.
Sample features include LoRA adapters, automatic prefix caching, chunked prefill, and speculative decoding acceptance rate (a sketch for the last one follows below).
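As one example, here is a hedged sketch of how speculative decoding acceptance could be counted with prometheus_client. The metric names and helper are hypothetical, not vLLM's actual implementation.

```python
# Hypothetical counters for speculative decoding; names are illustrative.
from prometheus_client import Counter

spec_draft_tokens = Counter(
    "vllm:spec_decode_draft_tokens_total",
    "Draft tokens proposed by the speculative model.")
spec_accepted_tokens = Counter(
    "vllm:spec_decode_accepted_tokens_total",
    "Draft tokens accepted by the target model.")

def record_spec_step(num_proposed: int, num_accepted: int) -> None:
    """Update both counters after one speculative decoding step."""
    spec_draft_tokens.inc(num_proposed)
    spec_accepted_tokens.inc(num_accepted)
```

The acceptance rate itself is best derived at query time as the ratio of the two counters over a window, rather than exported as a gauge.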
Do you primarily use this for production monitoring or offline analysis? @AgrawalAmey
@robertgshaw2-neuralmagic sorry for the delay. We have been mostly using it for offline analysis, but the data structures are designed such that they should be relatively easy to extend for production serving use cases as well.