Open: simon-mo opened this issue 3 months ago

🚀 The feature, motivation and pitch

Great feedback from one of our users:
> Oh, some useful things to track in observability metrics are usage/performance of LoRA adapters, automatic prefix caching, chunked prefill, spec decode acceptance rate, etc.
Happy to shepherd this @simon-mo
@nunjunj will work on this
@simon-mo the Sarathi fork also has an extensive metric logging framework, if that is of interest - https://microsoft-research.wandb.io/msri-ai-infrastructure/llm-simulator-v2/reports/Sarathi-Benchmark-Suite-Demo--VmlldzoyNDMx?accessToken=d81jj8r843ntfhjle51uac1y57jvm80urmizil5rxt9jcafqnd1eib5swevpfejx
Great list!
Let me know if you want us to create a PR for this
This looks awesome! Does it only work with wandb for visualization?
+1! If it works with our stats logger, this is so good!
We might not be able to reuse the existing stats logger. The Sarathi logging framework has the following properties:
- It is designed for tracking several different kinds of metrics (CDFs, histograms, time series, bar charts, running averages, etc.) for large experiments. Based on the type of the plot, we use different data structures: a datasketch for CDFs, subsampled arrays for time series, and so on (see the sketch after this list). Without these optimizations, metric logging was becoming a massive overhead for us.
- It operates at different granularity levels: request, batch, and kernel. Some metrics can be obtained on the scheduler side, while others require worker-side data collection, so our metric store is designed to collate metrics from different workers into a unified view. We also support collation across replicas for a cluster-wide view.
- The output layer is designed to support multiple backends. Right now we support CSV, Plotly, and wandb; in the future, the plan is to extend this to a streaming metrics service like Prometheus/InfluxDB.
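To make the datasketch point above concrete, here is a minimal sketch of bounded-memory quantile tracking using the Apache DataSketches Python bindings. The library choice and the `k=200` parameter are my assumptions for illustration; Sarathi's actual data structures may differ.

```python
# A minimal sketch of low-overhead CDF/quantile tracking using a KLL
# sketch (pip install datasketches). Memory stays bounded no matter how
# many samples arrive, unlike storing every raw latency in an array.
from datasketches import kll_floats_sketch

latency_sketch = kll_floats_sketch(200)  # k=200 trades accuracy for memory

for latency_s in (0.012, 0.018, 0.151, 0.044):  # e.g. per-token latencies
    latency_sketch.update(latency_s)

# Approximate quantile queries without keeping the raw samples around.
p50, p99 = latency_sketch.get_quantiles([0.5, 0.99])
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```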
PROPOSED PLAN
For production monitoring, we effectively need to track two things:
A) Server-level metrics: global metrics that track the state and performance of the LLMEngine class. These are typically exposed as gauges or counters in Prometheus.
B) Request-level metrics: metrics that track the timing and flow of an individual SequenceGroup. These are typically exposed as histograms in Prometheus, and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
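To make the two categories concrete, here is a minimal sketch of how they might be declared, assuming the standard prometheus_client package. The metric names come from the lists below; the label set, buckets, and example updates are illustrative assumptions, not vLLM's actual wiring.

```python
# A minimal sketch of both metric categories, assuming prometheus_client.
from prometheus_client import Counter, Gauge, Histogram

LABELS = ["model_name"]

# A) Server-level: gauges/counters describing global LLMEngine state.
num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running.",
    labelnames=LABELS)
generation_tokens = Counter(
    "vllm:generation_tokens_total",
    "Total number of generation tokens produced.",
    labelnames=LABELS)

# B) Request-level: histograms of per-request timings (typical SLOs).
# Bucket boundaries below are illustrative assumptions.
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds",
    "Time to first token, in seconds.",
    labelnames=LABELS,
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])

# Example updates:
num_requests_running.labels(model_name="llama-2-7b").set(3)
generation_tokens.labels(model_name="llama-2-7b").inc(128)
time_to_first_token.labels(model_name="llama-2-7b").observe(0.42)
```

A Gauge can move up and down with engine state, a Counter only increases, and a Histogram buckets observations so quantiles can be derived at query time.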
We currently track the following Server-level metrics:
vllm:num_requests_running
vllm:num_requests_swapped
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc
vllm:cpu_cache_usage_perc
vllm:generation_tokens_total
We currently track the following Request-level metrics:
vllm:time_to_first_token_seconds
vllm:time_per_output_token_seconds
vllm:e2e_request_latency_seconds
As you can see, we are missing some basic tracking information. We need to start by "catching up" on the lacking metrics. We can start by matching IBM's TGI fork across its high-level categories (with latencies tracked via a Timer object). I will fill in the details over the course of the week.
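For reference, one plausible shape for such a Timer helper, sketched as a hypothetical context manager feeding a Prometheus histogram. This is not vLLM's or TGI's actual class; the histogram name comes from the request-level list above, but the wiring is an assumption.

```python
# A hypothetical Timer helper (not vLLM's actual class): times a code
# block and records the elapsed seconds into a Prometheus histogram.
import time
from contextlib import contextmanager

from prometheus_client import Histogram

# Illustrative histogram; label set and default buckets are assumptions.
request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",
    "End-to-end request latency in seconds.",
    labelnames=["model_name"])

@contextmanager
def timer(histogram: Histogram, **labels):
    """Time the enclosed block and observe the duration on exit."""
    start = time.monotonic()
    try:
        yield
    finally:
        histogram.labels(**labels).observe(time.monotonic() - start)

# Usage:
# with timer(request_latency, model_name="llama-2-7b"):
#     handle_request()
```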
(cc @dsikka @horheynm for visibility; we have some ongoing work that can be combined with this initiative)
Once this is done, we can expand the metrics to support "advanced" vLLM features.
Sample features include LoRA adapters, automatic prefix caching, chunked prefill, and speculative decoding acceptance rate (a sketch for the last one follows below).
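As one example, here is a hedged sketch of how speculative decoding acceptance could be counted with prometheus_client. The metric names and helper are hypothetical, not vLLM's actual implementation.

```python
# Hypothetical counters for speculative decoding; names are illustrative.
from prometheus_client import Counter

spec_draft_tokens = Counter(
    "vllm:spec_decode_draft_tokens_total",
    "Draft tokens proposed by the speculative model.")
spec_accepted_tokens = Counter(
    "vllm:spec_decode_accepted_tokens_total",
    "Draft tokens accepted by the target model.")

def record_spec_step(num_proposed: int, num_accepted: int) -> None:
    """Update both counters after one speculative decoding step."""
    spec_draft_tokens.inc(num_proposed)
    spec_accepted_tokens.inc(num_accepted)
```

The acceptance rate itself is best derived at query time as the ratio of the two counters over a window, rather than exported as a gauge.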
Do you primarily use this for production monitoring or offline analysis? @AgrawalAmey
@robertgshaw2-neuralmagic sorry for the delay. We have been mostly using it for offline analysis, but the data structures are designed such that they should be relatively easy to extend for production serving use cases as well.