vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Even Better Observability #3616

Open simon-mo opened 3 months ago

simon-mo commented 3 months ago

🚀 The feature, motivation and pitch

Great feedback from one of our users:

For our production monitoring, it'd be great to have more operational metrics for us to see the health, utilization, and pressure of the system, e.g. input/output token counts, success/error request counts, batch size, batched token counts, etc. IBM's TGI fork has a pretty nice list of metrics which may be a good reference.
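
For concreteness, here is a rough sketch of the kind of counters and gauges being asked for, using prometheus_client. The metric names are hypothetical, not vLLM's actual ones:

```python
# Illustrative only: hypothetical metric names, not vLLM's actual metrics.
from prometheus_client import Counter, Gauge, start_http_server

PROMPT_TOKENS = Counter("llm_prompt_tokens", "Prompt tokens processed")
GENERATION_TOKENS = Counter("llm_generation_tokens", "Tokens generated")
REQUEST_SUCCESS = Counter("llm_request_success", "Requests finished successfully")
REQUEST_ERROR = Counter("llm_request_error", "Requests finished with an error")
BATCH_SIZE = Gauge("llm_batch_size", "Sequences in the current running batch")
BATCHED_TOKENS = Gauge("llm_batched_tokens", "Tokens scheduled in the current step")

start_http_server(8000)  # expose a /metrics endpoint for Prometheus to scrape

# The engine loop would then update these each scheduler step, e.g.:
#   PROMPT_TOKENS.inc(num_new_prompt_tokens)
#   BATCH_SIZE.set(num_running_sequences)
#   REQUEST_SUCCESS.inc() or REQUEST_ERROR.inc() on request completion
```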

Alternatives

No response

Additional context

No response

simon-mo commented 3 months ago

Oh, some other useful things to track in observability metrics: usage/performance of LoRA adapters, automatic prefix caching, chunked prefill, speculative decoding acceptance rate, etc.
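
As an illustration of the last one (hypothetical names, not an existing vLLM metric), spec decode acceptance rate could be derived from a pair of counters:

```python
# Hypothetical counters for speculative decoding; acceptance rate = accepted / proposed.
from prometheus_client import Counter

SPEC_TOKENS_PROPOSED = Counter(
    "llm_spec_decode_proposed_tokens", "Draft tokens proposed per step")
SPEC_TOKENS_ACCEPTED = Counter(
    "llm_spec_decode_accepted_tokens", "Draft tokens accepted by the target model")

def record_spec_decode_step(num_proposed: int, num_accepted: int) -> None:
    """Called once per speculative decoding step; the rate is computed at query time."""
    SPEC_TOKENS_PROPOSED.inc(num_proposed)
    SPEC_TOKENS_ACCEPTED.inc(num_accepted)
```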

robertgshaw2-neuralmagic commented 3 months ago

Happy to shepherd this @simon-mo

simon-mo commented 3 months ago

@nunjunj will work on this

AgrawalAmey commented 3 months ago

@simon-mo the Sarathi fork also has an extensive metric logging framework, if that is of interest - https://microsoft-research.wandb.io/msri-ai-infrastructure/llm-simulator-v2/reports/Sarathi-Benchmark-Suite-Demo--VmlldzoyNDMx?accessToken=d81jj8r843ntfhjle51uac1y57jvm80urmizil5rxt9jcafqnd1eib5swevpfejx

simon-mo commented 3 months ago

Great list!

AgrawalAmey commented 3 months ago

Let me know if you want us to create a PR for this

rkooo567 commented 3 months ago

This looks awesome! Does it only work with wandb for visualization?

simon-mo commented 3 months ago

+1. If it works with our stats logger, this is so good!

AgrawalAmey commented 3 months ago

We might not be able to reuse the existing stats logger. The Sarathi logging framework has the following properties (a rough sketch of this design follows below):

  1. It is designed for tracking several different kinds of metrics (CDFs, histograms, time series, bar charts, running averages, etc.) for large experiments. Based on the type of plot, we use different data structures -- a datasketch for CDFs, sub-sampled arrays for time series, and so on. Without these optimizations, metric logging was becoming a massive overhead for us.

  2. It operates at different granularity levels - request, batch, and kernel. Some metrics can be obtained on the scheduler side, while others require worker-side data collection. So our metric store is designed to collate metrics from different workers to provide a unified view. We also support collation across replicas for a cluster-wide view.

  3. The data output layer is designed to support multiple backends. Right now we support CSV, plotly, and wandb. In the future, the plan is to also extend this to work with a streaming metrics service like Prometheus/InfluxDB.
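
Not the actual Sarathi code, but a minimal sketch of the shape described above: a store that picks a bounded data structure per metric type and flushes summaries to pluggable backends (a CSV backend is shown; wandb or Prometheus would be additional Backend implementations):

```python
# Minimal sketch only; names and structure are illustrative, not Sarathi's API.
import csv
import random
from abc import ABC, abstractmethod
from enum import Enum, auto


class MetricType(Enum):
    CDF = auto()          # stored as a bounded sample, summarized as percentiles
    TIME_SERIES = auto()  # stored as a sub-sampled set of observations


class BoundedSample:
    """Reservoir sample so per-metric memory stays O(capacity)."""

    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.values: list[float] = []
        self.count = 0

    def add(self, value: float) -> None:
        self.count += 1
        if len(self.values) < self.capacity:
            self.values.append(value)
        else:
            i = random.randrange(self.count)
            if i < self.capacity:
                self.values[i] = value

    def percentile(self, q: float) -> float:
        ordered = sorted(self.values)
        return ordered[int(q * (len(ordered) - 1))] if ordered else float("nan")


class Backend(ABC):
    @abstractmethod
    def write(self, name: str, summary: dict) -> None: ...


class CsvBackend(Backend):
    def __init__(self, path: str):
        self.path = path

    def write(self, name: str, summary: dict) -> None:
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([name, *summary.values()])


class MetricStore:
    """Collects metrics cheaply during the run, flushes summaries at the end."""

    def __init__(self, backends: list[Backend]):
        self.backends = backends
        self.metrics: dict[str, tuple[MetricType, BoundedSample]] = {}

    def register(self, name: str, metric_type: MetricType) -> None:
        self.metrics[name] = (metric_type, BoundedSample())

    def put(self, name: str, value: float) -> None:
        self.metrics[name][1].add(value)

    def flush(self) -> None:
        for name, (metric_type, sample) in self.metrics.items():
            summary = {
                "type": metric_type.name,
                "p50": sample.percentile(0.5),
                "p99": sample.percentile(0.99),
                "n": sample.count,
            }
            for backend in self.backends:
                backend.write(name, summary)
```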

robertgshaw2-neuralmagic commented 3 months ago

vLLM Production Monitoring Roadmap

PROPOSED PLAN

For production monitoring, we effectively need to track two things:

- Server-level Metrics
- Request-level Metrics

The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
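
A minimal sketch of that split (hypothetical field names, not vLLM's actual StatLogger internals): server-level metrics are point-in-time gauges of engine state sampled every iteration, while request-level metrics are per-request observations whose distributions those gauges help explain:

```python
# Hypothetical field names; not vLLM's actual Stats/StatLogger structures.
from dataclasses import dataclass


@dataclass
class ServerLevelStats:
    """Point-in-time engine state, sampled every scheduler iteration."""
    num_requests_running: int = 0
    num_requests_waiting: int = 0
    gpu_kv_cache_usage: float = 0.0   # fraction of KV cache blocks in use
    num_batched_tokens: int = 0


@dataclass
class RequestLevelStats:
    """Per-request observations, recorded when a request finishes."""
    num_prompt_tokens: int = 0
    num_generation_tokens: int = 0
    time_to_first_token_s: float = 0.0
    e2e_latency_s: float = 0.0
    finished_ok: bool = True
```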

Current State

We currently track the following Server-level metrics:

We currently track the following Request-level metrics:

As you can see, we are missing some basic tracking information

Production Monitoring Expansion Phase 1

We need to start by "catching up" on the missing metrics.

We can start by matching IBM's TGI fork across high-level categories.

I will fill in the details over the course of the week

(cc @dsikka @horheynm for visibility, we have some ongoing work that can be combined with this initiative)

Production Monitoring Expansion Phase 2-N

Once this is done, we can expand the metrics to support "advanced" vLLM features.

Sample features include:

robertgshaw2-neuralmagic commented 3 months ago

We might not be able to reuse the existing stats logger. The Sarathi logging framework has the following properties:

  1. It is designed for tracking several different kinds of metrics (CDFs, histograms, time series, bar charts, running averages, etc.) for large experiments. Based on the type of plot, we use different data structures -- a datasketch for CDFs, sub-sampled arrays for time series, and so on. Without these optimizations, metric logging was becoming a massive overhead for us.
  2. It operates at different granularity levels - request, batch, and kernel. Some metrics can be obtained on the scheduler side, while others require worker-side data collection. So our metric store is designed to collate metrics from different workers to provide a unified view. We also support collation across replicas for a cluster-wide view.
  3. The data output layer is designed to support multiple backends. Right now we support CSV, plotly, and wandb. In the future, the plan is to also extend this to work with a streaming metrics service like Prometheus/InfluxDB.

Do you primarily use this for production monitoring or offline analysis?

@AgrawalAmey

AgrawalAmey commented 2 months ago

@robertgshaw2-neuralmagic sorry for the delay. We have been mostly using it for offline analysis, but the data structures are designed such that it should be relatively easy to extend them for production serving use cases as well.