[llm-gateway] add tft and tot metrics and add them to grafana dashboard

katanemo / arch

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs - all outside business logic. Built by the core contributors of Envoy proxy, on Envoy.

Apache License 2.0

We need the following metrics instrumented:

Time to first token (TTFT): This is how quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
Time per output token (TPOT): The average time to generate one output token for each user querying the system.
Latency = TTFT + TPOT * (the number of tokens to be generated)
Output sequence length (OSL): Total number of tokens generated.
Input sequence length (ISL): Total number of input tokens.
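
As a rough illustration of these definitions, here is a minimal Python sketch that derives TTFT, TPOT, latency, and OSL from the arrival times of streamed tokens. The names `RequestTiming` and `summarize_stream` are hypothetical and not part of the gateway. The sketch assumes timestamps in seconds and divides the post-TTFT time by the full token count so that the latency formula above holds exactly; other definitions divide by OSL - 1 instead.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    ttft: float     # seconds from request sent to first output token
    tpot: float     # average seconds per output token
    latency: float  # seconds from request sent to last output token
    osl: int        # output sequence length (tokens generated)


def summarize_stream(request_sent: float, token_times: Sequence[float]) -> RequestTiming:
    """Derive per-request metrics from token arrival timestamps (hypothetical helper)."""
    if not token_times:
        raise ValueError("no output tokens were received")
    osl = len(token_times)
    ttft = token_times[0] - request_sent
    latency = token_times[-1] - request_sent
    # Assumption: TPOT is chosen so that latency == ttft + tpot * osl.
    tpot = (latency - ttft) / osl
    return RequestTiming(ttft=ttft, tpot=tpot, latency=latency, osl=osl)


timing = summarize_stream(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)
```

ISL is not derived here because it comes from tokenizing the prompt rather than from timing the response.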

We need histograms of these metrics at 5-minute, 15-minute, 30-minute, 1-hour, 24-hour, and 7-day granularity, reporting the following percentiles:

Median (p50)
p90 (90th percentile)
p95 (95th percentile)
p99 (99th percentile)
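
To pin down what is being asked for, the following Python sketch computes those percentiles over a trailing time window from raw `(timestamp, value)` samples. The names `WINDOWS`, `percentile`, and `window_percentiles` are hypothetical, and the nearest-rank method is an assumption. A real dashboard would more likely derive percentiles from histogram buckets in the metrics backend than from raw samples.

```python
from typing import Dict, Iterable, Sequence, Tuple

# Window lengths in seconds, matching the granularities requested above.
WINDOWS = {"5m": 300, "15m": 900, "30m": 1800, "1h": 3600, "24h": 86400, "7d": 604800}


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100 (assumed method)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def window_percentiles(
    samples: Iterable[Tuple[float, float]],
    now: float,
    window_seconds: int,
    percentiles: Sequence[int] = (50, 90, 95, 99),
) -> Dict[str, float]:
    """Percentiles of the samples whose timestamp falls in (now - window, now]."""
    values = [v for t, v in samples if now - window_seconds < t <= now]
    if not values:
        return {}
    return {f"p{p}": percentile(values, p) for p in percentiles}


# Example: one TTFT sample every 10 seconds, summarized over the last 5 minutes.
samples = [(i * 10.0, float(i)) for i in range(1, 101)]
print(window_percentiles(samples, now=1000.0, window_seconds=WINDOWS["5m"]))
```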

[llm-gateway] add tft and tot metrics and add them to grafana dashboard #246