katanemo / arch

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs - all outside business logic. Built by the core contributors of Envoy proxy, on Envoy.
https://archgw.com
Apache License 2.0

[llm-gateway] add tft and tot metrics and add them to grafana dashboard #246

Open adilhafeez opened 1 day ago

adilhafeez commented 1 day ago

Add TFT and total request time in llm-gateway.

For TFT, the timer should start when the request is received and stop when the first token is received from the upstream LLM.

The metrics should be recorded in milliseconds and stored in the WasmMetrics struct.
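A minimal sketch of the timing logic described above, under stated assumptions: `RequestTimer` and its method names are hypothetical and are not part of the existing llm-gateway code, and the layout of `WasmMetrics` is not shown in this issue, so the sketch only computes the two millisecond values that would be recorded there. Timestamps are passed in explicitly so the logic does not depend on how the filter reads the clock.

```rust
/// Hypothetical per-request timer. All timestamps are milliseconds
/// taken from the same clock by the caller.
#[derive(Debug, Default)]
pub struct RequestTimer {
    start_ms: Option<u64>,
    first_token_ms: Option<u64>,
    end_ms: Option<u64>,
}

impl RequestTimer {
    /// Call when the gateway receives the request.
    pub fn on_request_received(&mut self, now_ms: u64) {
        self.start_ms = Some(now_ms);
    }

    /// Call for every chunk received from the upstream LLM.
    /// Only the first call sets the first-token timestamp.
    pub fn on_upstream_chunk(&mut self, now_ms: u64) {
        if self.first_token_ms.is_none() {
            self.first_token_ms = Some(now_ms);
        }
    }

    /// Call when the upstream response is complete.
    pub fn on_response_complete(&mut self, now_ms: u64) {
        self.end_ms = Some(now_ms);
    }

    /// Time to first token in milliseconds, if both events were seen.
    pub fn ttft_ms(&self) -> Option<u64> {
        Some(self.first_token_ms?.saturating_sub(self.start_ms?))
    }

    /// Total request time in milliseconds, if both events were seen.
    pub fn total_ms(&self) -> Option<u64> {
        Some(self.end_ms?.saturating_sub(self.start_ms?))
    }
}
```

Returning `Option` keeps an incomplete request (for example, one that fails before any token arrives) from being recorded as a zero-millisecond sample.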

salmanap commented 1 day ago

We need the following metrics instrumented:

  1. Time to first token (TTFT): This is how quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
  2. Time per output token (TPOT): Time to generate an output token for each user that is querying the system.
  3. Latency: TTFT + TPOT * (the number of tokens to be generated)
  4. Output sequence length (OSL): Total tokens generated
  5. Input sequence length (ISL): Total input tokens
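As a worked illustration of the latency relation in item 3, the following sketch computes latency from TTFT and TPOT, and derives TPOT from a measured total time. The function names are hypothetical, all times are assumed to be in milliseconds, and TPOT is derived by dividing by the full output token count to match the formula as written in the list (some definitions exclude the first token instead).

```rust
/// Latency = TTFT + TPOT * output tokens, all times in milliseconds.
pub fn latency_ms(ttft_ms: f64, tpot_ms: f64, output_tokens: u64) -> f64 {
    ttft_ms + tpot_ms * output_tokens as f64
}

/// Derive TPOT from a measured total request time by inverting the
/// relation above. Returns None when no output tokens were generated.
pub fn tpot_ms(total_ms: f64, ttft_ms: f64, output_tokens: u64) -> Option<f64> {
    if output_tokens == 0 {
        return None;
    }
    Some((total_ms - ttft_ms) / output_tokens as f64)
}
```

For example, a TTFT of 250 ms and a TPOT of 20 ms over 100 output tokens gives a latency of 250 + 20 * 100 = 2250 ms.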

We need histograms of these metrics at 5-minute, 15-minute, 30-minute, 1-hour, 24-hour, and 7-day granularity, reporting the following statistics:

  1. Median (p50, 50th percentile)
  2. p90 (90th percentile)
  3. p95 (95th percentile)
  4. p99 (99th percentile)
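To make the percentile requirements concrete, here is a small sketch that selects the samples inside a time window and computes a percentile with the nearest-rank method. The function names and the `(timestamp_ms, value_ms)` sample layout are assumptions for illustration only. In a deployed setup these percentiles would more likely be derived from exported histogram buckets in the dashboard layer than computed from raw samples like this.

```rust
/// Keep the values of samples whose timestamp lies within `window_ms`
/// before `now_ms` (inclusive). Each sample is (timestamp_ms, value_ms).
pub fn values_in_window(samples: &[(u64, u64)], now_ms: u64, window_ms: u64) -> Vec<u64> {
    let cutoff = now_ms.saturating_sub(window_ms);
    samples
        .iter()
        .filter(|(ts, _)| *ts >= cutoff && *ts <= now_ms)
        .map(|(_, v)| *v)
        .collect()
}

/// Nearest-rank percentile: the smallest value such that at least
/// `p` percent of the samples are less than or equal to it.
/// Returns None for empty input or `p` outside 0..=100.
pub fn percentile(values: &[u64], p: f64) -> Option<u64> {
    if values.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    // Rank is 1-based; clamp to 1 so that p = 0 maps to the minimum.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

A 5-minute p95 would then be `percentile(&values_in_window(&samples, now_ms, 5 * 60 * 1000), 95.0)`, and the other windows differ only in the `window_ms` argument.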