Arch is an intelligent prompt gateway, engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs, all outside business logic. It is built on Envoy by core contributors to the Envoy proxy.
Time to first token (TTFT): This is how quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
Time per output token (TPOT): The average time to generate each output token after the first, as seen by each user querying the system.
Latency = TTFT + (TPOT) * (the number of tokens to be generated)
Output sequence length (OSL) - Total tokens generated
Input sequence length (ISL) - Total input tokens
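As a quick worked example of the latency formula above, here is a minimal sketch. The function name `estimated_latency_ms` and the sample numbers are illustrative assumptions, not part of the gateway.

```rust
/// Estimated end-to-end latency in milliseconds, following
/// Latency = TTFT + TPOT * (number of tokens to be generated).
/// Hypothetical helper for illustration only.
fn estimated_latency_ms(ttft_ms: f64, tpot_ms: f64, output_tokens: u32) -> f64 {
    ttft_ms + tpot_ms * f64::from(output_tokens)
}
```

For instance, with a TTFT of 200 ms, a TPOT of 20 ms, and an OSL of 100 tokens, the estimate is 200 + 20 * 100 = 2200 ms.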
We need histograms of the following metrics at 5-minute, 15-minute, 30-minute, 1-hour, 24-hour, and 7-day granularity.
Add TTFT and total request time to llm-gateway.
For TTFT, the timer should start when the request is received and stop when the first token is received from the upstream LLM.
The metrics should be reported in milliseconds and stored in the WasmMetrics struct.
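The timing rules above could be sketched roughly as follows. `RequestTimer` and its methods are hypothetical names for illustration; this is not the actual llm-gateway code, and it does not show how the values would be written into WasmMetrics. Timestamps are passed in explicitly, on the assumption that the filter gets the current time from its host environment.

```rust
use std::time::SystemTime;

/// Hypothetical per-request timer (illustration only).
struct RequestTimer {
    /// When the request was received.
    start: SystemTime,
    /// When the first token arrived from the upstream LLM, if any.
    first_token_at: Option<SystemTime>,
}

impl RequestTimer {
    /// Start timing when the request is received.
    fn start(now: SystemTime) -> Self {
        RequestTimer { start: now, first_token_at: None }
    }

    /// Call on every upstream token; only the first call is recorded.
    fn on_token(&mut self, now: SystemTime) {
        if self.first_token_at.is_none() {
            self.first_token_at = Some(now);
        }
    }

    /// Time to first token in milliseconds, or None if no token has arrived.
    fn ttft_ms(&self) -> Option<u64> {
        let first = self.first_token_at?;
        first
            .duration_since(self.start)
            .ok()
            .map(|d| d.as_millis() as u64)
    }

    /// Total request time in milliseconds, measured up to `now`.
    fn total_ms(&self, now: SystemTime) -> Option<u64> {
        now.duration_since(self.start)
            .ok()
            .map(|d| d.as_millis() as u64)
    }
}
```

Returning `Option` keeps a clock that moves backwards, or a request that never receives a token, from producing a bogus sample. The caller can skip recording in those cases.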