opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

[ChatQnA] Provide E2E performance metrics #391

Status: Open · opened by eero-t 1 month ago

eero-t commented 1 month ago

Currently one can get inferencing metrics from the TGI and TEI backend services, but there are no E2E metrics for the whole pipeline, e.g. the first-response latency and the response continuation rate.

I think at minimum the following (Prometheus) counter metrics would be needed from the ChatQnA frontend service: query, first-token and next-token counts, plus first-token and next-token duration sums.

That way one can get end-to-end latencies for user request processing, averaged over any interval: both the initial response delay and the rate at which the response is completed.

This can be used to monitor the whole service, and to see how the actual response time to user requests improves with backend scaling. Contrasting E2E metrics with the backend services' inferencing metrics shows whether other components need scaling, or whether e.g. the way the frontend uses the (scaled) backends needs to be improved.
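For example, once such counters are scraped, the averaged latencies can be derived with PromQL (a sketch, assuming the counter names proposed in this thread):

```promql
# Average time to first token over the last 5 minutes
rate(first_tokens_duration[5m]) / rate(first_tokens_count[5m])

# Average inter-token latency (response continuation) over the last 5 minutes
rate(next_tokens_duration[5m]) / rate(next_tokens_count[5m])
```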

PS. The same applies to the other provided example services, but for now I care only about ChatQnA.

eero-t commented 1 month ago

Providing such metrics is straightforward.

When a user query comes in: increment query_count and record the query's arrival timestamp.

When replying with the first token for that query: increment first_tokens_count and add the time elapsed since the query arrived to first_tokens_duration.

When replying with further tokens for that query: increment next_tokens_count and add the time elapsed since the previous token to next_tokens_duration.

When receiving a GET request for the "/metrics" URL path, respond with the current values of all counters:

# HELP query_count Total count of end-user queries
# TYPE query_count counter
query_count <total>
# HELP first_tokens_count Total count of all first tokens
# TYPE first_tokens_count counter
first_tokens_count <total>
# HELP first_tokens_duration Sum of first-token durations (seconds)
# TYPE first_tokens_duration counter
first_tokens_duration <total>
# HELP next_tokens_count Total count of all next tokens
# TYPE next_tokens_count counter
next_tokens_count <total>
# HELP next_tokens_duration Sum of next-token durations (seconds)
# TYPE next_tokens_duration counter
next_tokens_duration <total>
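The per-query bookkeeping described above could be sketched as follows. This is a minimal illustration in plain Python; the class and method names are hypothetical, and a real implementation would more likely use the prometheus_client library:

```python
import time


class StreamMetrics:
    """Sketch of the proposed ChatQnA frontend counters (hypothetical names)."""

    def __init__(self):
        self.query_count = 0
        self.first_tokens_count = 0
        self.first_tokens_duration = 0.0
        self.next_tokens_count = 0
        self.next_tokens_duration = 0.0
        self._start = {}  # query id -> arrival timestamp
        self._last = {}   # query id -> previous token timestamp

    def on_query(self, qid, now=None):
        # Called when a user query comes in
        now = time.monotonic() if now is None else now
        self.query_count += 1
        self._start[qid] = now

    def on_first_token(self, qid, now=None):
        # Called when replying with the first token for that query
        now = time.monotonic() if now is None else now
        self.first_tokens_count += 1
        self.first_tokens_duration += now - self._start[qid]
        self._last[qid] = now

    def on_next_token(self, qid, now=None):
        # Called when replying with further tokens for that query
        now = time.monotonic() if now is None else now
        self.next_tokens_count += 1
        self.next_tokens_duration += now - self._last[qid]
        self._last[qid] = now

    def expose(self):
        # Prometheus text exposition for GET /metrics (HELP lines omitted)
        lines = []
        for name in ("query_count", "first_tokens_count",
                     "first_tokens_duration", "next_tokens_count",
                     "next_tokens_duration"):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {getattr(self, name)}")
        return "\n".join(lines) + "\n"
```

In practice the `now` parameter would be omitted (falling back to `time.monotonic()`); it is exposed here only so the arithmetic can be tested deterministically.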

How to get Prometheus to scrape the metrics: https://github.com/opea-project/GenAIComps/issues/260

Note: query_count and first_tokens_count are separate counters because a query can fail, or be canceled, before the service produces any token for it, so the two values can differ.
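Keeping them separate also lets one derive the rate of queries that got no response token, e.g. with a PromQL sketch like:

```promql
# Queries per second that did not result in a first token
rate(query_count[5m]) - rate(first_tokens_count[5m])
```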

eero-t commented 1 month ago

Note: metrics should have a relevant prefix, e.g. chatqna_ for the ChatQnA service, so they can be identified more easily.

kevinintel commented 1 month ago

We will discuss how to implement it.