opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

[ChatQnA] Provide E2E performance metrics #391

Status: Open · opened by eero-t 1 month ago

eero-t commented 1 month ago

Currently one can get inferencing metrics from the TGI and TEI backend services, but there are no E2E metrics for the whole pipeline, e.g. the first-response latency and the response continuation rate.

I think at minimum the following (Prometheus) counter metrics would be needed from the ChatQnA frontend service: query, first-token and next-token counts, plus first-token and next-token duration sums.

That way one can get end-to-end latencies for user request processing, averaged over any interval: both the initial response delay and the rate at which the response is completed.

This can be used to monitor the whole service, and to see how the actual response time to user requests improves with backend scaling. Contrasting E2E metrics with the backend services' inferencing metrics shows whether other components need scaling, or whether e.g. the way the frontend uses the (scaled) backends needs to be improved.
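For example, once such counters are scraped, the averaged latencies can be derived with PromQL (a sketch, assuming the counter names proposed in this thread):

```promql
# Average time to first token over the last 5 minutes
rate(first_tokens_duration[5m]) / rate(first_tokens_count[5m])

# Average inter-token latency (response continuation) over the last 5 minutes
rate(next_tokens_duration[5m]) / rate(next_tokens_count[5m])
```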

PS. The same applies to the other provided example services, but for now I care only about ChatQnA.

eero-t commented 1 month ago

Providing such metrics is straightforward.

When a user query comes in: increment query_count and record the query's arrival timestamp.

When replying with the first token for that query: increment first_tokens_count and add the time elapsed since the query arrived to first_tokens_duration.

When replying with further tokens for that query: increment next_tokens_count and add the time elapsed since the previous token to next_tokens_duration.

When receiving a GET request for the "/metrics" URL path, respond with the current values of all counters:

# HELP query_count Total count of end-user queries
# TYPE query_count counter
query_count <total>
# HELP first_tokens_count Total count of all first tokens
# TYPE first_tokens_count counter
first_tokens_count <total>
# HELP first_tokens_duration Sum of first-token durations (seconds)
# TYPE first_tokens_duration counter
first_tokens_duration <total>
# HELP next_tokens_count Total count of all next tokens
# TYPE next_tokens_count counter
next_tokens_count <total>
# HELP next_tokens_duration Sum of next-token durations (seconds)
# TYPE next_tokens_duration counter
next_tokens_duration <total>
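The per-query bookkeeping described above could be sketched as follows. This is a minimal illustration in plain Python; the class and method names are hypothetical, and a real implementation would more likely use the prometheus_client library:

```python
import time


class StreamMetrics:
    """Sketch of the proposed ChatQnA frontend counters (hypothetical names)."""

    def __init__(self):
        self.query_count = 0
        self.first_tokens_count = 0
        self.first_tokens_duration = 0.0
        self.next_tokens_count = 0
        self.next_tokens_duration = 0.0
        self._start = {}  # query id -> arrival timestamp
        self._last = {}   # query id -> previous token timestamp

    def on_query(self, qid, now=None):
        # Called when a user query comes in
        now = time.monotonic() if now is None else now
        self.query_count += 1
        self._start[qid] = now

    def on_first_token(self, qid, now=None):
        # Called when replying with the first token for that query
        now = time.monotonic() if now is None else now
        self.first_tokens_count += 1
        self.first_tokens_duration += now - self._start[qid]
        self._last[qid] = now

    def on_next_token(self, qid, now=None):
        # Called when replying with further tokens for that query
        now = time.monotonic() if now is None else now
        self.next_tokens_count += 1
        self.next_tokens_duration += now - self._last[qid]
        self._last[qid] = now

    def expose(self):
        # Prometheus text exposition for GET /metrics (HELP lines omitted)
        lines = []
        for name in ("query_count", "first_tokens_count",
                     "first_tokens_duration", "next_tokens_count",
                     "next_tokens_duration"):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {getattr(self, name)}")
        return "\n".join(lines) + "\n"
```

In practice the `now` parameter would be omitted (falling back to `time.monotonic()`); it is exposed here only so the arithmetic can be tested deterministically.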

How to get Prometheus to scrape the metrics: https://github.com/opea-project/GenAIComps/issues/260

Note: query_count and first_tokens_count are separate counters because a query can fail, or be canceled, before the service produces any token for it, so the two values can differ.
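Keeping them separate also lets one derive the rate of queries that got no response token, e.g. with a PromQL sketch like:

```promql
# Queries per second that did not result in a first token
rate(query_count[5m]) - rate(first_tokens_count[5m])
```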

eero-t commented 1 month ago

Note: metrics should have a relevant prefix, e.g. chatqna_ for the ChatQnA service, so they can be identified more easily.

kevinintel commented 1 month ago

We will discuss how to implement it.