Closed · dosuken123 closed this 7 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 95.98%. Comparing base (c608c4e) to head (3ae762e).
:umbrella: View full report in Codecov by Sentry.
Hi @dosuken123, thanks for the proposal and implementation. It will be included in the next version, which will be released sometime this week.
@trallnag Thanks for the help! Much appreciated :bow:
What does this do?
This PR adds an option to exclude the streaming duration from the tracked HTTP response duration, so the latency metric reflects only the time until the response begins streaming.
Config Example:
https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/ai_gateway/app.py?ref_type=heads#L51-58
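A minimal sketch of how this could be wired up, assuming the option is exposed as a keyword argument on the default latency metric (the parameter name `should_exclude_streaming_duration` is taken from this PR and should be verified against the released API):

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator, metrics

app = FastAPI()

instrumentator = Instrumentator()
# Measure only the time until the response starts streaming, not the full
# transfer. Parameter name per this PR; verify against the released API.
instrumentator.add(metrics.latency(should_exclude_streaming_duration=True))
instrumentator.instrument(app).expose(app)
```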
Output example:
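As a hypothetical illustration (label and sample values are made up), the latency metric is exposed as a standard Prometheus histogram; with the option enabled, the observations cover only the pre-streaming duration:

```
http_request_duration_seconds_bucket{handler="/completions",le="0.1",method="GET",status="2xx"} 4
http_request_duration_seconds_sum{handler="/completions",method="GET",status="2xx"} 0.23
http_request_duration_seconds_count{handler="/completions",method="GET",status="2xx"} 4
```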
Fixes https://github.com/trallnag/prometheus-fastapi-instrumentator/issues/291
Why do we need it?
LLM inference APIs usually support HTTP streaming to improve the UX, so users perceive latency as the arrival of the first chunk rather than the last one. We want to instrument that duration.
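To make this concrete, here is a hypothetical streaming endpoint (names and timings are illustrative): the first chunk is ready almost immediately, but without this option the duration metric also counts the ~5 seconds spent streaming the remaining chunks.

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens():
    yield "Hello"  # first chunk: perceived latency ends here
    for _ in range(10):
        await asyncio.sleep(0.5)  # ~5 s of streaming follows
        yield " world"

@app.get("/completions")
async def completions():
    # Without the new option, the duration metric includes the full ~5 s
    # of streaming; with it, only the time until streaming starts.
    return StreamingResponse(generate_tokens(), media_type="text/plain")
```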
Who is this for?
GitLab, software developers, and anyone optimizing the latency of LLM applications.
Linked issues
Related to https://gitlab.com/gitlab-com/runbooks/-/merge_requests/6928#note_1796949998
Reviewer notes