Open lmolkova opened 5 months ago
Token Generation Latency is another metric that could be useful.

`time-to-first-token` and `time-to-next-token` could be hard to capture in some SDKs, since a single chunk returned by some APIs may contain multiple tokens. Would `time-to-first-response` make more sense?
Another option would be to recommend that people indicate streaming vs. non-streaming in the operation name, e.g. `streaming chat` for streaming and `chat` for non-streaming.
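A minimal sketch of this naming option, assuming the `gen_ai.operation.name` attribute from the gen_ai conventions; the `streaming ` prefix is just this proposal, not an adopted convention:

```python
def duration_attributes(operation: str, streaming: bool) -> dict:
    """Encode streaming vs. non-streaming in the operation name itself.

    The attribute key follows the gen_ai semantic conventions; the
    "streaming " prefix is only the naming proposal under discussion.
    """
    name = f"streaming {operation}" if streaming else operation
    return {"gen_ai.operation.name": name}


# e.g. duration_attributes("chat", streaming=True)
#   -> {"gen_ai.operation.name": "streaming chat"}
```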
> `time-to-first-token` and `time-to-next-token` could be hard to capture by some SDKs since a single chunk returned by some APIs may contain multiple tokens. Will `time-to-first-response` make more sense?
good catch! maybe `time-to-first-chunk` and `time-to-next-chunk`?
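Chunk-level timing sidesteps the multiple-tokens-per-chunk problem, because the instrumentation only observes when each chunk arrives, not how many tokens it holds. A minimal sketch of how an SDK wrapper could capture `time-to-first-chunk` and `time-to-next-chunk` over any streaming iterator (the class name and fields are illustrative, not from the conventions):

```python
import time


class ChunkTimer:
    """Wrap a streaming-response iterator and record chunk arrival latencies.

    A sketch only: a real instrumentation would feed these values into
    histogram instruments rather than storing them on the wrapper.
    """

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._start = time.monotonic()  # request start
        self._prev = None               # arrival time of the previous chunk
        self.time_to_first_chunk = None
        self.time_to_next_chunk = []

    def __iter__(self):
        return self

    def __next__(self):
        chunk = next(self._chunks)  # StopIteration propagates to the caller
        now = time.monotonic()
        if self._prev is None:
            self.time_to_first_chunk = now - self._start
        else:
            self.time_to_next_chunk.append(now - self._prev)
        self._prev = now
        return chunk
```

Usage: wrap the stream returned by the client, iterate as usual, and read the recorded latencies once (or while) the stream is consumed.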
Some questions (and proposals) on capturing streaming LLM completions: what should `gen_ai.client.operation.duration` capture?
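One interpretation for streaming calls is that the duration spans from the request start until the last chunk is consumed. A sketch under that assumption, where `record` stands in for a histogram-record callback (e.g. a `gen_ai.client.operation.duration` instrument) and the attribute key follows the gen_ai conventions:

```python
import time


def timed_stream(chunks, record):
    """Yield chunks from a streaming response, then report total duration.

    `record(duration_seconds, attributes)` is a hypothetical callback;
    it fires only when the stream is fully consumed, which is one of the
    open questions for streaming durations (abandoned streams would
    never be recorded by this sketch).
    """
    start = time.monotonic()
    for chunk in chunks:
        yield chunk
    record(time.monotonic() - start, {"gen_ai.operation.name": "chat"})
```

This makes the trade-off concrete: ending the measurement at stream exhaustion reflects what the caller experienced, but ties the metric to consumer behavior rather than to the server's response.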