open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

GenAI (LLM): how to capture streaming #1170

lmolkova opened this issue 5 months ago

lmolkova commented 5 months ago

Some questions (and proposals) on capturing streaming LLM completions:

  1. Should the GenAI span cover the duration till the last token in case of streaming?
    • Yes, otherwise how do we capture completion, errors, usage, etc?
  2. Do we need an event when the first token comes? Or another span to capture duration-to-first token from the beginning?
    • This might be too verbose/not quite useful
  3. Do we need some indication on the span that it represents a streaming call?
  4. Do we need new metrics?
    • see https://github.com/open-telemetry/semantic-conventions/pull/1103 for server streaming metrics:
      • Time-to-first-token
      • Time-to-next-token
      • Number of active streams would also be useful - streaming seems to be quite hard and error-prone, and users would appreciate knowing when they don't close streams, don't read them to the end, etc.
  5. What should gen_ai.client.operation.duration capture?
    • same as span: time-to-last-token
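The answer to question 1 (the span covers the stream until the last token) implies wrapping the stream so that completion, errors, and usage can be recorded only once the stream is exhausted. A minimal sketch, assuming a generator-based client: the `on_end` callback is illustrative - real instrumentation would end the GenAI span and record `gen_ai.client.operation.duration` there instead.

```python
import time


def wrap_stream(chunks, on_end):
    """Yield chunks from a streaming completion, then report
    time-to-last-chunk and chunk count once the stream ends
    (where the GenAI span would be ended and usage recorded).
    The ``finally`` block also runs if the consumer abandons
    the stream or an error is raised mid-stream."""
    start = time.monotonic()
    count = 0
    try:
        for chunk in chunks:
            count += 1
            yield chunk
    finally:
        on_end({"chunks": count, "duration_s": time.monotonic() - start})
```

A consumer iterates the wrapped stream as usual; the callback fires exactly once, even on early exit, which is also where an "active streams" up-down counter could be decremented.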

karthikscale3 commented 3 months ago

Token generation latency is another metric that could be useful.

TaoChenOSU commented 1 month ago

time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

Another option would be to recommend that people indicate streaming vs. non-streaming in the operation name, such as streaming chat for streaming and chat for non-streaming.

lmolkova commented 1 month ago

> time-to-first-token and time-to-next-token could be hard to capture by some SDKs since a single chunk returned by some APIs may contain multiple tokens. Will time-to-first-response make more sense?

good catch! maybe time-to-first-chunk and time-to-next-chunk?
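The chunk-based variants would sidestep the multi-tokens-per-chunk problem, since only arrival times of chunks need to be observed. A minimal sketch of how an instrumentation could timestamp each chunk: the `record` callback and the metric names are illustrative placeholders, not part of the conventions - a real implementation would call a histogram's `record()` with the conventions' final metric names.

```python
import time


def timed_chunks(chunks, record):
    """Yield chunks, calling ``record(metric_name, seconds)`` with
    'time_to_first_chunk' for the first chunk (measured from iteration
    start) and 'time_to_next_chunk' for each subsequent inter-chunk gap.
    Metric names here are placeholders pending the conventions."""
    prev = time.monotonic()
    first = True
    for chunk in chunks:
        now = time.monotonic()
        record("time_to_first_chunk" if first else "time_to_next_chunk",
               now - prev)
        first = False
        prev = now
        yield chunk
```

This records one first-chunk observation and N-1 next-chunk observations for a stream of N chunks, regardless of how many tokens each chunk contains.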