open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

GenAI (LLM): how to capture streaming #1170

lmolkova opened this issue 5 months ago

lmolkova commented 5 months ago

Some questions (and proposals) on capturing streaming LLM completions:

  1. Should the GenAI span cover the duration till the last token in case of streaming?
    • Yes, otherwise how do we capture completion, errors, usage, etc?
  2. Do we need an event when the first token comes? Or another span to capture duration-to-first token from the beginning?
    • This might be too verbose/not quite useful
  3. Do we need some indication on the span that it represents a streaming call?
  4. Do we need new metrics?
    • see https://github.com/open-telemetry/semantic-conventions/pull/1103 for server streaming metrics:
      • Time-to-first-token
      • Time-to-next-token
      • Number of active streams would also be useful - streaming seems to be quite hard and error-prone, and users would appreciate knowing when they don't close streams, don't read them to the end, etc.
  5. What should gen_ai.client.operation.duration capture?
    • same as span: time-to-last-token
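The answer to question 1 (the span covers the stream until the last token) implies wrapping the stream so that completion, errors, and usage can be recorded only once the stream is exhausted. A minimal sketch, assuming a generator-based client: the `on_end` callback is illustrative - real instrumentation would end the GenAI span and record `gen_ai.client.operation.duration` there instead.

```python
import time


def wrap_stream(chunks, on_end):
    """Yield chunks from a streaming completion, then report
    time-to-last-chunk and chunk count once the stream ends
    (where the GenAI span would be ended and usage recorded).
    The ``finally`` block also runs if the consumer abandons
    the stream or an error is raised mid-stream."""
    start = time.monotonic()
    count = 0
    try:
        for chunk in chunks:
            count += 1
            yield chunk
    finally:
        on_end({"chunks": count, "duration_s": time.monotonic() - start})
```

A consumer iterates the wrapped stream as usual; the callback fires exactly once, even on early exit, which is also where an "active streams" up-down counter could be decremented.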

karthikscale3 commented 3 months ago

Token generation latency is another metric that could be useful.

TaoChenOSU commented 1 month ago

time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

Another option would be to recommend that people indicate streaming vs. non-streaming in the operation name, such as streaming chat for streaming and chat for non-streaming.

lmolkova commented 1 month ago

> time-to-first-token and time-to-next-token could be hard to capture by some SDKs since a single chunk returned by some APIs may contain multiple tokens. Will time-to-first-response make more sense?

good catch! maybe time-to-first-chunk and time-to-next-chunk?
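The chunk-based variants would sidestep the multi-tokens-per-chunk problem, since only arrival times of chunks need to be observed. A minimal sketch of how an instrumentation could timestamp each chunk: the `record` callback and the metric names are illustrative placeholders, not part of the conventions - a real implementation would call a histogram's `record()` with the conventions' final metric names.

```python
import time


def timed_chunks(chunks, record):
    """Yield chunks, calling ``record(metric_name, seconds)`` with
    'time_to_first_chunk' for the first chunk (measured from iteration
    start) and 'time_to_next_chunk' for each subsequent inter-chunk gap.
    Metric names here are placeholders pending the conventions."""
    prev = time.monotonic()
    first = True
    for chunk in chunks:
        now = time.monotonic()
        record("time_to_first_chunk" if first else "time_to_next_chunk",
               now - prev)
        first = False
        prev = now
        yield chunk
```

This records one first-chunk observation and N-1 next-chunk observations for a stream of N chunks, regardless of how many tokens each chunk contains.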