open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

GenAI: do we need to support multiple finish reasons? #1277

Open · lmolkova opened this issue 2 months ago

lmolkova commented 2 months ago

See https://github.com/open-telemetry/semantic-conventions/pull/980#discussion_r1586695157.

Context:

Having an array attribute is problematic: it is harder to query and not really usable on metrics or on per-choice events.
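For illustration, a minimal sketch (Python, opentelemetry-api) of the array-valued shape discussed in #980; the span name and attribute value are just placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai-instrumentation-example")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # An array-valued attribute is easy to record on a span...
    span.set_attribute("gen_ai.response.finish_reasons", ["stop", "length"])
    # ...but backends generally can't group or filter metric time series by an
    # array value, and a per-choice event only ever needs a single element of it.
```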

Multiple choices are supported by only a limited set of models. Even when multiple choices are supported, some SDKs (e.g. openai-dotnet) choose not to expose them at the convenience-API level in order to simplify the design and provide a much friendlier experience. Most examples and documentation assume there is just one choice.

Given this, it seems that in most cases there will be just one choice and just one finish reason on each span.

The proposal is to

lmolkova commented 2 months ago

More context on the comma-separated-list option:

lmolkova commented 2 months ago

Based on offline discussions, we need to figure out the batching story too, and it might be related.

People may use n > 1 to save on costs (input tokens are charged once) - https://community.openai.com/t/how-does-n-parameter-work-in-chat-completions/288725.
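For concreteness, a small example of the n > 1 scenario (Python, openai client library, assuming an API key is configured), where a single request yields several choices, each with its own finish_reason:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest three names for a tracing library."}],
    n=3,  # three completions; the prompt (input tokens) is charged only once
)

# Each choice carries its own finish_reason (e.g. "stop", "length", "content_filter"),
# so one request/span can legitimately end up with several finish reasons.
for choice in response.choices:
    print(choice.index, choice.finish_reason)
```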

Assuming this is one of the popular scenarios, the alternatives to squashing the finish reasons into one value could be:

  1. maybe populate finish_reason on the span if and only if there is one choice. There could be other things we can do here:
    • maybe populate error.type if the worst of the finish reasons indicates an error?
  2. have a metric that measures the number of choices (in addition to the number of requests) and report finish_reason as an attribute there
  3. populate finish_reason as an attribute on the relevant per-choice event, so it's easier to query.

Points 2 and 3 seem mostly non-controversial and don't necessarily depend on point 1; a rough sketch of both is below.
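A sketch of what points 2 and 3 could look like (Python, opentelemetry-api; the counter name, the gen_ai.choice event name, and the singular gen_ai.response.finish_reason attribute are hypothetical here, not part of the conventions):

```python
from opentelemetry import metrics, trace

meter = metrics.get_meter("genai-instrumentation-example")
tracer = trace.get_tracer("genai-instrumentation-example")

# Point 2: a counter for choices (hypothetical instrument name), with the singular
# finish reason as a metric attribute so backends can group/filter on it directly.
choice_counter = meter.create_counter(
    "gen_ai.client.response.choices",
    description="Number of choices returned by the model",
)

finish_reasons = ["stop", "stop", "length"]  # e.g. one request with n=3

with tracer.start_as_current_span("chat gpt-4o-mini") as span:
    for index, reason in enumerate(finish_reasons):
        choice_counter.add(1, {"gen_ai.response.finish_reason": reason})
        # Point 3: each per-choice event carries its own scalar finish reason,
        # which is much easier to query than an array on the parent span.
        span.add_event(
            "gen_ai.choice",
            attributes={"gen_ai.response.finish_reason": reason, "index": index},
        )
```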