Please revisit decision to emit an event for every message

stephentoub commented 5 days ago

The latest (1.28) spec requires emitting an event for every individual message. This is problematic in a variety of ways:

Messages that are part of the same request are not individual entities: they are only meaningful as part of the overall request being submitted to the LLM, which functionally turns the totality of those messages into a single prompt. The whole set of messages / the whole request should be logged, not each tiny piece of it individually.
A request might contain dozens or hundreds of messages. Logging each individually spams the listener, which then needs to do work to reconstitute the original request.
Any sampling being performed also then needs to do such work, as it otherwise might drop portions of a request.
The same message ends up getting logged on its own many times, each time a new request with the same history is resubmitted (augmented with additional messages). That's easy to understand when the messages are grouped, but when they're logged individually, it's much harder to reason about the duplication.

lmolkova commented 5 days ago

Logging all messages as one event comes with the following limitations:

hard to query. Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.
individual timestamps are lost. user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end
harder to control verbosity. With current approach, If I want only choices and don't want to capture system messages or tool calls I can configure it (OTel will provide all the hooks and options and afaik, .NET is also considering per-event-name/id control). With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.
different model from assistants. OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways). If users use the tool that's not showing chat history nicely, they probably don't care about the chat history and can disable those events or send them to low-cost destination.

The fact that chat history needs to be sent over and over again only shows how terribly inefficient existing GenAI solutions are and I don't think we should hide it.

[Update] I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI. Enabling debug logs would add another order of magnitude. Existing tools don't show it nicely, that's true.

stephentoub commented 5 days ago

Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.

That's the nature of the system; they do have different structure, and they are heterogenous.

user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload. For streaming, sure, but the spec already doesn't reflect the nature of streaming, e.g. you could get two tokens at very different times, but there's no notion of that in the spec, which necessitates that a streaming response be composed back into manufactured messages.

With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.

I'm not following. The system needs to have a mapping of what it would emit at various verbosity levels. If it's emitting individual messages, it needs to emit them at that level. If it's emitting them altogether, it can simply query for the current verbosity level (which any respectible system does anyway before logging) and choose to compose only the pieces that are relevant at that verbosity level. Any options to configure this are relevant to both approaches.

OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

If users use the tool that's not showing chat history nicely, they probably don't care about the chat history

I'm a user. I care about chat history. My tool is not showing it nicely.

and I don't think we should hide it

How is it hiding it? In both cases, all the content is logged. It's just a question of whether it's logged in a way that reflects reality (they're physically and logically all part of the same prompt), or whether the telemetry is manufacturing boundaries that don't exist.

I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI.

100%. All the more reason to not introduce artificial boundaries where they don't exist. If I have a sentence I want to log: "This is a test of the emergency broadcast system", can you imagine if we recommended that be logged as 9 distinct entries?

logger.LogInformation("This ");
logger.LogInformation("is ");
logger.LogInformation("a ");
logger.LogInformation("test ");
logger.LogInformation("of ");
logger.LogInformation("the ");
logger.LogInformation("emergency ");
logger.LogInformation("broadcast ");
logger.LogInformation("system");

yet that's exactly what this design is suggested be done for the prompt to the LLM, splitting the prompt into pieces that are then logged individually, even though it's all just part of the same input and is not interpretable separately.

lmolkova commented 5 days ago

Let me tackle things one-by-one

That's the nature of the system; they do have different structure, and they are heterogenous. I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

Different structure is a good signal that messages should be in separate events. The more complicated the structure is, the harder it is for the end users to parse it. Tooling suffers from it too, but I'm less worried about tools.

Imagine I'm writing a query that would return all completions with finish_reason that's not "stop". I'd write something similar to (in KQL)

logs | where name == "gen_ai.choice" | extend content = parse_json(body) | where content["finish_reason"] != "stop"

I can spend 15-45 mins to write a query that'd do the same for events like one below. I've done it a few times in the past and it's not easy. The hint is to use mv-expand if I recall correctly

{
      "name": "gen_ai.content",
      "body": {
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "\n\nHello there, how may I assist you today?"
            },
            "logprobs": null,
            "finish_reason": "stop"
          }
        ]
      }
    }

I encourage you to try it yourself.

You can say that it's hard to do in KQL and you might be right.

Let's compare it to JQ:

Now:

.logs[] 
| select(.name == "gen_ai.choice" and .body.content.finish_reason != "stop")

One event:

.logs[] 
| select(.name == "gen_ai.content")
| .body.choices[]
| select(.finish_reason != "stop")

both are relatively simple, yet I spent 1 min to write the first one, and a few more to write the second one

lmolkova commented 5 days ago

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload.

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions. Or when you report individual events (too calls) as they occur during thread run - PTAL at the https://github.com/microsoft/opentelemetry-semantic-conventions/pull/3

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

lmolkova commented 5 days ago

I'm a user. I care about chat history. My tool is not showing it nicely.

We can fix the tool or change the telemetry (by making it harder to query and control verbosity), why should we change the telemetry and not the tool?

stephentoub commented 5 days ago

yet I spent 1 min to write the first one, and a few more to write the second one

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

lmolkova commented 5 days ago

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

stephentoub commented 5 days ago

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions.

Sure it is. You log what you have in the largest unit you have them.

lmolkova commented 5 days ago

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

stephentoub commented 5 days ago

why should we change the telemetry and not the tool?

Why should the telemetry make it harder for the tool by exposing the data factored at a finer granularity than it actually is in reality?

stephentoub commented 5 days ago

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

stephentoub commented 5 days ago

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

I'm not saying that. The request and response are separate from each other and can be logged separately, just as an LLM prompt can be logged separately from the completion. I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

stephentoub commented 5 days ago

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

Why is the tool capable of that but not basic search?

lmolkova commented 5 days ago

Why is the tool capable of that but not basic search?

What do you mean by basic search? I can write a relatively simple query that would group logs by the spanid and traceid in KQL or JQ.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

I'll end up with different telemetry, but the structure of individual telemetry items will be the same.

I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

Would you put tool definition inside prompt event? I would not because I'd like people who don't want to log tool definitions to be able to disable that event using generic log filtering/config. Also I want to have the same event defining tool call for assistants and for chat completions.

So splitting may look artificial, but it does not mean it's bad.

If we summarize:

Group prompt messages together.
- Pros: fewer logs (same volume/cost), look nicer in the tools that don't provide GenAI-specific experience
- Cons: harder to query, inconsistent individual event format with assistants, harder to control verbosity - elaborated in https://github.com/open-telemetry/semantic-conventions/issues/1621#issuecomment-2502330515
Use individual message as event
- Pros: easier to query, individual events are consistent with assistants, verbosity control by event name. Tools that visualize chat visualize individual messages anyway
- Cons: looks bad in tools that do generic log visualization

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

samsp-msft commented 4 days ago

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

lmolkova commented 4 days ago

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

would it mean grouping prompts, tool definitions, attachments, tool calls, tool results, and completions into one uber event that captures everything?

The moment you separate prompts from completions into different records, you get a problem of bringing them together.

I'd like to challenge assumptions that

This is the absolutely most important scenario we need to optimize for, important enough to make other scenarios terrible
That it's hard to group logs by span-id and trace-id. Here's the ~query logs| summarize make_list(bag_pack("name", name, "content", body)) by trace_id, span_id

SteveSandersonMS commented 4 days ago

Is the goal to record the underlying physical reality of the protocol, or is it to record an interpretation of what happened in a form that eases certain queries? I can see arguments both ways.

Consumers who are comfortable with the wire format will likely bias towards seeing the "true" shape of the data, whereas consumers who have never seen the wire format but just want answers to high-level queries will likely prefer whatever makes their specific queries more intuitive.

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

TBH I don't think the pros/cons list is that clear cut. For example, when it comes to ease of querying, you could say:

Group prompt messages together: easier querying if your goal is to do anything with the requests/responses themselves
Use individual message as event: easier querying if your goal is to do anything with individual messages

Ultimately it's hard to argue from first principles that there's a single right answer here. It depends on scenarios. Maybe in the long run it will become clear that tools and consumption scenarios all settle one way or the other. If that's already settled, it's news to me.

You could make the "first principles" argument that in the absence of definite evidence that people want to vary the structure from its original shape, it's a safer to preserve the original wire format since that carries some additional level of truth. I don't know how the decision was made to switch to event-per-message so maybe there were a large number of people agreeing that's best in their scenario, but that's not clear from this GitHub issue alone.

lmolkova commented 4 days ago

The principles we use to instrument libraries and write semantic conventions that have an impact on this decision are:

consistency across similar technologies. If OpenAI wire format changes tomorrow, we won't change semantic conventions to match it. We try to come up with a format that works for multiple backends. Also for similar scenarios we try to generate similar telemetry. It mostly does not matter how user generates completion (via assistant thread run or via chat completion that carries full history on each request), so queries/dashboard/other tools should not really care that much.
layered operations and separation of responsibilities. If you need protocol-level data, use HTTP/gRPC/etc telemetry - one can log HTTP request/response bodies for GenAI calls - GenAI instrumentation won't normalize/sanitize/extract important details. GenAI instrumentation creates logical operations that cover (potentially) multiple requests over arbitrary protocol (or even local ones if you run in-proc model). Toy can think about GenAI instrumentation as something wrapping the public API surface of the client - telemetry it captures intentionally has nothing to do with the protocol.
use-cases. We focus on what you do with telemetry (perf: throughput/latency, reliability: errors, debugging: tracing + events, evaluations: events + metrics). Tools that provide AI-specific experience, usually visualize messages separately. You might want to drop recording instructions since it's mostly static or provide them once when you create your agent. You probably want to run queries against choices.

There are many others. I encourage you to participate in semconv discussions. There are extensive discussions on the topic - check out #980 and linked issues + we have notes and recordings of calls where we discussed it - all available here https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M/edit?tab=t.0#heading=h.ylazl6464n0c

SteveSandersonMS commented 4 days ago

Thanks @lmolkova, that's interesting and helps a bit with understanding the thought process.

To be clear when I referred to "record the underlying physical reality of the protocol" I meant representing the same conceptual information (e.g., having equivalent concepts and being at the same level of granularity), not the literal bytes.

can think about GenAI instrumentation as something wrapping the public API surface of the client

I guess that is meant quite loosely since you're also implying you don't want to reflect the differences between alternate API patterns (e.g., streaming vs nonstreaming have very different API shapes, likewise there tend to be different APIs for calling a stateless chat endpoint vs a stateful assistant).

open-telemetry / semantic-conventions

Please revisit decision to emit an event for every message #1621