Closed ThomasVitale closed 1 month ago
I do agree that INTERNAL makes sense too.
It's not a GenAI-specific problem - the same problem exists in messaging, database, or any other conventions for logical calls that happen on top of instrumented transport.
But you never know if this transport is instrumented. Application may choose to disable (not enable) it, or disable it in scope of a specific library. E.g. AWS client instrumentations usually allow to suppress nested HTTP spans.
Arguably client-side instrumentation never knows if the transport is instrumented, therefore cannot set kind conditionally. If the underlying protocol is not instrumented today, it does not mean it will never be instrumented. I.e. there is no way for the GenAI (db or messaging instrumentation) to know which kind it should use.
Long story short, I think the client/internal span distinction is not useful as it exists today. All other logical client semconv (messaing/db, grpc) also require client kind for the above reasons.
Because spring instrumentation began with zipkin semantics, it can be surprising to find things that lie over an http transport, and have no ability to set headers, marked as client anyway. Partially replicating transport properties such as server address is even more surprising. I had the same feedback.
In zipkin, not otel, instrumentation "client" indicates "exit span" outbound trace propagation except where there is none possible (db clients). The most common client were http and grpc, both of which could set headers, as well any other rpc'esque thing like dubbo. Sure double-instrumentation could happen, but in zipkin we would never mark something client that is most often a layer over http, unless it could set headers. So, absolutely, this is different.
What I learned from folks including @lmolkova is that client is vague. There is no coupling with "remote" or "exit span" or ability to set headers. It is a software category and says nothing about transport.
So, I agree that client/internal span is a coin toss. It bothers me we set this to client, but not as much in context of how semantics are defined.
Related: I was extremely surprised to find all of the instrumentation sdks skipping the http client under openai or all the other REST clients that implement this. I tried half a dozen: not one added http instrumentation to openai calls.
I would expect by end of the year latest, many people will notice this missing, and ask to fix that, especially as failures at the transport layer are important to know. So, more people will notice this as a result.
More people seeing this in practice may change the coin toss, but it may also result in the same. We'll see!
Thank you both for your answers. They helped me understand better the topic. I was indeed thinking of client spans in OpenTelemetry the same way they are used in OpenZipkin: an "exit span" towards a remote service. I might have overlooked a bit the OTel definition. I guess the key part is "usually", so it's not always an "exit span" towards a remote service. Did I understand correctly?
CLIENT Indicates that the span describes a request to some remote service. This span is usually the parent of a remote SERVER span and does not end until the response is received. (OpenTelemetry Traces Spec)
Still, it seems a bit odd to me getting a series of nested spans with kind CLIENT
. But I can see the general challenge in knowing whether the underlying transport layer is instrumented or not, which is not a problem exclusive to the GenAI semantic conventions and not something that can be fixed here or within the context of a single library. Perhaps it's more evident with GenAI since one of the main aspects of this new AI wave is that the models are "one HTTP API away" and application developers can use them easily with any HTTP client. So it's more explicit that the AI layer sits on top of the HTTP layer in those scenarios. But it might be my biased point of view :)
With respect to Java/Spring, the transport layer comes with the core framework and it's always instrumented (unless the user decides to turn that off). For clarity and consistency with the rest of the Java ecosystem, I would still use INTERNAL
as the kind of GenAI spans in Spring, but I guess that would mean not adhering completely to the Semantic Conventions.
@brunobat what's your take from Quarkus?
Related: I was extremely surprised to find all of the instrumentation sdks skipping the http client under openai or all the other REST clients that implement this. I tried half a dozen: not one added http instrumentation to openai calls.
I noticed that too. And that might become a problem soon, because some backend solutions/platforms that are embracing the GenAI Semantic Conventions seem to have some implicit assumption that an LLM-powered application would only send GenAI spans and metrics. When using those services from a production application instrumented at different levels, some of those solutions break because they are not expecting other types of spans/metrics other than the GenAI ones.
Something similar might happen with vector databases. I haven't look that deep into that part of observability yet, but it seems many libraries are not considering existing instrumentation of underlying DB transport libraries.
@ThomasVitale PS I want to be very clear, even though I'm not asked and also have no karma on this repo.
I understand the argument that you CAN make the llm span kind=client, but 100pct think that's the wrong optimization. The least bad choice is INTERNAL, and doing so encourages people to instrument the http calls underneath and prevent preventable problems such as broken tracing or lack of awareness of http failures.
Basically, it is more likely there is an actual instrumentable client under the llm span than not, and if that isn't there bad things will happen sooner or later. It is better to optimize for the usually case than the usually not in decisions.
Even though I yield to things I don't agree with or don't prefer, this is not the right choice and should be corrected.
History
TL;DR; kind had been undefined until april this year. It isn't completely converged either, even in new implementations. However, most current python sdks set kind=client. As far as my research shows, @drewby made the suggestion to set this to client and @nirga accepted that feedback. I see no evidence of a long discussion about the nuance of this choice, but maybe it happed in slack or on a call.
So, the first openai instrumentation was from @cartermp over a year ago, and this commit "Add embeddings and make it work with autoinstrumentation" switched to kind=client.
In Sep last year, carter raised an issue to make define llm semantics. His proposal didn't actually indicate a span kind preference at the llm level, so it would have been internal by default. He forked the semantics repo and worked on upstream definition, which became a PR in early November. This also didn't indicate a span kind preference.
@nirga took over the effort after above fizzled out into a new PR this march. Initially it also didn't indicate a span kind
Before this PR was merged, @drewby requested a must on kind=client, and Nir accepted that. So basically, this is the history.
The primary language community has been python, and so the specs have been primarily driven or followed by python impls. I have looked at many python sdks now with at least two developers: AgiFlow, langrace, litellm, openinference, openlit, and openllmetry. Most have converged to kind=client, either intentionally or due to "born after date" as it is still a new ecosystem.
A datapoint on this of interest was that there are still a couple python SDKs which have kind=internal. I highlighted to each that the spec asks for this to be kind=client despite not personally agreeing with it.
openinference cc @mikeldking https://github.com/Arize-ai/openinference/blob/22a93f9d8445ddb7269e23b5b10fcd007cc2a0d2/js/packages/openinference-instrumentation-openai/src/instrumentation.ts#L133 litellm cc @ishaan-jaff https://github.com/BerriAI/litellm/blob/main/litellm/integrations/opentelemetry.py#L248-L252
Next steps:
If we agree that the choice was made without the context provided here, I think it is fair to revisit this decision. We have changed other properties in many incompatible ways. My suggestion is to get feedback from @drewby about the "must" part, and weigh in based on new information that exists today.
The approver+ community of this spec is currently @lmolkova @drewby @cartermp and @nirga and a change would require buy-in from them, informed by implementation opinions as well passer bys.
Hope this helps!
We discussed this in one of the early Working Group meetings. The primary topic was on how to name the spans, and within that conversation, we considered other semantics for database, HTTP, etc., and decided on CLIENT as the span type. I want to ensure you that this was discussed, albeit briefly, though it may not be captured in the review/commit history.
One argument for keeping this as CLIENT is the differentiation it provides in orchestrating tools like LangChain. Specifically, it helps distinguish between internal steps (such as combining context and prompt) and external steps calling the LLM.
Referencing the SpanKind documentation:
Note: there are complex scenarios where a CLIENT span may have a child that is also logically a CLIENT span, or a PRODUCER span might have a local child that is a CLIENT span, depending on how the various libraries that are providing the functionality are built and instrumented. These scenarios, when they occur, should be detailed in the semantic conventions appropriate to the relevant libraries.
In the case of instrumenting orchestrators, there is a logical span that is calling out to a server which is different than an internal span doing work locally.
Given that our working group membership is expanding, and our conventions are marked experimental, revisiting this decision is possible. We could maintain the CLIENT designation and improve the documentation to provide more clarity. Alternatively, we could re-evaluate the "MUST" requirement for CLIENT in light of new information and feedback, or remove the CLIENT requirement and default to INTERNAL, aligning with the initial design.
I want to re-iterate a few things:
So giving instrumentations a flexibility would mean that they would populate it inconsistently depending on author familiarity with different use-cases and configuration options.
Since other OTel semantic conventions REQUIRE CLIENT, GenAI should not be different and should REQUIRE CLIENT too.
When I'm saying that INTERNAL makes sense, I mean conceptually - the INTERNAL is described so vaguely, you can call anything an internal span.
@lmolkova do you mind being specific when you say "all other similar conventions" maybe edit your comment? I would expect "similar" to be something like rest layers, such as repository interfaces, control APIs like k8s, etc. being specific will allow us to be on the same page on what we feel things like OpenAI are similar to.
fyi I found this from @mikeldking https://github.com/Arize-ai/openinference/issues/609 to change openinference to client
trying to help add more rigor here, since I understand one argument is what this spec is closer to.
one thing is clear that when something is a subtype of RPC it would be client for this side of work.
So, for example, aws-sdk is a subtype of that so implied kind=client. aws-sdk is a control api riding over a rest interface, so that's common. However, we don't currently (to my knowledge) define llm/genai in terms of rpc. So, if this rpc implied maybe we can make that more explicit in the spec, such as done in aws-sdk
Something a little less precise is the graphql semantic which is also a layer over http. It says the word "server" in the title, but doesn't discuss RPC or span kind in the doc, yet.
Personally, I don't think any of the existing semver are by nature like genai (e.g. openai in impl), but I understand there are strong feelings to the opposite.
@codefromthecrypt Databases and Messaging come to mind. Both will either require CLIENT or CONSUMER/PRODUCER on top of another layer that could potentially be HTTP rest API. These define logical requests that have meaning (ie. descriptive attributes associated with a thing that is a request, but not something you'd add to http span).
@drewby real question.. how is a prompt/completion request similar to a database or messaging?
both database and messaging are logical operations that run on top of some protocol (HTTP, gRPC, something not instrumented on the low level like SQL),
In all cases, we create 1 logical span for outer operation and as many protocol-level spans as there were protocol calls.
E.g. if there are 2 retries, there will be
parent LLM/DB/Messaging span - OK
or if there was authentication call, the would be something like
parent LLM/DB/Messaging span - OK
I.e. LLM instrumentation applies on the public API of the corresponding client library and above any retry/redirect/auth handlers in HTTP (or other protocol) pipelines.
ok I see where you are coming from even though I personally believe database and messaging are far wider abstractions than this. Appreciate you explaining what you feel is similar 👍
Since native/auto HTTP/gRPC instrumentations are already there, the LLM has no other choice than to follow the same pattern:
I.e. it does not know whether there is a client span or there is no client span under it. So it has no better choice than to be client itself.
In this sense there is no difference between LLM, S3, or Elasticsearch.
Thanks everyone for all the comments. After reading all your contributions, it looks to me that the description of Span Kind in the OpenTelemetry Specification is lacking something, and I'm glad to see @lmolkova starting a change request there.
Did I understand correctly that all spans that eventually might lead to a remote service call should be of kind CLIENT
?
And I guess the same would be true for SERVER
spans, since you never know which layer is actually instrumented?
Would that mean the use cases for INTERNAL
spans would be really minimal, perhaps reserved for spans within single applications with zero input/output interactions with other systems or for spans for purely internal operations like "mapping a data structure to another", with no connection whatsoever to a possible incoming or outgoing remote call?
I'm also wondering to what extent we need to make up for missing instrumentations at the transport level when instrumenting other layers (e.g. LLM) that build on top of those. Are there guidelines about which attributes to "duplicate" across different levels when defining conventions (e.g. server.address
)?
Apologies if they're silly questions, I'm trying to understand better how to assign correctly a kind to a given span with OpenTelemetry, with the right attributes. I appreciate all your help with clarifying the subject.
Did I understand correctly that all spans that eventually might lead to a remote service call should be of kind CLIENT?
Yes.
And I guess the same would be true for SERVER spans, since you never know which layer is actually instrumented?
Yes to some extent. Since multiple server layers are (at least sometimes) coming from the same framework instrumentation, it could be possible to coordinate who creates SERVER within that framework. Still, in Java multiple nested SERVERs could easily happen.
OTel java supports suppressing nested spans - https://opentelemetry.io/docs/zero-code/java/agent/disable/#instrumentation-span-suppression-behavior
I believe the default is that 'same-type-of-semconv' spans are suppressed. E.g. HTTP clients could use each other and create duplicate spans. Java Instrumentation API in Otel suppresses nested HTTP client spans if there is already HTTP client span on the context.
It supports other forms of suppression - e.g. by kind. I.e. in the scope of LLM CLIENT, HTTP CLIENT will be suppressed.
Some users prefer to have only logical instrumentations, others want only physical ones or both of them - we want them to be able to pick the layers that they want.
Would that mean the use cases for INTERNAL spans would be really minimal, perhaps reserved for spans within single applications with zero input/output interactions with other systems or for spans for purely internal operations like "mapping a data structure to another", with no connection whatsoever to a possible incoming or outgoing remote call?
I think we don't have an answer to this in the community - https://github.com/open-telemetry/opentelemetry-specification/issues/4179
INTERNAL spans are de-facto used for things like background in-process jobs processing, web framework controller spans - i.e. those that neither server nor client.
I'm also wondering to what extent we need to make up for missing instrumentations at the transport level when instrumenting other layers (e.g. LLM) that build on top of those. Are there guidelines about which attributes to "duplicate" across different levels when defining conventions (e.g. server.address)?
We see in practice that:
As a result, LLM instrumentation cannot rely on HTTP one being there. It should be self-sufficient to some extent which leads to some duplication as you have noticed in https://github.com/spring-projects/spring-ai/issues/1174.
This duplication is essential (but also cheap) on metrics for the reasons I described in that issue. And even on tracing:
error.type
on HTTP and logical LLM could be different (HTTP returned 200, logical operation failed when receiving stream or deserializing content)server.address|port
could be different because of redirects, in-client load-balancing and fallbacks which might not (yet) be the case in LLM worldHaving said this, the only reason to REQUIRE CLIENT kind (vs RECOMMEND it) is to achieve consistency and it's not a super-strong reason.
Are there any case where:
I.e. I'm not really opposed to relaxing MUST to SHOULD (for all conventions), but instrumentations should have some very good reasons not to use CLIENT. The reasons I heard so far seem subjective to me.
One problem I can say is that currently no python probably also no javascript "sdk" I've seen is instrumenting http requests. It may be that they think this is a client and so they don't need to. There are some issues already raised about broken traces. When we water down what client means we add to confusion. We need to rebalance with advice about actually doing the client instrumentation.
I think the concept that all things that could result in client should also be client is extremely fragile, and confusing. Easiest way to break it is have a parent result in a client and a producer span.
Things like dependency link aggregation manifest as clients and while they can walk around reality bugs such as duplicate instrumentation (eg real duplicate like 2 http instrumentation, not one parent is a rest repository and the other an actual client). Basically, before making such decisions a survey of dependency graph aggregation impact is valid, even if they can work around accidental double-clients today. We are considering changing accident to routine.
I feel like this topic has grown its own legs. It is effectively turning all spans that were internal to client, except pure functions. I believe a very wide audience should be considering this carefully, with reality in mind both instrumentation and backend notably aggregation of topology and metrics.
That there's a loophole in the spec that allows one to get to all things client, doesn't mean we should. I think more people on this topic actively, those who have impact when it changes or themselves write backends, need to speak up loudly.
I feel I'm not heard. LLM instrumentations are not expected to instrument HTTP.
It's the responsibility of https://github.com/open-telemetry/opentelemetry-js/tree/main/experimental/packages/opentelemetry-instrumentation-http, https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-httpx and similar.
The CLIENT kind in semantic conventions is not a loophole, but a design decision. Following your concerns I brought it up on yesterdays semconv meeting and today's spec meeting (in the context of https://github.com/open-telemetry/opentelemetry-specification/issues/4179 and https://github.com/open-telemetry/opentelemetry-specification/pull/4178).
There is a general consensus that this is the inevitable direction. If you want to change it, please follow up on the spec.
UPDATE: also there are quite a few existing instrumentations that generate nested client spans. Yes, it creates issues for application maps - https://github.com/open-telemetry/semantic-conventions/issues/674 (and we have not heard from anyone but Azure Monitor that it's a problem 🤷♀️), but what I'm saying that it's technically impossible to know whether you're the last one and should be a client. So backends need to find another way to visualize maps.
When you write a document, you hope that implementations are coherent in the intent of the doc. I have been involved in instrumentation since before CNCF existed, tens years back when everything that wasn't a server was a client. I am aware that http instrumentation is a different responsibility that LLM layer. I do understand deeply what you are saying and am disappointed you feel unheard.
There is a difference between spec and reality. For example, when I entered this SIG I found nothing was compatible, and no one tracking that. The reason I can tell you that currently there are no entrypoint SDKs configuring that http instrumentation is because literally I have tried every one. Some add http attributes to the LLM spans, no one is adding instrumenting of the http spans. Please lets not jump to the conclusion that I expect the LLM span to be also HTTP. I am highlighting that despite writing a spec folks aren't doing this. On the topic of leaving things as client, we should acknowledge the reality and clarify the spec. When we write docs we hope it leads somewhere.. currently there isn't evidence most folks are yet led to instrument http except @ThomasVitale who noticed! In summary, I made a documentation clarification requests.
I don't want to revisit the internal vs client part anymore, or any other derivation of it in this comment even if you ask me.
I would like you to consider that folks are mostly skipping this now, and if we shouldn't add a sentence to improve that. Otherwise, I'll leave this to other people.
From what I see, OpenLit, Traceloop, Langtrace are not supposed to enable HTTP - they run inside user application and it's the end-user choice to instrument and configure HTTP layer. I feel it's important to clarify it.
Happy to learn if you meant something else.
Apologies for the late comment to this issue. At AGIFlow, we initially supported automatic HTTP instrumentation but decided to remove it. We found that users running Azure Durable Functions experienced excessive tracing noise due to the default networking instrumentation, due to implementation details of Azure Durable Functions.
For us, GenAI instrumentation goes beyond observability; it also acts as a data collector for compliance, model improvement, and other purposes. While HTTP instrumentation is valuable, it is not the sole focus of our approach.
We understand that some users may prefer to use other backends for observability. In such cases, we provide optional networking instrumentation when initializing the LLM instrumentation library.
IMHO, the scope of GenAI instrumentation is a bit wider than only for observability; as platforms like OpenLit, Traceloop, and Langtrace also support evaluation and additional functionalities alongside observability and metrics.
One last thing I'd like to add to the conversation is about those use cases where LLM operations are not actually relying on remote calls. Instead, the model inference service can run in process. For example, Python apps using the Hugging Face Optimum library or Java apps using the ONNX Runtime library to enable running models in the same process. Would those operations running fully in-process be "client" or "internal"?
For example, Python apps using the Hugging Face Optimum library or Java apps using the ONNX Runtime library to enable running models in the same process
good question. Do you think all other semantics would apply? Would it be reasonable to still report gen_ai.client.*
metrics?
I assume server.address|port
are not relevant, errors/cancellations could still happen, metrics seem to be the most controversial, but in general, it's still client-side telemetry (as opposed to server-side).
So I'd say that the answer is that INTERNAL would be the most applicable option in this case. And I do think (https://github.com/open-telemetry/semantic-conventions/issues/1315#issuecomment-2272372228) we can relax MUST to SHOULD. In-process model is quite a good reason to violate SHOULD. But I think we should relax it across all relevant semconv.
I added this issue to messaging and db projects and we'll discuss it there as well.
@brunobat what's your take from Quarkus?
I think the GenAI spans should be INTERNAL because they are produced by a lib. Yes, it communicates, but I imagine it will use a client lib of some sort to send/receive data... Only communication clients should use CLIENT.
@brunobat
could you please elaborate:
gen_ai|db|messaging|grpc-over-http.client
metrics be renamed to *.internal
😉 @brunobat
could you please elaborate:
- what would be the problem if both were CLIENT?
Not sure it would be a problem, just a miss labeling. I do think if it is not explicitly involved in the issuing side of a RPC, the span should have the default, INTERNAL SpanKind.
- what should happen if underlying transport instrumentation is disabled/suppressed/sampled-out?
If the instrumentation is disabled, it would behave like any other system, the CLIENT span would not show up. It is possible to have SERVER span without a CLIENT one.
- what if it's disabled, but still propagates current context?
Please define disabled. How would it be disabled?
- [UPDATE] what if library allows you to enable/disable underlying transport instrumentation?
We have that on Quarkus, particular spans will not be generated and in that case, propagation will break because the CLIENT span code is not executed... This is not a sample out, but a non creation.
- [UPDATE2] Should
gen_ai|db|messaging|grpc-over-http.client
metrics be renamed to*.internal
😉
We probably need examples to decide. One thing is sure, if it uses sockets or Netty or is some higher level HTTP or JDBC wrapper who's main concern is communication, it seems ok to use SpanKind CLIENT.
If it does application level logic deciding what to send, it doesn't look like a simple CLIENT, therefore, INTERNAL seems a better fit.
Please define disabled. How would it be disabled?
OTel now supports disabling instrumentation via configuration by tracer name. Instrumentation may do something like
if (tracer.isEnabled()) {
// instrument and run request
} else {
// just inject context from current span and run request
}
We probably need examples to decide.
How would you decide for grpc - in some cases it runs on top of HTTP (.NET) and both layers can be instrumented. Is gRPC INTERNAL there? Is it conditional depending on the implementation details?
I think I'm trying to make a point that CLIENT/INTERNAL distinction is very vague to the point it's not even helpful.
Metrics give a good perspective: db.internal.operation.duration
is not as descriptive and clear as db.client.operation.duration
Area(s)
area:gen-ai
Is your change request related to a problem? Please describe.
The current conventions for GenAI spans require the span kind to be
CLIENT
for GenAI applications interacting with model providers.Interactions with model providers rely on HTTP APIs, for which Semantic Conventions exist already. Production applications typically have already an established HTTP infrastructure in their architecture, which often include also instrumentation and observability already. Client libraries to interact with GenAI model providers can build on top of the existing HTTP infrastructure. For example, an OpenAI Client library would delegate to an existing (and already used in production) HTTP library to perform the actual HTTP calls. When that happens, the HTTP calls are already instrumented and observed.
When instrumenting such a GenAI Client, the result would be a span for the GenAI client and a nested/child span for the HTTP interaction (GenAI => HTTP). The GenAI part is a wrapper around HTTP, and it's probably worth maintaining that distinction instead of mixing up the two different concerns. However, in that case, the child span (HTTP) would be of kind
CLIENT
. And in order to follow the Semantic Conventions, the parent span (GenAI) would also have to be of kindCLIENT
, but that's confusing and possibly not one of the "complex scenarios" mentioned in the OpenTelemetry Traces Spec:Even just looking at the definition for the kind
CLIENT
, it seems wrong having the GenAI span asCLIENT
, even though it's the child HTTP span that ultimately describes "a request to some remote service".Describe the solution you'd like
If I understood correctly and it would not be correct to have two nested spans (GenAI => HTTP) both with kind
CLIENT
, then my suggestion would be to change the requirement for GenAI spans to allow bothINTERNAL
andCLIENT
kinds, letting the implementers decide on which one to use based on the library they are instrumenting.If the library doesn't use an already-instrumented HTTP library, then the kind would be
CLIENT
. Example: OpenAI Python library.If the library relies on an already-instrumented HTTP library, then the kind would be
INTERNAL
. Example: OpenAI Java library in Spring Boot and Quarkus.Once we acknowledge the common scenario with two nested spans in production applications (GenAI => HTTP), it's also worth pointing out that some of the span attributes and metric attributes included in the Semantic Conventions for GenAI would not be used when the GenAI span is of kind
INTERNAL
(e.g.server.port,
server.address
,error.type
), because they would belong to the child HTTP span.Describe alternatives you've considered
No response
Additional context
I mainly work with Java. The two main Java frameworks (Spring Boot and Quarkus) both come with existing instrumentation and observability for the HTTP infrastructure, which is the foundation on top of which libraries like Spring AI and Quarkus LangChain4j are built.
I'm working on instrumenting Spring AI based on the Semantic Conventions for GenAI. Currently, the GenAI spans are of kind
INTERNAL
and the HTTP spans are of kindCLIENT
. Example of trace for a GenAI library that uses an already instrumented (and used in production) HTTP library: