open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.74k stars 888 forks source link

Clarify how Application Developers are supposed to set the Span status when using Instrumentation Libraries #4131

Open fkoep opened 4 months ago

fkoep commented 4 months ago

What are you trying to achieve?

As an Application developer, I should be able to set the Span Status of Spans created by an Instrumentation Library.

In practice, a Spans lifetime is often tied to a library functions scope. It is not clear how you should gain access to the Span from the calling side or alternatively, how an Instrumentation Library should enable me to influence the Span Status.

Additional context.

In the Span::SetStatus operation (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status), the following line raises questions:

Application developers and Operators may set the status code to Ok.

When using an Instrumentation Library, how are Application developers supposed to do that?

It is stated that:

This means the only way an Application Developer can set the Span Status is by the Instrumentation Library giving him the ability to modify the Span Status while the Span is still open.

However, most often Spans are tied to a functions scope and not accessible by the caller.

Example: When using an instrumented HTTP Client library, an Application can expect 404 errors in certain places. The Application might look up an resource and create it if it does not exist yet:

// Application code
fn get_or_create_picture(id):
    let picture; // a picture resource

    /* check if resource already exists */ 
    let response1 = http.get(url="https://myapi/picture/{id}");
    if response1.http_status == 404:
         /* resource does not exist, create it */
         let random_pic = generate_random_picture();
         let response2 = http.post(url="https://myapi/picture/create/{id}", data=random_pic);
         picture = response2.content;
    else:
         picture = response1.content;

    return picture;

If the first HTTP GET returns a 404, the resulting GET /picture/{id} Span would be marked with an error Span Status by the the HTTP Client library (as per Semantic Conventions for HTTP Spans). We handled that error, so we want to set the status to OK. However, the Span is hidden away in http.get() and cannot be modified by the outer Application logic.

In order for the Application Developer the overwrite this Span Status, the HTTP Client Library would need to accept a parameter for expected HTTP statusses, for which the Span Status should not be set to error:

/* ... */
let response1 = http.get(url="https://myapi/picture/{id}", expected_statusses=[404]);
/* ... */ 

Alternatively, the HTTP Client Library would need to return the still-open Span, so the Span Status can be overwritten by the Application code:

/* ... */
let response1 = http.get(url="https://myapi/picture/{id}");
if response1.http_status == 404:
    response1.tracing_span.setSpanStatus("OK");
/* ... */ 

Is that really how it is supposed to be done? That seems like a major imposition on Instrumention Libraries.

In my pseudocode example, the design of the REST API and Application code is debatable, but the Tracing specification clearly states that setting the Span Status like this is intended usage. Various Github Issues I found suggest the same. However, I cannot find any real-life examples.

Maybe I am overseeing something really simple which could be explained or recommended in 1-2 sentences in the Tracing API specification:

Application developers and Operators may set the status code to Ok. Instrumentation libraries must provide the ability to XYZ...

svrnm commented 4 months ago

Hey @fkoep, we have an open issue that might include this specific use case, can you take a look if this if implemented would solve your issue:

https://github.com/open-telemetry/opentelemetry-specification/issues/1089

fkoep commented 4 months ago

I do not believe so. In the application code, you do not have access to the Span in question, not it's Span Id nor any Attributes, nothing. In a theoretical OnEnd(Span), you would not be able to deduce if the error has been handled.

The issue seems much more fundamental. The entire idea of the Span Status API is for the Application Developer to be able to overwrite the Span Status in the end:

Ok - The operation has been validated by an Application developer or Operator to have completed successfully.

In practice, when using an instrumentation library, the Application developer has no ability to do this in a controlled way, unless the Instrumentation Library provides it for every single instrumented function.

My initial example actually does not matter. The Application developer should be able to overwrite the Span Status of invoked instrumented functions, regardless if the Instrumentation Library set it to error or left it at unset.

It is the most basic usage of the Tracing API, as it seems intended and is specified.

danielgblanco commented 4 months ago

I think what @svrnm was referring to is the possibility of using a Span processor to implement the OnEnd(Span) hook. At that point, you have access to the Span and, it the span was writeable (which it isn't at the moment), you could modify the status.

I can see this being useful to handle spans created by instrumentation libraries. For instance, imagine you want to unset the ERROR status of spans created by an HTTP client instrumentation library when their http.status_code is 4xx, maybe because it's calling a cache proxy and you don't consider those 4xx responses an error. At the moment, this can only be done at the Collector level, but ideally one wouldn't need a Collector processor to do this.

Would this be something that would fulfil your requirements?

fkoep commented 4 months ago

In the OnEnd(Span) hook, how would you figure out which error Span's are to be updated? You would need to deduce this from contextual information of a Span whether it's error Status should be overwritten.

I guess you can hardcode some rules, such as "always overwrite the status of a Span which is the nth child to a parent Span with name 'X'".

// the second lib_function() call can result in an error, which is then handled
@instrument
def my_function():
   lib_function();

   result=lib_function();
   if result.is_error():
       handle_error(result);

   lib_function();
Resulting Tracetree:

my_function       <--- X is "my_function"
|--- lib_function
|--- lib_function  <--- this is the second child, always overwrite `error` status here
|--- lib_function

This becomes even more difficult when you have different possible tracetrees (due to branches and loops) or if you don't actually handle the error in all cases and want to leave the status on error.

I'm feeling a bit crazy here, because I don't think this is a special requirement of mine, but simply a requirement of the Tracing API itself: The Span Status needs to be overwriteable by the Application developer. And if I understand correctly, this is currently only realizable by the Instrumentation Library enabling it (which they probably wont) or by working at the Collector level?

It is regrettable, but it would answer the question. It would mean that in practice, Application developers encounter an error Span on their Observability Platform, fail to get rid of it, and then just categorically overwrite the Status of all Spans with that name, regardless if the error was handled or not.

danielgblanco commented 3 months ago

Thanks @fkoep, I think I got a clearer understanding of the issue now, and of course you're not crazy, this is clearly a valid use case. While https://github.com/open-telemetry/opentelemetry-specification/issues/1089 would allow to update the status of a span, it can only consider the attributes present in a given span to do this, which would have to have been added initially by the instrumentation lib. This may help in many situations, as in my example, but not in others, as you pointed above.

However, as you can see in the discussion in https://github.com/open-telemetry/opentelemetry-specification/issues/1089 making a span writeable after it has been ended has many implications to be discussed. This type of change would also require having access to spans outside of their instrumentation scope, or outside a processor, which is also challenging. It would be a non-trivial change.

While there may be alternative solutions for this, I will defer to the Technical Committee (TC) to continue the conversation and give their perspective on this issue.

svrnm commented 3 months ago

Thanks @danielgblanco and +1 to the point that you should not feel crazy @fkoep, this is a legitimate request, but might not be easy to implement.

An alternative that came to my mind: if you wrap your spans from the downstream library into a parent span you have full control over, would it work to set that parent span with status OK?

P = new span;

my_function
|--- lib_function
|--- lib_function  (generates child span 'C', with status `Error`)
|--- lib_function

P.setStatus('OK')
joaopgrassi commented 3 months ago

Just playing devils advocate here, but in the use-case described, I'd actually prefer to see the 404 span and then the other span creating the resource. When reading this I got the feeling that we are modifying "reality" and hiding the fact that a resource did not exist and had to be created. I'm not sure if allowing such thing is a good idea even.

Of course this varies, but I think this is useful information I'd like to know from my apps.

jsuereth commented 3 months ago

TC Triage: We are currently discussing this issue. We think broad guidance on error modelling in Spans for OpenTelemetry is needed and will guide the future of this request.

danielgblanco commented 2 months ago

@joaopgrassi I can see your point. However, leaving the 404 as an error when the user of that instro lib would not consider it an error becomes more of an issue when tail-sampling is involved. Storing 100% of traces with errors means that somebody using cache proxy style HTTP service would make tail-sampling store a high percentage of traces with low ROI.

One could argue that 404 may not be the right response code there, but the client may not always be able to influence the server to change it. It may be doable with sampling policies to not consider those cases, but it does make config on collectors more complex (either to unset status or to craft a more complex error tail-sampling policy).

fkoep commented 2 months ago

Thanks for the responses! Good to see that it is being discussed.

@svrnm: An alternative that came to my mind: if you wrap your spans from the downstream library into a parent span you have full control over, would it work to set that parent span with status OK?

As far as I am aware, the status of the parent span has no bearing on how the status of the child spans are to be interpreted? Of course, if the OpenTelemetry semantics were to be changed this way, it would solve the problem.

@joaopgrassi When reading this I got the feeling that we are modifying "reality" and hiding the fact that a resource did not exist and had to be created.

Well, you can spin that argument further and say that the OpenTelemetry Tracing API does not allow to accurately model the reality that an error has occured and has been handled.

(Maybe instead of the enumeration values (Unset, Error, OK), something like (OK, Error, ErrorHandled) would have been preferable for this, or having the SpanStatus represented as two booleans {is_ok: bool, has_error: bool} to show the fact that there was an handled error. But this is a more general discussion about the SpanStatus. In the end, (Unset, Error, OK) is what OpenTelemetry went with. And well, you still have the Exception SpanEvents, which can serve the purpose of marking OK spans with handled errors/exceptions.)

I can see the argument that one would like Spans with handled errors still be marked as Error. However, I would add to @danielgblanco's response that depending on how you have set up your Alert rules and system health metrics, you would also have to adjust them to filter out all these expected errors, lest you want to get spammed with error reports.

@danielgblanco

Good point about the tail sampling! Actually, in our case, it is even worse than I have described earlier... We work against a proprietary service we have no control over, and besides the sporadic 404's we get when a resource is deleted, we also need to look regularly look up metadata for each resource. About 10% of resources do not have any associated metadata, which will also be reported by a 404. Should they have designed it that way? Probably not, but we can't change it....

And having >10% of our traces be marked as erroneous makes our error overview useless. In the end, we simply decided to turn off the tracing feature of our REST client library.

lmolkova commented 2 months ago

Discussed at TC meeting:

Ability to change error status on spans based on the application logic is an important part of a wider problem: how applications or other instrumentations can interact with inner instrumentation layers to enrich and customize their telemetry. It affects spans, metrics, maybe other signals.

We want to tackle this problem in general. One of possible solutions could be to have scoped cross-signal callbacks allowing to customize telemetry items and/or classify errors.

See this writeup for more details.

Related:

The immediate mitigation to this specific issue (brought up by @svrnm and @danielgblanco in the comments) is to use SpanProcessor OnEnding introduced in #4024.

It should be possible to identify spans which status should be changed in the callback and change status only for them. E.g. in the example above the filter would be similar to this: span is client and http.status.code is 404 and url.full starts with https://myapi/picture/. It could also be possible to use execution context (async/thread locals) to propagate things from application code to span processor.

jmacd commented 2 months ago

There are some confusing assumptions being made in the thread above. While I agree with @lmolkova there is a larger ssue and potentially room for improvement of the tracing interfaces, I find the assertions about "error traces" problematic.

@fkoep I understand you have a scenario where you have an inner span returning 404s, which makes it look like 10% of your traces are "errors". However, as you point out, your application is able to handle those errors and so the parent span does not have an error. I believe you should be looking at the root span of a trace for your tail sampling decision, and then you will only sample true errors. Otherwise, I'm not sure how you're applying the logic of "tail sampling": usually this refers to a decision made about a whole trace, and there is no reason you need to inspect the inner 404 error spans while making your tail sampling decision. Therefore, it seems the way to resolve this problem is simply to stop interpreting inner span errors as errors: let them propagate to the root span and then consider them errors.

lmolkova commented 2 months ago

+1 to what @jmacd said - protocol level failures (even when they are real errors) are a normal part of network communication that are usually retried and recovered from - they should not be used alone.

Tail-based sampling and alerts would be more efficient and accurate if built around server spans (or something that application reports about itself) rather than client communication.

fkoep commented 2 months ago

@jmacd

There are some confusing assumptions being made in the thread above.

What assumptions do you mean? To consider a trace as "erronous" when it contains an error span?

We've been using a Jaeger-based Observability Solution, so maybe that's where this interpretation comes from. (In Jaeger, any trace which contains an error span is being visualized in red color. And when debugging an issue, it's common practice to just search with a "error=true" filter across all spans of all traces.)

I'm not sure how you're applying the logic of "tail sampling": usually this refers to a decision made about a whole trace, and there is no reason you need to inspect the inner 404 error spans while making your tail sampling decision.

From different docs, it seems like sampling based on the span status of non-root spans is common practice:

https://opentelemetry.io/docs/concepts/sampling/ Some examples of how you can use Tail Sampling include:

  • Always sampling traces that contain an error

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor#a-practical-example Rule 5: Sample all traces if there is an error in any span in the trace.

Not to throw any shade on your post, you are providing a valid alternative approach. I'm just not sure which assumptions you are contending.


To your approach, I see difficulties with reporting non-fatal unhandled errors in subroutines which don't propagate up.

A "fatal unhandled error", I would expect to propagate all the way up to the Root Span, as it fits the way the code is structured, too.

With "non-fatal unhandled error" (maybe there is a better term for it) I mean an error which occurred in some subroutine, but for which the application switched to some fallback behavior. The error was not really handled, it was just non-fatal. You still want an error span emitted, the trace sampled, and for a human to look at the issue. However, in the code no information about these non-fatal errors is propagated all the way up to the Root Span location.

I'm not sure how common this is in other code bases, in ours it it rather common. We sit between different external services and if some minor things are not working as expected, we rather just log an error and continue instead of halting our entire service.


@lmolkova Oh yeah, you can make use of SpanKind in the tail-sampling policies! That's a good point.

protocol level failures (even when they are real errors) are a normal part of network communication that are usually retried and recovered from - they should not be used alone.

So summarized: Rather than overwriting any SpanStatus, you would recommend leaving the CLIENT-Spans as-is and only sampled based on the status of the higher-up INTERNAL/SERVER-Span? Do i understand that correctly?

That would seem to be the solution to our personal issues. Though now I'm a bit troubled to understand in which practical cases you would ever overwrite a SpanStatus from ERROR to OK, as this issue seemed to be the model case for that.

pyohannes commented 1 month ago

Ability to change error status on spans based on the application logic is an important part of a wider problem: how applications or other instrumentations can interact with inner instrumentation layers to enrich and customize their telemetry. It affects spans, metrics, maybe other signals.

I'm not sure changing the error status on existing spans is a viable solution. An application might call to a different service, if an error is returned, the calling service might be able to handle that. However, there's no way to change the status of the remote span(s), which are errors.

Errors can happen in some contexts, and be handled in other. That should be modeled in traces. The notion that all traces with errors are problematic is quite deeply engrained in many current tools and workflows, and I would argue that at least for synchronous workflows one should move away from that notion.

However, as you point out, your application is able to handle those errors and so the parent span does not have an error. I believe you should be looking at the root span of a trace for your tail sampling decision, and then you will only sample true errors.

I agree with that, with the caveat that things are a bit more complicated for asynchronous workflows.