open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0
337 stars 164 forks source link

Error flagging with status codes #136

Closed tedsuo closed 4 years ago

tedsuo commented 4 years ago

This proposal attempts to address error handling while making the least number of breaking changes to our current system. It retains SpanStatus. Other proposals we essentially reinventing it, only in a way that would introduce a number of breaking changes. It allows current instrumentation to continue to work as-is, while providing guidance to instrumentation authors. It includes a way to differentiate between the default errors provided by instrumentation and intentional error flagging provided by the end user.

If accepted, this proposal would replace OTEP #134, and addresses issues #559 and #706.

carlosalberto commented 4 years ago

Overall looks good to me.

tedsuo commented 4 years ago

FYI I have updated this proposal based on feedback. Rather than adding additional status codes to represent user overrides, we add a user_override boolean which indicated that the end user has confirmed that the status code is correct. The expected behavior is that the status code on the span should take precedence over any further analysis.

Please let me know if this solves our remaining issues with error reporting.

tedsuo commented 4 years ago

Also, we will be discussing this proposal at the Errors Working Group tomorrow (thursday) at 8am Pacific. Please join if you can make it.

Details: https://docs.google.com/document/d/1HgUI69rBridFzCXxXuTjQrkG6jb-YcFQnZjPcyBcK1U/

iNikem commented 4 years ago

I beg your pardon. Do you I read the current proposal correctly, that all it does is adding new error_override flag to Span? I don't see how this solves any problem we tried to address.

It does not address open-telemetry/opentelemetry-specification#706. It does not address motivational part of #134 . It does not address part of its own motivation:

However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages.

I think the initial suggestion of this OTEP of reducing/renaming the enum of Span.Status made sense. Current form does not.

tedsuo commented 4 years ago

Ok, third draft is now available. This draft splits the difference between the first and second draft.

Changes:

Thank you all who have been contributing feedback! I am hopeful that this is the final iteration.

tigrannajaryan commented 4 years ago

@tedsuo please fix markdownlint errors.

tedsuo commented 4 years ago

Thanks @arminru. I adjusted the language to clarify the points you raised. Is this good enough to merge?

carlosalberto commented 4 years ago

Merging this as we have enough votes and don't want to decrease the velocity - and as @arminru said, we will iron out the actual details in the Specification.

cc @bogdandrutu

bogdandrutu commented 4 years ago

@carlosalberto not that I am totally against this (I would probably request changes in that case) but as mentioned multiple times, we cannot use the velocity argument to merge changes before GA because that may cost us a lot in the future if we merge half baked changes now to rush things.

carlosalberto commented 4 years ago

@bogdandrutu Oh, definitely. I merged cause this is 'merely' an OTEP, but let's definitely find full agreement when the actual changes hit the Specification (as long as they take).

tedsuo commented 4 years ago

@bogdandrutu: I agree, but to be clear, this was merged because it had more than enough votes, was open for an acceptable amount of time, and there were no further objections.