open-telemetry / opentelemetry-rust

The Rust OpenTelemetry implementation
https://opentelemetry.io
Apache License 2.0

OpenTelemetry Tracing API vs Tokio-Tracing API for Distributed Tracing #1571

Open cijothomas opened 6 months ago

cijothomas commented 6 months ago

Background

The Rust ecosystem has two prominent tracing APIs: the OpenTelemetry Tracing API (OTel for short), delivered through the opentelemetry crate, and the Tokio tracing API, provided by the tracing crate. The OTel Tracing API adheres to the OpenTelemetry specification, ensuring alignment with OpenTelemetry Tracing implementations in other languages such as C++ and Java. Conversely, the Tokio tracing ecosystem, which predates OpenTelemetry, boasts widespread adoption, with many popular libraries already instrumented. The tracing-opentelemetry crate, maintained outside of the OpenTelemetry repositories, acts as a "bridge", enabling applications instrumented with tracing to work with OpenTelemetry.

The issue

The coexistence of the OTel Tracing API and Tokio-Tracing poses a dilemma, forcing end users to choose between two competing APIs. The decision is harder because there is no comprehensive documentation comparing the two options. A significant concern is the lack of tested interoperability between the APIs, which can cause issues, especially in applications where different layers use different tracing APIs, potentially leading to incomplete traces. This also impacts log correlation scenarios.

A Comparison with OTel .NET

The OpenTelemetry .NET community encountered a similar challenge when the OTel Tracing API was introduced, as the .NET runtime library (shipped as the DiagnosticSource package) already had a similar API in place. This issue was resolved through collaboration between OTel .NET maintainers and the .NET runtime team, leading to the alignment of the .NET runtime's tracing API with the OTel specifications. This approach was later applied to the Metrics API as well. While the decision by OTel .NET to prioritize the .NET Runtime library's API over its own for tracing/metrics has generally been successful, it has not been without its challenges. Despite declaring stability years ago, OTel .NET has yet to implement certain aspects of the OTel specification fully.

Although the outcomes in the .NET ecosystem might not directly forecast the success of similar efforts in Rust, they provide a valuable reference point.

Options for Consideration

  1. Deprecate Tokio-Tracing: This approach would align Rust with the OpenTelemetry strategies adopted by other languages. However, considering the popularity and active maintenance of the tracing crate in the Rust ecosystem, this path has the highest friction and is highly improbable.

  2. Deprecate OTel Tracing: Promoting Tokio-Tracing as the standard could be a feasible option, albeit one requiring comprehensive evaluation. This strategy would cause OTel Rust to deviate from its counterparts in other languages. Aligning Tokio-Tracing with the OTel Tracing specification could mitigate this concern, but it requires groundwork to identify gaps and propose solutions. Tokio-Tracing maintainers have shown willingness to accommodate reasonable changes, pending a clear set of requirements. This option does not eliminate the OTel Tracing API completely: it would remain to cover things missing from Tokio-Tracing, and only those APIs that overlap or compete with Tokio-Tracing would need to be deprecated or removed.

  3. Maintain Both APIs: This alternative emphasizes the importance of ensuring seamless interoperability between the two APIs, allowing users to choose based on preference or specific needs without compromising trace completeness. Achieving this goal requires significant effort to identify and bridge any existing gaps in the interoperability story. Users should be able to choose freely between the two without worrying about broken traces.

  4. Do nothing: OTel Rust has made some special accommodations to help the tracing crate (and vice versa). We could simply remove them and let each crate follow its own path. (A highly undesirable state, listed only for completeness.)

Are there more options? Please let us know in the comments!

Current State

The Rust tracing ecosystem is at a critical juncture. Active discussions between the OTel Rust team and the Tracing Rust team are taking place, with updates and deliberations shared on Cloud Native Slack. Interested individuals are encouraged to join the discussion on Slack (or right in this GitHub issue). All decisions and considerations will be posted on GitHub as well for wider visibility and to gather feedback.

Timeline

Resolving this issue is a prerequisite (though not the only one) for declaring the Tracing signal as GA (General Availability) for OTel Rust. Given the goal to achieve Tracing GA (alongside other milestones) soon, it's crucial that this issue is resolved promptly. A tentative deadline to reach a decision on the chosen path forward is set for April 30th, 2024, approximately 2 months from today.

Related issues

#1378 Tracing Propagation.

https://github.com/open-telemetry/opentelemetry-rust/pull/1394#discussion_r1406501176

Broken Trace example: https://github.com/open-telemetry/opentelemetry-rust/issues/1690

cijothomas commented 6 months ago

Tagging @open-telemetry/rust-approvers, @jtescher as tracing-opentelemetry maintainer, and @davidbarsky as tracing maintainer.

TommyCpp commented 6 months ago

If we were to use tracing as the API, this is the deviation between the existing tracing API and the OTel tracing API

hdost commented 6 months ago

From the metrics perspective exemplars are also something to take into account.

hdost commented 5 months ago

As requested in the community meeting: I am partial towards Option 2. Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would like to say that we should probably try to see if it's not possible to improve the inter-compatibility, as people will still try to use it directly.

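Questions that are open from my perspective:

  • What are the minimal set of features we feel like we'd want to maintain?
  • Are the tokio team amenable to supporting such features?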

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

ramosbugs commented 5 months ago

I am partial towards Option 2. Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

As a heavy user of direct OpenTelemetry instrumentation (e.g., using SpanBuilder a lot, along with span events and span links) in the backend of my AWS Lambda-based web app, I'm nervous reading this. There are almost 800 calls to set_attribute alone in my codebase, and moving off of a deprecated API used this heavily would be a major undertaking.

Which interfaces specifically would be deprecated?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

I suspect a lot of other OpenTelemetry users are also doing so in private repositories, so I agree that it's hard to measure. I would caution against inferring much from these public GitHub usage stats.

TommyCpp commented 5 months ago

Which interfaces specifically would be deprecated?

We don't know exactly yet. The idea is to bridge the gap between tracing and the OTel spec using a custom API, and where tracing already supports something, to deprecate the OTel equivalent in favor of it. But we are still debating ideas and nothing has been decided yet. Thus, any feedback is greatly appreciated!

lalitb commented 5 months ago

I vote for option 2, as there are challenges with other options:

Going with Option 2, we also need to evaluate introducing an extension API within OpenTelemetry. This is to effectively bridge the existing gaps between the OTel specifications and Tokio-Tracing's functionality (e.g., Baggage support, Propagators).
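To make the "Propagators" gap concrete, here is a minimal, std-only sketch of what a trace-context propagator does: it writes the current span context into outgoing headers and reads it back on the receiving side, using the W3C `traceparent` header layout. The `SpanContext` struct and the `inject`/`extract` functions are hypothetical names for illustration; they are not the types or signatures of the opentelemetry or tracing crates.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified span context (not the `opentelemetry` crate's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct SpanContext {
    trace_id: u128,
    span_id: u64,
    sampled: bool,
}

/// Write the context into carrier headers using the W3C `traceparent` layout:
/// version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags.
fn inject(cx: &SpanContext, headers: &mut HashMap<String, String>) {
    let flags: u8 = if cx.sampled { 1 } else { 0 };
    headers.insert(
        "traceparent".to_string(),
        format!("00-{:032x}-{:016x}-{:02x}", cx.trace_id, cx.span_id, flags),
    );
}

/// True if `s` is exactly `len` ASCII hex digits.
fn is_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Read the context back; returns None for a missing or malformed header.
fn extract(headers: &HashMap<String, String>) -> Option<SpanContext> {
    let value = headers.get("traceparent")?;
    let parts: Vec<&str> = value.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    if !is_hex(parts[1], 32) || !is_hex(parts[2], 16) || !is_hex(parts[3], 2) {
        return None;
    }
    let trace_id = u128::from_str_radix(parts[1], 16).ok()?;
    let span_id = u64::from_str_radix(parts[2], 16).ok()?;
    let flags = u8::from_str_radix(parts[3], 16).ok()?;
    // All-zero ids are treated as invalid.
    if trace_id == 0 || span_id == 0 {
        return None;
    }
    Some(SpanContext { trace_id, span_id, sampled: flags & 1 == 1 })
}
```

Whichever API produces the span, something has to own this inject/extract step at process boundaries, which is why it keeps coming up as a piece that would stay on the OpenTelemetry side under option 2.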

lalitb commented 5 months ago

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Direct consumption of opentelemetry-api could be for traces, metrics and logs, and I agree it is really hard to get the actual statistics for "traces" only :)

julianocosta89 commented 5 months ago

OpenTelemetry comes from the merger of OpenCensus and OpenTracing. The deprecation of those two parent projects took some time, but it happened.

IDK if I have a say because I don't maintain OTel Rust, but I'd vote for Option 1 and invite the maintainers of tokio-tracing to join the OTel project as maintainers/approvers. Basically, they would continue what they are doing, but under the CNCF umbrella with OpenTelemetry as the main project.

IDK how much tokio-tracing follows the OTel specification and semantic convention, but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust.

I'm biased but I see OTel as the future for Observability signals.

lalitb commented 5 months ago

Tagging for more input: @hawkw and @davidbarsky as tokio-tracing maintainers, and @jtescher as tracing-opentelemetry and opentelemetry-rust maintainer.

TommyCpp commented 5 months ago

but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust

Just to provide context here: I think if we move to tracing as the API for OTel Rust, we could unify all 3 signals under tracing.

jtescher commented 5 months ago

I haven't had much time recently to work on open source, but my perspective is that option 3 is likely optimal in the near term. I suspect that expressing the full otel API via tracing would be difficult and likely require some changes to the underlying library which would be orthogonal to their current designs and goals (trace ids on span creation, metrics in general, etc). It may be possible to express them via a large range of special fields but it seems likely that it would be worse than the current two API confusion. Someone could do a proof of concept trying to unify them though to be sure.
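As a rough illustration of the "special fields" idea, the std-only sketch below maps reserved `otel.`-prefixed field names onto span properties and treats everything else as a plain attribute. It is modeled loosely on the reserved-field approach used by bridging layers; the exact set of field names, the `OtelSpanData` struct, and the `from_fields` function are assumptions made for this example.

```rust
/// Hypothetical flattened view of what an exporter would need from a span.
#[derive(Debug, Default, PartialEq)]
struct OtelSpanData {
    name: String,
    kind: Option<String>,
    status_code: Option<String>,
    attributes: Vec<(String, String)>,
}

/// Interpret string fields recorded on a span: reserved `otel.*` keys
/// override span properties, all other keys become ordinary attributes.
fn from_fields(span_name: &str, fields: &[(&str, &str)]) -> OtelSpanData {
    let mut data = OtelSpanData {
        name: span_name.to_string(),
        ..Default::default()
    };
    for (key, value) in fields {
        match *key {
            "otel.name" => data.name = value.to_string(),
            "otel.kind" => data.kind = Some(value.to_string()),
            "otel.status_code" => data.status_code = Some(value.to_string()),
            _ => data.attributes.push((key.to_string(), value.to_string())),
        }
    }
    data
}
```

Even in this small model, every OTel concept becomes a stringly-typed convention that a typo can silently turn into an ordinary attribute, which hints at why a large range of such fields could end up more confusing than having two APIs.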

Option 3 could be done via clearer purposes for each API (e.g. low level "full" api via otel, or high level "limited but ergonomic and user-friendly" api via tracing macros) and examples of suggested architectural patterns (e.g. otel API "between" application boundaries, tracing within applications and crates, or similar sets of suggestions). But as already mentioned here it is somewhat more cumbersome and confusing than a consistent single API. Being not fully in control of the otel spec or the log/tracing ecosystems means the rust otel stack finds itself somewhat stuck in the middle.

davidbarsky commented 5 months ago

I'm on vacation, so I'll be brief and try to expand next week/summarize my thoughts from Slack: pulling a .NET (paying attention to the intent, not the letter of the spec) is very much possible, down to the fact that propagators remained in a dedicated OTEL library for 2.5 years. I think tracing could have a native notion of propagators, but I don't think we have the bandwidth to figure that out/I'd rather wait for Tower to reach 1.0 before making decisions on that front.

cijothomas commented 5 months ago

down to the fact that propagators remained in a dedicated OTEL library for 2.5 years

In practice, this is still the case! So is Baggage. OTel .NET still maintains the API for these things, that are not covered by the .NET Runtime API. If we go with option2 here, I'd expect that we'll only eliminate those APIs for which there is a clear equivalent in tracing.

I'll also be on vacation for ~1 week. Once back, I'll write down more details on what option 2 could look like. I didn't want to spend too much time exploring any of the options without first seeing which one the community as a whole leans toward. There are no clear winners so far, but part of the reason could be the lack of specifics on what each option would really entail.

I'm not yet in a position to strongly support any option so far, however, I'll take a stab at exploring option 2 further.

hdost commented 5 months ago

I guess I'll try to take a look at how we could go for option 3.

From the top of my head use cases to look at:

Then some variant of the two where both tracing and Otel are used for instrumentation.

Those will be "advanced cases", but honestly it might be more common than one might think.

cijothomas commented 5 months ago

Comment/Discussion from Community Meeting for option3:

Test to validate the option3 A -> B -> C

A - uses tracing for producing span
B - uses otel tracing api for producing span
C - uses tracing for producing span

3 spans:
SpanA
SpanB (parent=SpanA)
SpanC (parent=SpanB)

It may not be feasible to ask users to use same api for all 3, as they may not own/control some of them. eg: B could be reqwest crate.
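The A -> B -> C check can be modeled without either crate. The std-only sketch below assumes each API tracks its own "current span" unless the two share a single context slot; `Runtime`, `Api`, `in_span`, and `run_chain` are hypothetical names for this illustration, not APIs from tracing or opentelemetry.

```rust
/// Which instrumentation API produced a span (labels only, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Api {
    Tracing = 0,
    Otel = 1,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    name: &'static str,
    id: u32,
    parent: Option<u32>,
}

/// Hypothetical model: one "current span" slot per API, or one shared slot.
struct Runtime {
    shared: bool,
    current: [Option<u32>; 2],
    next_id: u32,
    finished: Vec<Span>,
}

impl Runtime {
    fn new(shared: bool) -> Self {
        Runtime { shared, current: [None, None], next_id: 1, finished: Vec::new() }
    }

    /// Slot holding the current span for `api`; slot 0 is used by both when shared.
    fn slot(&self, api: Api) -> usize {
        if self.shared { 0 } else { api as usize }
    }

    /// Run `body` inside a new span whose parent is the current span seen by `api`.
    fn in_span(&mut self, api: Api, name: &'static str, body: impl FnOnce(&mut Runtime)) {
        let slot = self.slot(api);
        let parent = self.current[slot];
        let id = self.next_id;
        self.next_id += 1;
        self.current[slot] = Some(id);
        body(self);
        self.current[slot] = parent; // restore the previous current span
        self.finished.push(Span { name, id, parent });
    }
}

/// A (tracing) calls B (otel), which calls C (tracing); spans are returned in id order.
fn run_chain(shared: bool) -> Vec<Span> {
    let mut rt = Runtime::new(shared);
    rt.in_span(Api::Tracing, "SpanA", |rt| {
        rt.in_span(Api::Otel, "SpanB", |rt| {
            rt.in_span(Api::Tracing, "SpanC", |_| {});
        });
    });
    rt.finished.sort_by_key(|s| s.id);
    rt.finished
}

/// Look up the parent id of the span with the given name.
fn parent_of(spans: &[Span], name: &str) -> Option<u32> {
    spans.iter().find(|s| s.name == name).and_then(|s| s.parent)
}
```

In this model, with a shared slot the chain comes out as SpanA, SpanB (parent=SpanA), SpanC (parent=SpanB). With separate slots, SpanB has no parent and SpanC attaches directly to SpanA, which is the kind of broken trace the test is meant to catch.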

https://github.com/open-telemetry/opentelemetry-rust/issues/1378#issuecomment-1815168635 shows an example where logging and tracing (distributed tracing, aka spans) are used, and correlation is broken when the tracing crate is used to produce the span instead of the OTel tracing API.

cijothomas commented 4 months ago

Took some time to get to this due to other priorities, but here are more details on one possible way to go with option2, including a prototype: https://github.com/open-telemetry/opentelemetry-rust/issues/1689

diurnalist commented 4 months ago

👋🏻 I am not a Rust developer so am coming from a very different perspective. My take is that, to my knowledge, every other ecosystem has opted for Option 1 long-term, Option 3 near-term. Specifying the API in OTel was (I assume) a large effort and we have seen the API evolve as developers have battle-tested it and provided feedback (e.g., lack of a synchronous gauge instrument, which is now in the spec.) My impression is that spec evolution is a pretty collaborative process, which is nice to observe.

In my view it would be a mistake to align on pre-existing instrumentation conventions, as OTel's mission has been to provide a standard API that instrumentations across languages/systems can adhere to. This is particularly important as it provides a path for libraries to provide instrumentation hooks to, e.g., automatically generate traces and metrics as part of their own business logic, kind of like bpf kernel tracepoints or USDT. And those hooks are written according to a wider specification and hence less vulnerable to governance issues that tend to come up in external libraries from time to time.

In the Go OTel SDK there are several "bridge" interfaces that help to close the gap b/w the OTel API and existing instrumentation libraries, e.g., the opencensus bridge. Perhaps this would be a way to pave the path towards wider OTel API adoption.

/$0.02 🙇🏻

cijothomas commented 4 months ago

As requested in the community meeting: I am partial towards Option 2. Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would like to say that we should probably try to see if it's not possible to improve the inter-compatibility, as people will still try to use it directly.

Questions that are open from my perspective:

  • What are the minimal set of features we feel like we'd want to maintain?
  • Are the tokio team amenable to supporting such features?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

@hdost After re-reading this, I am not entirely sure I understand the part where you said "I mentioned during the meeting we will still need to have "some" API anyway". I think @TommyCpp also mentioned this (in the metrics context, though).

If you look at the prototype, it has a tracing SDK only! No tracing API, i.e., there is no span/span.start()/end() etc. We'll still need the opentelemetry crate itself, where we expose APIs for things not covered by tokio-tracing, e.g., Baggage, Propagators, Metrics, LogBridge etc. https://github.com/cijothomas/opentelemetry-tracing/blob/main/src/opentelemetry_sdk.rs

Could you check this? We can discuss in the next SIG call and figure out where the gaps in our understanding are.

mladedav commented 2 months ago

I've stumbled upon this issue about an issue with tracing and OpenTelemetry API compatibility. I didn't test if it is still current, but it might be another data point to consider.

Just checking, was some kind of decision made anywhere? The original timeline aimed for end of April which has already passed, but I didn't attend the weekly calls if it was discussed there.

cijothomas commented 2 months ago

I've stumbled upon this issue about an issue with tracing and OpenTelemetry API compatibility. I didn't test if it is still current, but it might be another data point to consider.

Just checking, was some kind of decision made anywhere? The original timeline aimed for end of April which has already passed, but I didn't attend the weekly calls if it was discussed there.

It's similar/same as https://github.com/open-telemetry/opentelemetry-rust/issues/1690 ?

No decision is finalized. My initial exploration of option 2 is at https://github.com/open-telemetry/opentelemetry-rust/issues/1689 and we also discussed some ideas in yesterday's community call. @TommyCpp is further exploring this approach to come up with a list of issues (along with severity: nice-to-have vs. blockers) that we need to discuss with the tracing owners.

We are really short on manpower, especially people with experience in tracing + opentelemetry side. If you can also help, that'd be super helpful!

mladedav commented 2 months ago

It's similar/same as #1690 ?

You're right, it seems to be the same issue.

I'll see if I can spare some time but honestly I don't have that much experience with OpenTelemetry itself.

cijothomas commented 2 months ago

It's similar/same as #1690 ?

You're right, it seems to be the same issue.

I'll see if I can spare some time but honestly I don't have that much experience with OpenTelemetry itself.

See if you can join the community meetings. It is 9 AM PT (Tuesdays). If the timing does not work, happy to discuss in separate calls. (you can reach out to me/other maintainers on slack/discord as well)

This is the most foundational problem that needs to be resolved in this repo, but unfortunately it is a somewhat hard problem, and we also lack manpower :(

cijothomas commented 1 month ago

Update from July 30 OTel Rust Community Meeting:

We recognize it’ll be a while before this can be fully sorted out. We continually see issues: upgrades are hard, otel-demo is broken, users are unsure which versions are compatible, and the list goes on.

To mitigate the short/medium-term pain while staying not too far from the long-term plans, it was decided to offer tracing integration in the OpenTelemetry SDK itself, under a feature flag. This feature will recognize tracing Spans without the need for bridging via tracing-opentelemetry. In short, a lot of functionality from tracing-opentelemetry gets absorbed into the SDK itself. This is something that'll be needed in options 2 and 3 anyway, so it does not drift far from options 2 and 3, which are currently the leading candidates.

Note that this does not support interoperating both APIs for spans - either use tracing or otel tracing api, but mixing them up won't work. If option 3 is settled on, then this will need to be solved, but not part of the immediate release.

This does not deprecate tracing-opentelemetry (We don’t own it to deprecate), but once the above is ready, tracing-maintainers can decide if they want to deprecate tracing-opentelemetry. And this does not deprecate otel-tracing either.

@TommyCpp will make the above happen and we are targeting to include it in the next release (~Aug 30)

cijothomas commented 3 days ago

This work is delayed, and won't be part of the coming release (expected in a day). Will post new ETA for this soon.