open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
2.02k stars 839 forks source link

remove grpc-context dependency from the api package #575

Closed codefromthecrypt closed 4 years ago

codefromthecrypt commented 5 years ago

I understand why opentelemetry has a bias towards google's gRPC library. This codebase primarily borrows from work that started internally, became google's instrumentation-java, later renamed to opencensus, and now mostly copy/pasted from that. The original primary customer was not just gRPC protocol, but specifically google's library. For example, gRPC is the most primary user of the prior work, and it is helpful to have a user to guide.

Time has passed, though, and what started as an end-to-end instrumentation library has split into a Api + SDK approach, as facilitated by the weaving in of OpenTracing. Yet, code opinons towards google's gRPC implementation still exist in the form of a strict dependency in the api package, and helper code.

While it is understandable that alliances exists, and having good integration with Google's gRPC implementation is important, even gRPC itself has options. For example, Armeria has gRPC protocol support without the need of Google's library. We have already encountered problems where other google projects drag in gRPC because of Census. This should stop now. OpenTelemetry as a body of people, sure that can advocate for Google. However, the instrumentation api must not pick winners and losers in RPC frameworks.

Please remove all traces of grpc-context from the api library and move the support functions to a context integration package.

This is a follow-up from here: https://github.com/open-telemetry/opentelemetry-java/pull/571#issuecomment-536193532

codefromthecrypt commented 5 years ago

here are some related issues from grpc-context:

codefromthecrypt commented 5 years ago

It may also not be obvious why removing it from the api package is important besides bloat. grpc-context is not even an optional dependency. strict dependencies not only cause bloat and pin versions, but they also obscure what apis are actually powering the software. This means it hides abstraction leaks. For example, you can't tell if an instrumentation site requires grpc-context or not by looking at its dependencies: you have to scan all of its code.

TL;DR; Currently, anything marked as opentelemetry-api might as well be assumed to only work when grpc-context is present.

$ ./gradlew :opentelemetry-api:dependencies
--snip--

compileClasspath - Compile classpath for source set 'main'.
+--- com.google.auto.value:auto-value-annotations:1.6.2
+--- com.google.errorprone:error_prone_annotations:2.3.2
+--- com.google.code.findbugs:jsr305:3.0.2
\--- io.grpc:grpc-context:1.20.0

--snip--

runtimeClasspath - Runtime classpath of source set 'main'.
+--- com.google.auto.value:auto-value-annotations:1.6.2
+--- com.google.errorprone:error_prone_annotations:2.3.2
+--- com.google.code.findbugs:jsr305:3.0.2
\--- io.grpc:grpc-context:1.20.0
codefromthecrypt commented 5 years ago

See also #576 as the SDK package is far worse

trask commented 4 years ago

@carlosalberto @bogdandrutu @jkwatson I'm worried that we went in the opposite direction of removing the grpc-context dependency in 0.3.0, and it is now part of the public API surface in 0.3.0, with OpenTelemetry API users interacting directly with gRPC Context.

Summarizing my thoughts below about the grpc-context dependency, and I will join the Java SIG meeting tomorrow morning so we can discuss in person.

  1. The dependency on grpc-context makes it less clear what the OpenTelemetry API surface really is.

    For example, the OpenTelemetry API's current span can be manipulated directly via gRPC Context. Is this part of OpenTelemetry API surface?

    For another example, the way to create an OpenTelemetry scope listener currently is by creating a class with a very special name io.grpc.override.ContextStorageOverride. This seems like a very odd public API surface that we are exposing. (and what happens if there is already one of these classes on the classpath?)

  2. The dependency on grpc-context causes a particular problem for the auto-instrumentation project.

    We instrument the OpenTelemetry API itself, so that telemetry sent from an application/library to OpenTelemetry API will seamlessly interoperate with telemetry captured by auto-instrumentation. If gRPC Context is part of the OpenTelemetry API surface, we will have to interoperate with all features (and probably multiple versions) of gRPC Context, in order to keep the "application-space" gRPC Context in sync with our embedded SDK's gRPC Context.

  3. I think we will have better success getting popular libraries to emit telemetry using opentelemetry-api if it does not pull in grpc-context.

    Many libraries are super conservative with introducing new dependencies. I already heard resistance from the Azure SDK team this week when they went to upgrade to 0.3.0 and saw the hard dependency on gRPC Context (see https://github.com/Azure/azure-sdk-for-java/issues/9695). I don't think this will be an uncommon concern.

tedsuo commented 4 years ago

I would prefer this dependency be removed. I believe the gPRC team was open to moving to a shared context dependency. I feel this would be the right move - both projects sharing a stand-alone context package. This was one of the intentions behind the move to separate context propagation from tracing.

The other approach would be to ask the grpc team to take on an opentelemetry dependency.

@bogdandrutu I know you were interested in this, and already had it on your radar. Is now the time?

Oberon00 commented 4 years ago

See https://github.com/open-telemetry/opentelemetry-java/pull/720#discussion_r371776798

trask commented 4 years ago

We should also consider Project Reactor's Context, which is used by Spring WebFlux.

Project Reactor and Spring WebFlux are also very popular projects in the Java world.

So I don't think that gRPC Context is the "clear winner" in the context propagation space, which seems to be the requirement for us to pick an external library for context propagation.

And I love the idea of us working with gRPC, Project Reactor and others to help create a "clear winner" in the Java context propagation space that we can all take a dependency on.

But in the meantime, let's replace gRPC Context with our own Context implementation, and let's see if it's possible to avoid exposing our Context object directly in our public API surface so that we can swap out our Context implementation with a "clear winner" some time in the future.

codefromthecrypt commented 4 years ago

copying in @carlosalberto who seems to be the one who decided to make a hard dependency based on bogdan's advice.

Note: I think the maintainers of this repo should be revisited, as it should really be led by folks who are conservative about things like this by default.

codefromthecrypt commented 4 years ago

by revisit, I mean that whoever is a maintainer acknowledges that they won't add any hard dependency on a 3rd party library for paths required for usage. And especially this point..

when you do decide to do the opposite of a pending issue, you must comment on that issue. Avoiding conflict by not doing that isn't good for the project as it leads to surprise.

jkwatson commented 4 years ago

Hi @adriancole . Newly minted maintainer here, trying to make sure this gets addressed. I wholeheartedly agree that we need to minimize dependencies in the API layer, and that removing grpc-context from the API surface is absolutely our goal. Also, getting rid of grpc-context as a hard API dependency is also needed.

The other desire, I believe, is to have a battle-hardened context implementation in opentelemetry, so that we can launch with it.

Do you have a constructive suggestion on what we should be using for our context implementation?

I can see several options: 1) bring in an existing production-ready, appropriately licensed context implementation into the codebase as source-code, repackaged under an opentelemetry namespace. 2) build a brand-new context implementation into opentelemetry, which isn't production-hardened. 3) design an context abstraction, with no implementation, for the API, continuing to use a wrapped grpc-context as the implementation in the SDK.

I'm sure there are other options. I'd love to hear what your approach would be.

codefromthecrypt commented 4 years ago

@jkwatson thanks for asking! honestly, this is a good turn in itself that you are asking outsiders advice. bravo! Here's some experience from the zipkin brave project.

separate api for sure. you will need to make some decisions, particularly if it is immutable only. only add the apis you need, nothing more. here are some hints

You will need an implementation plug-in that forwards minimally to a stacked thread local, but could also plug into other context storage apis. For example, gRPC has one (so does finagle). Armeria recently designed one also. Note.. some context storage is request scoped!

codefromthecrypt commented 4 years ago

@jkwatson since I have your ear, I'll give you unsolicited feedback on how to make friends with other projects. Some preceding work ripped off designs or even large amounts of code copy/pasted with no thanks or attribution. This really made people mad. When you do make a design or implementation, please in the PR thank them, and also cite somewhere in at least comments if not javadoc or a RATIONALE the inspiration. You can turn this culture around.

jkwatson commented 4 years ago

So, it sounds like your general approach would be to build something new, rather than work with something already battle-hardened?

codefromthecrypt commented 4 years ago

@jkwatson I assume you mean grpc context? It isn't a separate library, and also the feature scope is broader in some ways and more narrow than other context libraries. I think the api surface should be not tied to grpc context, but if the community decide to make that the default implementation they could (as long as the library is extracted, has no deps and doesn't change the JDK requirements). Basically there's still a decision about the api surface area even if you use gRPC by default, there are a lot of api decisions nested in grps, like timeout listeners etc that may not be appropriate here, and you also have to consider most RPC libraries control their own context and are not using grpc underneath.

hope this helps.

codefromthecrypt commented 4 years ago

the best way out is to be diverse in testing.. ex don't only test interop with grpc. test interop with several libraries (ex reactor-netty, armeria, non-rpc). This will show what I'm telling about library bias.

codefromthecrypt commented 4 years ago

here's a really old discussion that might be interesting https://github.com/DistributedTracing/continuation-local-storage-jvm/pull/1

also, if you need more wrenches for the clockwork, ask someone about akka :)

codefromthecrypt commented 4 years ago

Last unsolicited.. I might be wrong, but I think grpc has a bias towards static initial context lookup.

please do not expect everything to be static lookups. this was a problem in previous designs to make everything static registries.. "static only" is not an appropriate solution for DI or modular (OSGi environments) even if it can make sense for agents.

I can't find it now, but @jkschneider had a neat google doc or something on how micrometer allows choice of static or injected registries.

jkwatson commented 4 years ago

@jkwatson I assume you mean grpc context?

No, I was just trying figure out between the 2 BIG options: build something new, or use something that already exists, with proper attribution, etc.

carlosalberto commented 4 years ago

Hey @adriancole

No need to freak out here, as we are still before 1.0 and this issue has been closed ;) But thanks for the additional information. There have been interesting reasons on the advantages and disadvantages of grpc.Context, and we kept the dependency for now, while we figure out the way to go (either keep it or abstract in a specific way).

My initial prototype considered abstracting this part, and my slight personal preference would be to make grpc.Context and independent module/artifact that, so it would become the de facto context solution for Java (something similar to what contextvars is for Python).

In any case, as John suggested we will evaluate our options now. But definitely this will happen before 1.0 ;)

jkwatson commented 4 years ago

Document being used to capture a more dynamic discussion: https://docs.google.com/document/d/1dKXHYH1CQm4IQuRl5NZaoh2vETjRlBfhcsciz_Xs5hQ/edit?usp=sharing

codefromthecrypt commented 4 years ago

pre 1.0 libraries when marketed to masses end up in people's code, even wise people's code sometimes. be careful is all. this is a serious problem.

carlosalberto commented 4 years ago

pre 1.0 libraries when marketed to masses end up in people's code

Disagree on this point, at least for OTel - observe we just released beta a few days ago, and as such, the term beta should signal people they can still expect changes.

Lets definitely focus on the technical discussion. So I suggest commenting on the document, so a decision can be done with all the advantages/disadvantages of each approach taken into account. I have no strong attachment to any alternative, so I'm open to what is agreed ;)

Oberon00 commented 4 years ago

pre 1.0 libraries when marketed to masses end up in people's code

Can confirm. Planning with the initial release dates + some not generous enough buffer can lead to that. EDIT: To be clear, this was not entirely unexpected so it's not that bad; we are not re-exposing any interfaces and are prepared for breaking changes / sticking with old versions for some time.

codefromthecrypt commented 4 years ago

I noticed my notes didn't move into the doc so I added them

Knowing opentracing was pushed with marketing and its final version was 0.33.. and that census also was a dep of grpc and also google's http library also <1.0, I think folks dismissing the concern about 1.0 aren't maybe taking this seriously enough.

codefromthecrypt commented 4 years ago

the google doc seems to have a lot of active doubt that there isn't enough manpower etc to do context. Speaking from experience on a project which has never had more than one fulltime engineer, and most integrations are done by volunteers, I would pretty much say don't limit yourself.

Here's an example of integrating with an existing context using a bridge. there are edge cases around this, pros and cons, what to do if there's no request (ex fallback to a thread local?), but if a volunteer project can do this, so can open telemetry. give it a shot!

https://github.com/line/armeria/blob/master/brave/src/main/java/com/linecorp/armeria/common/brave/RequestContextCurrentTraceContext.java

jkwatson commented 4 years ago

@bogdandrutu @carlosalberto I'd love to have a proposal by the end of this week's SIG meeting on Friday, or at least a plan of action to test various options. Is there additional information/feedback that you'd like before you feel like we're in a good position to make that happen?

andrewhsu commented 4 years ago

From the SIG mtg today, sounds like more analysis after going through the gdoc:

But from the conversations sounds like this issue is super desirable to be solved before GA.

codefromthecrypt commented 4 years ago

I'd like to see an explicit answer on this rather than rumors. If the decision is to depend on grpc-context, even if it is renamed, it will limit integration points in favor of grpc.

That the tradeoffs were listened to and understood should be documented in a RATIONALE that lives in this project if that was the outcome @bogdandrutu (mentioning you specifically as you are the most vocal in having this dependency)

codefromthecrypt commented 4 years ago

This decision is one of the most important decisions there is in this project as it limits or opens applicability in few ways except possibly a JRE version choice could. Propagation is the heart of tracing (and also metrics to the degree context tags are used)

A section in the RATIONALE saying "Here's why we use grpc-context" can reduce frustration, as no one would bother lobbying to decouple it or tentatively writing code wondering what will happen next. Moreover, folks who are told this is the "one api" can get a more honest assessment of where the "one" part starts and stops. This is open source after all, let's be open about this.

Even if the answer is grpc-context or some renaming of it, I presume agents would be less limited by this choice, for example, than tools that cannot affect the class loader (ex those statics). This isn't to say it is 100% wrong, just this is such a huge decision it must be transparent both with the pros and the cons, and not in google doc form.

trask commented 4 years ago

Hey @adriancole! Sorry for the lack of updates here. @bogdandrutu and I have been discussing, and spending time trying to figure this out. I could see this being worth a weekly 30 min meeting for a few weeks while we are working through ideas/options, would you be able to join if we schedule something that is timezone friendly for you?

codefromthecrypt commented 4 years ago

please just update this issue as you have updates, or point to something that is updated that I can watch? I'm not able to commit to synchronous meetings, but I can follow offline updates if it is known where they are.

trask commented 4 years ago

Should have something that's worth reviewing in the next day or two. Will post back here.

trask commented 4 years ago

Please review prototype over at https://github.com/propagationio/context/pull/1

codefromthecrypt commented 4 years ago

@trask you might consider reverting the src tree of the initial commit and then re-raising it as a PR. There are many decisions that predate the first pull request

codefromthecrypt commented 4 years ago

also, in that PR please have someone from grpc review it as it seems to be the idea that grpc would adopt this. anything that implies a new base class would imply its usefulness will be limited to adoptees, which was the problem in the initial discussion mentioned here https://github.com/open-telemetry/opentelemetry-java/issues/575#issuecomment-609169100

the points I've mentioned in the past about a base lib dependency having limitations are worse if no-one depends on it. right now, at least grpc does. get me?

trask commented 4 years ago

@adriancole The initial source tree was mostly done by @bogdandrutu, I believe it is just gRPC minus cancellations, did you see some other decisions that you wanted to discuss? @bogdandrutu is discussing alignment with the gRPC team.

codefromthecrypt commented 4 years ago

I'll try to put this a different way. the code seems lifted from grpc, which itself had years of decisions made. Which decisions are retained into this codebase are themselves decisions.

This project forked opencensus, but copied in its source (allowing review) piecemeal. For example, the "initial commit" here was just a license file. it would seem a similar policy would make similar sense for propagationio which if achieves its goal has an even wider scope than this project.

Also, this is far bigger than just me and bogdan. I like that you are keeping me up to date, but how could any diverse review happen if commits are decided before anyone even knows the project exists?

making more sense now?

trask commented 4 years ago

That's fair.

So maybe it's premature to even have a github repo at this point then.

What I'd like to accomplish initially, is to see if this idea even has legs, which is part technical feasibility and part inter-project discussions (Bogdan has reached out to gRPC and I've reached out to Reactor).

Maybe it's better if I just put code under my own github account for these initial discussions?

How would you suggest going about this?

Thanks, Trask

codefromthecrypt commented 4 years ago

What was done here is a good way to start https://github.com/open-telemetry/opentelemetry-java/issues/575#issuecomment-609169100 You'll notice in the associated PR there's a bunch of chatter. You can ping those in your initial commit PR. I think it would be better to use wherever the repo intends to be as that keeps comment history. That said repos can be moved.

trask commented 4 years ago

Sounds good.

I will reset the repo tomorrow.

Do you think PR #1 should be only what's currently in the initial commit, or ok to also include what's currently in PR #1, in order to give reviewers a better picture of the proposed project direction?

codefromthecrypt commented 4 years ago

my suggestion would be that the pr should include only the most important core functionality (not the project setup which can be a scaffolding commit that's more bikeshed in nature)

There should be a rationale file to explain why the choices are there (helps conserve time of others and also with ESL). the test tree could have a features directory showing why certain things exist. If they don't have a why test, probably not worth putting in the codebase.

Also, I'd hesitate about adding all utility functions (like runnable wrappers) in the first PR for the same reason. Ex executor decorators etc unless they are actually powering something needed in the first feedback round.

In the PR description you can say that for mutual energy conservation, this won't include all features, and maybe just note the api features that are planned to be there by the end of it.

For example, a core assumption is that other libraries might use this. You could add integrations for some common things not just grpc ex https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/features/finagle_context/FinagleContextInteropTest.java

I suspect there are likely less than 5 total apis you'd want on that type anyway.

ex in brave, there's only

  public abstract Scope newScope(@Nullable TraceContext currentSpan);

  public interface ScopeDecorator {
    Scope decorateScope(@Nullable TraceContext currentSpan, Scope scope);
  }

everything else is compositional on those.

hope this helps

trask commented 4 years ago

Still trying here 😆. Check out #1189.

Oberon00 commented 4 years ago

I think this issue is more serious than I initially thought. See my comment at https://github.com/open-telemetry/opentelemetry-java/pull/1281#discussion_r431768732:

I don't think we want enforce versions for all grpc components for all artifacts. But probably this is a (serious) issue that will only be resolved with #575. That is, even though we "only" depend on grpc-context, we force every user of the API to only use grpc versions (including other artifacts) that can work together with our grpc-context version. BTW, if grpc also depends on the proposed "propagation.io" context, we will still have the same problem.

trask commented 4 years ago

Interestingly, the specification is pretty clear on what should be done:

In the cases where an extremely clear, pre-existing option is not available, OpenTelemetry MUST provide its own Context implementation.

jkwatson commented 4 years ago

I have mostly not voiced much opinion on this topic because I have felt that I didn't know enough to have a firm opinion. However, at this point, my opinion is this:

We should have a java-interface-based (non-static-method-oriented) context API. I think it should probably be done outside the project, like the java-context prototype and consumed by this project.

I think it would be a mistake to bake in static-method-only APIs, as they are the most limiting choice you can make and force every user of the API to conform to that model.

There is no answer here that will make everyone happy, or satisfy every use-case, so we should choose the approach that is the most modern java-idiomatic, and that is, without any question, to base the API on a java interface, and thus support any dependency-injection use-cases. grpc-context, or something shaped the same way might be the "easiest" choice, but I don't think it's the right choice for this API.

yurishkuro commented 4 years ago

I agree, in principle. However, the implication here is that methods like tracer.startSpan(), which is currently documented as "make the new span a child of active span if the latter exists", become impossible unless the context object is passed in as an argument.

Oberon00 commented 4 years ago

The context manager could be stored by the TracerProvider.

yurishkuro commented 4 years ago

@Oberon00 it would still limit the implementation to some form of thread-local storage, which is especially annoying to teams who make a choice to use explicit context propagation.

jkwatson commented 4 years ago

Yes, as @Oberon00 says, if we make the storage explicit, rather than implicit, then the SDK can be configured with the storage, and the SDK Tracer Provider can have the storage and provide access to the context for the tracer, as appropriate.

I'm not suggesting that we don't have storage somewhere; just that we eliminate the need to ever have to call Context.current() to get the current context, and instead make it state held (probably) by the tracer/provider.

carlosalberto commented 4 years ago

The extra layer sounds like something nice to play with, but honestly I don't see any general-purpose library that support this, other than grpc-context.

Unless we want to de-facto do this to specifically help Reactor/Armeria applications hook up their context layer when their, uh, applications run. That, or have Zipkin or other APM vendor implement their very own SDK, but I don't imagine this being the case in the near future?