open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 888 forks source link

Drop support for jaeger-thrift #2859

Closed breedx-splk closed 1 year ago

breedx-splk commented 2 years ago

I'd like to propose that we drop support for the jaeger-thrift exporter. The upstream repo has been marked deprecated and they're actively encouraging folks to use OpenTelemetry instead. They're also refusing security patches.

Furthermore, the docs indicate

Since v1.35, the Jaeger backend can receive trace data from the OpenTelemetry 
SDKs in their native OpenTelemetry Protocol (OTLP)].

Overall, I think this just makes sense and reduces an overwhelming number of choices for end users. Also curious what @pavolloffay thinks.

breedx-splk commented 2 years ago

Related: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/1973

jkwatson commented 2 years ago

We're currently using this at Verta, so I would prefer that we don't drop support for it quite yet.

breedx-splk commented 2 years ago

Yeah I'm proposing that we keep it in the splunk distro until we deprecate the Smart Agent, which won't happen until December, but I figured upstream could get ahead of that (or that we'd need this very discussion 😁 )

jkwatson commented 2 years ago

Yeah I'm proposing that we keep it in the splunk distro until we deprecate the Smart Agent, which won't happen until December, but I figured upstream could get ahead of that (or that we'd need this very discussion 😁 )

👍🏽 It won't be hard for us to switch over to using grpc, rather than thrift for our jaeger exporter; I just need some lead time to make sure it gets on the schedule.

pavolloffay commented 2 years ago

No objections from the Jaeger community.

We recommend people to use Jaeger gRPC and/or switch to OTLP.

anuraaga commented 2 years ago

@pavolloffay I have heard from some users that while gRPC is the standard for Jaeger collector, UDP+Thrift is still idiomatic for Jaeger agent so they try to use it. I guess the jaeger-agent at least doesn't document supporting gRPC

https://www.jaegertracing.io/docs/1.35/deployment/#agent

Is it safe to say that we don't support the Jaeger agent with OTel SDKs and they must switch to using the OTel collector as an agent?

(Note for context, our README mentions "Jaeger over Thrift Over HTTP", but while it may not have been intentional, we support UDP by accepting ThriftSender)

pavolloffay commented 2 years ago

I guess the jaeger-agent at least doesn't document supporting gRPC

Correct, Jaeger agent does not support gRPC nor thrift over HTTP (the thrift exporter in this repo https://github.com/open-telemetry/opentelemetry-java/tree/main/exporters/jaeger-thrift)

I didn't know a sender can be configured https://github.com/open-telemetry/opentelemetry-java/blob/main/exporters/jaeger-thrift/src/main/java/io/opentelemetry/exporter/jaeger/thrift/JaegerThriftSpanExporterBuilder.java#L26 I don't know if people are using this API.

@breedx-splk what is the motivation to remove jaeger-thrift? Is there any technical reason or just drop it because Jaeger clients are deprecated?

anuraaga commented 2 years ago

It may not have been SDK usage, I think I was remembering this ask for support in the javaagent

https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/5821

There does seem to be one OSS usage

https://github.com/xkazxx/jaeger_demo/blob/e4e04be299e923beeb0434a5989775fa264ec4b7/src/main/java/com/xkazxx/jaegerdemo/config/JaegerDemoWebConfigure.java#L27

In general, even though the use of the deprecated dependency isn't ideal, it looks like the final release was made 3 days ago, so it's not an ancient library yet. Without a known security issue, I'd still err towards keeping what we have for now without a more proactive push for removal. A proactive push for removal would probably mean

1) Removing Jaeger Thrift from spec compliance table https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#exporters 2) Documenting that OpenTelemetry doesn't officially support Jaeger thrift in the jaeger exporter spec, or at least that it is fine for SDKs to ignore it. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk_exporters/jaeger.md 3) (Maybe) Jaeger deprecating jaeger-agent officially, or otherwise adding text to their document suggesting that OpenTelemetry users should be using the OTel collector, not the jaeger-agent https://www.jaegertracing.io/docs/1.35/deployment/#agent

I'd be supportive of that happening if someone wanted to drive it through ;)

pavolloffay commented 2 years ago

(Maybe) Jaeger deprecating jaeger-agent officially

There is no plan to deprecate jaeger-agent. At least I haven't heard anything about it in the Jaeger community.

The jaeger-java client was released recently to fix some CVEs.

breedx-splk commented 2 years ago

Correct, Jaeger agent does not support gRPC nor thrift over HTTP (the thrift exporter in this repo https://github.com/open-telemetry/opentelemetry-java/tree/main/exporters/jaeger-thrift)

Right, and that repo depends on jaeger-client which is now formally deprecated and the repo has been archived.

@breedx-splk what is the motivation to remove jaeger-thrift? Is there any technical reason or just drop it because Jaeger clients are deprecated?

Yeah, the main thing is that it's a liability. If something comes up with jaeger-client (like a bug or a security fix or whatever) we (opentelemetry) will have no recourse other than forking the archived repo. It also creates a confusing story for the user -- users who land on the jaeger page that says to use OTLP might get confused by the fact that there's a jaeger exporter and configure that into their app. If the exporter doesn't exist any more, it reduces the chances of that happening and improves the OpenTelemetry as the standard.

The jaeger-java client was released recently to fix some CVEs.

I know, one of those was mine...and it was literally the day before the repo was archived.

anuraaga commented 2 years ago

it reduces the chances of that happening and improves the OpenTelemetry as the standard.

I agree with this :) So listed out steps that we should probably take to remove it from OpenTelemetry completely. I think we just need a volunteer to drive that (hint hint :P)

yurishkuro commented 2 years ago

Any objections if I transfer this issue into OTEL Spec repo?

jack-berg commented 2 years ago

Nope!

yurishkuro commented 2 years ago

I'm not opposed to this overall. But to add to what @anuraaga said (https://github.com/open-telemetry/opentelemetry-specification/issues/2859#issuecomment-1273513761):

breedx-splk commented 2 years ago

@yurishkuro thanks for chiming in on this and being open about it. 👍🏻

  • Should there be some public announcement that this deprecation is happening?

Probably? Did you have something specific in mind and/or can we ask the comms SIG for help?

  • I think there needs to be a waiting period after the deprecation announcement before the functionality can be removed from the SDKs, to give organizations time to migrate (changing an exporter in 1000s of services is not a one-day affair)

Yeah, totally agree. I guess another thing to keep in mind is that users would only be facing this exporter change if/when they were upgrading an sdk or agent, which is maybe also not a 1 day affair. Point being, they're already in there making changes.

I guess the next question is then how long? 6 months? 12 months? Something else?

  • it is not clear to me that simply dropping the UDP exporter is the right move. UDP export mode has its own problems, but also has some benefits (e.g. the SDK does not spam your logs in dev simply because no receiver is running). Overall I think UDP mode has more problems than benefits, but it would be good to do some survey of the end users on how they feel about not having the UDP-to-local-agent export path.

Interesting...I hadn't anticipated log spamming to be a consideration here! 🙃 I wonder what a good mechanism to surveying users could be. It looks like the slack channel #jaeger has a lot of users. As with any sizable install base, there are likely to be folks who are reluctant to change for various reasons and we'll have to balance that.

yurishkuro commented 2 years ago

I would go with a tweet from OTEL account pointing to this issue and asking for votes / comments. Then we can decide about the waiting period. Changing from sending data to a local host agent to exporting to a remote gRPC collector could be a significant architectural change that requires establishing some discovery/lb mechanism that the orgs may not have needed before.

breedx-splk commented 2 years ago

@yurishkuro I have met with the comms sig and now I am working on a blog post and will engage in #jaeger channel in CNCF slack as well.

One minor clarification: This issue was initially scoped to only the thrift based jaeger exporters, but #2858 is really about deprecating all flavors. Since that has a larger scope, I will point users there for comments/feedback.

anuraaga commented 2 years ago

significant architectural change that requires establishing some discovery/lb mechanism that the orgs may not have needed before.

Happen to still be seeing this issue so wanted to add one small point - the idea was to replace a jaeger agent with a sidecar OTel collector I think, not a remote one. Switching to remote would be a huge architectural change and probably blocked in a lot of cases, but I think we were hoping that swapping the local host agent binary would not have anywhere near as much friction. Just in case this helps with the comms @breedx-splk

breedx-splk commented 2 years ago

significant architectural change that requires establishing some discovery/lb mechanism that the orgs may not have needed before.

Happen to still be seeing this issue so wanted to add one small point - the idea was to replace a jaeger agent with a sidecar OTel collector I think, not a remote one. Switching to remote would be a huge architectural change and probably blocked in a lot of cases, but I think we were hoping that swapping the local host agent binary would not have anywhere near as much friction. Just in case this helps with the comms @breedx-splk

Thanks @anuraaga, I appreciate that. I don't have a strong sense of how widespread the implementation of the two architectural models are (jaeger agent vs. "direct" ingest). If users currently leverage a Jaeger agent, I suppose they will probably need to switch to a collector....that is, unless the agent also supports OTLP (it sounds like it doesn't?)

yurishkuro commented 2 years ago

This issue was initially scoped to only the thrift based jaeger exporters, but https://github.com/open-telemetry/opentelemetry-specification/pull/2858 is really about deprecating all flavors.

@breedx-splk I still find it acceptable to deprecate all flavors, we just need to give people more time to migrate away.

breedx-splk commented 2 years ago

@yurishkuro On the otel blog post PR, @austinlparker brought up a good question about cross-posting with the Jaeger blog. Any chance you can assist with that? 🙏🏻

yurishkuro commented 1 year ago

@breedx-splk any responses to the survey?

pavolloffay commented 1 year ago

What exactly is this issue deprecating? Is the intent to deprecate all APIs using jaeger.thrift?

There are a couple of APIs using jaeger.thrift:

If that is the case I would maybe prefer to move these APIs to a separate receiver.

yurishkuro commented 1 year ago

I believe the intent is to only deprecate support in the SDK, not in the backend.

breedx-splk commented 1 year ago

What exactly is this issue deprecating? Is the intent to deprecate all APIs using jaeger.thrift? @pavolloffay As @yurishkuro already covered, the intent is to deprecate it in the SDKs. I expect we'd want to leave it in the collector for a while to ease migration but also to support some specific jager-agent use cases (eg. aggregation, sampling strategy).

@breedx-splk any responses to the survey?

(Sorry, was out last week).

So here are the raw results. We had 15 responses between Nov. 2nd and Nov. 21st.

I'll paste screenshots as of today for each question:

image image

On this scale, 1 is Quick and easy and 10 is Slow and labor-intensive

image image

So then the challenge is to turn these results into a deprecation period. About 1/4 of respondents checked "Longer than 12 months", and all of those rated the difficulty as an 8 out of 10.

My brain says 3 months (to mitigate impending issues with the deprecated dependencies) but my gut says 6 months. If we take "zero" to mean 0.5 months and we take 12+ months to mean 12 months, the average across all responses is 5.6 months. (I added this to the sheet).

Looking forward to what other folks think and how they interpret the results.

yurishkuro commented 1 year ago

What is the downside of keeping support through 2023 and perhaps move to contrib in 2024? Are the implementations of Jaeger exporters cloned into OTEL repos or use original Jaeger repos as dependencies?

breedx-splk commented 1 year ago

What is the downside of keeping support through 2023 and perhaps move to contrib in 2024? Are the implementations of Jaeger exporters cloned into OTEL repos or use original Jaeger repos as dependencies?

I can't speak for every language, but at least Java uses the original jaeger client repos as a dependency. I think that's true of at least a couple other languages too. The downside then is that when a vuln is discovered, the instrumentation will have no simple remediation and will need to reimplement the client functionality.

This is a major motivation for the deprecation in the first place.

breedx-splk commented 1 year ago

Would love to get consensus on a number of months so that we can take #2858 out of draft. What do folks think?

I'm still leaning toward 6 months deprecation, with spec removal in July 2023.

Seems like a reasonable amount of time...and even after that , teams can choose not to upgrade.

breedx-splk commented 1 year ago

Ok, #2858 is ready for review.

EricBuist commented 1 year ago

This is a bad idea as Jaeger all in one doesn't accept GRPC at all. I'm searching since more than half an hour, cannot find any accurate information about the GRPC port. Seems for GRPC, we have to deploy the full Jaeger stack as separate services (collector, reporter, and so on), but there is no simple examples of this. Everything has to be read through large documents and guessed through. Yes, removing support for Thrift in favor of GRPC is a good idea, but Jaeger should do the same before.

yurishkuro commented 1 year ago

@EricBuist yes, all-in-one accepts OTLP, and the docs show how to start it to do that:

https://www.jaegertracing.io/docs/1.42/getting-started/#all-in-one