open-telemetry / opentelemetry-dotnet-contrib

This repository contains set of components extending functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.
https://opentelemetry.io
Apache License 2.0
422 stars 253 forks source link

[ASP.NET Core] GRPC instrumentation support #1726

Open vishweshbankwar opened 7 months ago

vishweshbankwar commented 7 months ago

Issue

https://github.com/open-telemetry/opentelemetry-dotnet/pull/5097 removes support for grpc instrumentation. This is done to unblock stable release of instrumentation library which will contain support for http server instrumentation. Semantic conventions for grpc is still in experimental state and hence the grpc instrumentation cannot be marked stable. https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md

Workaround

Going, forward the grpc instrumentation will be supported under an experimental feature flag. https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Instrumentation.AspNetCore/CHANGELOG.md#:~:text=Re%2Dintroduced%20support,enabled%20by%20default.

cijothomas commented 7 months ago

Will this be added back in the next non-stable release? Or as experimental feature in stable?

vishweshbankwar commented 7 months ago

The current plan is to bring back in the next non-stable release (Post GA).

cartermp commented 7 months ago

Why not just wait that extra month? There's no rush to push out all the (massively) breaking changes in HTTP instrumentation. I know of zero end-users who even know about this, let alone those who only just read the blog post about HTTP semconv stabilizing in November. And that blog post also only has a tiny number of views despite us trying to get the word out that things were going to change.

vishweshbankwar commented 7 months ago

@cartermp Users have an option to stay on the last released beta version which defaults to older conventions.

cijothomas commented 7 months ago

Why not just wait that extra month?

What benefit does waiting a month gives? gRPC conventions won't still be stable in a month, right?

Maybe we should make an opt-in experimental feature to re-enable the gRPC part, and make it available in the first stable release itself?

cartermp commented 7 months ago

Sorry, I crossed some streams here -- yes, gRPC isn't going stable in a month.

But I guess I don't understand what's going on anymore. If the goal is to stabilize the aspnetcore package, then adding gRPC support back in later makes it not stable anymore? What about new users who want the latest bug fixes (if they exist in the future) for aspnetcore instrumentation, and also want to use gRPC?

cartermp commented 7 months ago

I'll also reiterate that I don't think end-users have any hope at understanding what's happening here

cijothomas commented 7 months ago

If the goal is to stabilize the aspnetcore package, then adding gRPC support back in later makes it not stable anymore?

gRPC support will be under experimental feature flag only. The package will be stable, and offer aspnetcore instrumentation by default. Users who want to get gRPC feature, can opt-in for it, and it can remain opt-in until semantic conventions stabilize.

I think the first stable release should contain the exprimental+opt-in capability to support grpc attributes.

What about new users who want the latest bug fixes (if they exist in the future) for aspnetcore instrumentation, and also want to use gRPC?

Update to the newer versions having the bug fixes. For gRPC, same as my above comment : users can keep the opt-in feature for that.

cijothomas commented 7 months ago

I'll also reiterate that I don't think end-users have any hope at understanding what's happening here

Can't really see how comments like this are useful to anyone! If you are trying to help, please help with specific feedback, that we can address.

cartermp commented 7 months ago

My specific feedback is just hold off on releasing for a moment. If the plan is to reintroduce gRPC instrumentation, why not just build what you're planning to build and release that? There's three points of churn here:

Why do three when you can do two?

cijothomas commented 7 months ago

just hold off on releasing for a moment.

How would that help anyone? The 1st stable release of instrumentations are expected in under 2 weeks, so we should move forward and fix gRPC issue either by putting it under experimental feature flag in the stable release, or in the pre-release to be released on same time as stable.

My personal preference is to put gRPC support back under experimental feature flag, and release it as part of the 1st stable release.

vishweshbankwar commented 7 months ago

just hold off on releasing for a moment.

How would that help anyone? The 1st stable release of instrumentations are expected in under 2 weeks, so we should move forward and fix gRPC issue either by putting it under experimental feature flag in the stable release, or in the pre-release to be released on same time as stable.

My personal preference is to put gRPC support back under experimental feature flag, and release it as part of the 1st stable release.

I agree with this approach if gRPC instrumentation is what's needed then we can introduce an experimental flag OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION (or something like this). Note that users would still have to opt-in for it as opposed to having the instrumentation on by default.

Regarding

HTTP semantic conventions causing breaking changes downstream (these have the potential to be enormous in their own right)

This is an expected churn and was communicated well in advance. We have followed the formal transition plan defined in semantic conventions. All the languages are expected to go through it.

cijothomas commented 7 months ago

@vishweshbankwar Could you start/prepare a PR doing OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION, and bring it to todays SIG call? It'd be good to get it added and released as rc2 asap, and stick with stable release for dec mid.

cijothomas commented 7 months ago

@vishweshbankwar Could you start/prepare a PR doing OTEL_EXPERIMENTAL_ENABLE_GRPC_INSTRUMENTATION, and bring it to todays SIG call? It'd be good to get it added and released as rc2 asap, and stick with stable release for dec mid.

@cartermp Assume the above goes through, does it mitigate some/most/all concerns?

cartermp commented 7 months ago

Yes, thanks! I think that does help alleviate some concerns.

martinjt commented 7 months ago

This is an expected churn and was https://github.com/open-telemetry/opentelemetry-dotnet/discussions/4686 well in advance.

Who is that you're expecting to have seen that issue and catered for it? The thousands of people affected by the issue who are downloading the package from Nuget after watching .NET Conf or one of the numerous talks about OpenTelemetry?

cijothomas commented 7 months ago

This is an expected churn and was #4686 well in advance.

Who is that you're expecting to have seen that issue and catered for it? The thousands of people affected by the issue who are downloading the package from Nuget after watching .NET Conf or one of the numerous talks about OpenTelemetry?

Who is promising that this package is stable, and there won't be breaking change? The nuget is marked pre-release, and the readme.md clearly states its non-stable status. If someone else is falsely advertising the package as stable, what can this repo do? Did .NET Conf announced that OTel Instrumentation libraries are stable? OR are there talks which are falsely announcing OTel Instrumentation as stable?

Its totally understandable that the users would be affected by the breaking changes to this package, but it is inevitable for any non-stable package. If you want to help, please suggest specific, actionable suggestion. Happy to hear more ideas on how to communicate breaking changes more effectively, if you believe people won't read docs or look at changelogs. A few ideas were discussed in SIG meeting today, but they were relatively minor/small changes, except clearly documenting website docs: https://opentelemetry.io/docs/instrumentation/net/libraries/ (that is a separate issue that the docs from this repo and website are not yet merged :( )

cijothomas commented 7 months ago

Yes, thanks! I think that does help alleviate some concerns.

I think you are still seeing more concerns which are not addressed. How can we help bridge the gap?

Today's SIG meeting discussed some ideas to better communicate breaking changes from this repo, pasting below for easy reference. Please add your feedback to the list as well. (This is separate from the overall opentelemetry.io blogs about semantic conventions changes etc., and is focused on what OTel .NET group can do)

_Improve documentation and communication of experimental features Add link to GitHub readme from NuGet readme e.g., https://www.nuget.org/packages/OpenTelemetry.Instrumentation.AspNetCore#readme-body-tab

Add warnings to opentelemetry.io like are found on readmes: e.g., https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#aspnet-core-instrumentation-for-opentelemetry-net

Update install package step in readme to install explicit version https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#step-1-install-package_

martinjt commented 7 months ago

Who is promising that this package is stable, and there won't be breaking change? The nuget is marked pre-release, and the readme.md clearly states its non-stable status.

I think you're fundamentally misunderstanding users here. When something is announced, and demoed at a major Microsoft conference, no amount of "It's in the README on Github" makes it visible to users that you could break their telemetry at any point.

As I said, this is IMPLICIT, at no point during the Demos, or in the docs, does it say that we will remove functionality at any time, as we choose, and if you want to know about that you'll need to be reading the Github issues constantly. Unless I missed that part of the talks that were on .NET Conf?

Maybe look here? https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-configuration?tabs=aspnetcore (which is Azure Monitor specific, but I don't think anyone is naïve enough to suggest that Azure docs aren't used by users to understand Microsoft support).

You seem to be suggesting that noone should be using this library in production, and using the fact that it isn't "Stable" as a catchall for breaking things. That isn't OK. If we don't want anyone using this in production, noone should be demoing it in talks, let alone key Microsoft people in a flagship conference, atleast with explicit caveats that it will break, and that they need to watch the github issues to stay up to date. I get why it wasn't mentioned, because it's not a good look for Microsoft to be bigging it up like that, when it's not supported.

I'm not a maintainer here, I'm the voice of the hundreds, if not thousands, of users of these libraries in production today that I speak to every day. I can help with messaging, I can help you understand the frustrations of users. I can't help if your answer is that this is not stable therefore we can break it any time. If the answer is that anything, at any time, can break, we need more explicit and loud warnings from Microsoft who are demoing this. Perhaps make every method as Experimental in the all the libraries, since that makes it explicit, and forces people to add flags to every library (we could actually then release everything as stable).

To be clear, in other languages, there is no concept of "prerelease" in the way that nuget does it. Therefore upgrades of libraries in the 0.x range can be scoped to minor/hotfix only, which we can't do in .NET. That's the reason this is a more specific problem for us.

cartermp commented 6 months ago

I think you are still seeing more concerns which are not addressed. How can we help bridge the gap?

@cijothomas I do, but it's not specific to this repo / it's not specifically a .NET SIG problem. I do think that bringing the gRPC instrumentation back in (behind a flag) at least does reduce some of the coming churn -- so thank you for doing that.

The HTTP semconv work is massively breaking for a large portion of our customers in several ways and represents an enormous amount of work for them (and us) to migrate. For some customers this is on the order of needing to update thousands of alerts, SLOs, and dashboards -- most people don't use automation like terraform to set these things up, and in several cases this can't even follow an automated process such as terraform due to things like a CONTAINS operator on http.target, which is split rather than name-changed.

And so there's several bad things that can quite easily happen here:

We've done what we can to tell customers what's coming with the information we had, as I'm sure other vendors have, but the reality with enterprise software is that most people don't listen unless (a) something is horribly broken, or (b) they need to renew a contract. And while the OTel community has attempted to communicate things a bit, the reality is that it's been unactionable for end-users all year, there's only been two blog posts about this topic (each with only a few hundred views):

We have also struggled to understand the full impact and when it lands across all of OTel. The work was open, yes, but not terribly discoverable. It's been hard to also audit all language contrib repos and instrumentation libraries to understand what's affected and when we can reasonably expect to see breaking changes land.

cijothomas commented 6 months ago

@cartermp

Thank you for sharing your perspective, which aligns well with my understanding of the issue. If users are dependent on specific telemetry attributes, such as the attribute names etc. in their dashboards and alerts, migration becomes an inevitable and necessary step. Although it's a painful and costly process, it's something we must accept. Semantic Conventions Working Group did offer a migration plan in place to smoothen the process. (Given conventions were experimental, they could have simply made breaking changes, but they still took the effort to offer migration plans, which is really appreciated.) This plan suggested that instrumentations should include a feature flag to opt-in to new, old, or both conventions. After a grace period, the default would be set to follow the new convention. The instrumentations in this repository did adhere to this guideline. Of course, there are other instrumentation libraries too, and it's possible that not all of them have followed these guidelines.

Will do our best to help further. There is already an issue discussing some options, hopefully to further smoothen the process.

Given the challenging nature of preparing a perfect specification or convention on the first attempt, it's inevitable that OpenTelemetry components undergo the cycle of alpha/beta/stable/deprecated. Users who take the risk of using components before they reach stable status, and provide feedback, are making significant contributions to the greater good. If nobody used the components until they were stable, we would never receive any feedback, nor would we be able to make the necessary improvements to reach stability. It's essential to find the right balance between introducing new capabilities and being mindful of not disrupting existing users. Happy to hear more thoughts on improving the current process in this repo (and others).

cijothomas commented 6 months ago

I think you're fundamentally misunderstanding users here. When something is announced, and demoed at a major Microsoft conference, no amount of "It's in the README on Github" makes it visible to users that you could break their telemetry at any point.

I can help you understand the frustrations of users.

I'd like to emphasize that we all share the commitment to understanding and prioritizing OpenTelemetry users' needs. It's a team effort, and each perspective is valuable. If you have any specific feedback or insights regarding this repository, please do share them.

we need more explicit and loud warnings from Microsoft who are demoing this.

As I mentioned earlier, if Microsoft (or any company) falsely promised that OTel semantic conventions are stable, when they were not, then I don't think this repo can do anything about it. I do see such warnings already in their docs. If you have other feedback on Azure/Microsoft docs, please report it at their channels.

You seem to be suggesting that noone should be using this library in production

I never said anything like that. The component's status is mentioned in https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/src/OpenTelemetry.Instrumentation.AspNetCore#aspnet-core-instrumentation-for-opentelemetry-net. It says breaking changes are possible, and if users are willing to accept that, then nothing wrong with using in production!

pyohannes commented 6 months ago

If the answer is that anything, at any time, can break, we need more explicit and loud warnings from Microsoft who are demoing this.

I'll also have to deal with frustration related to this breaking change, however, it's inevitable. For me, a main point is that after this change nothing will break (without a main version increase), so I'm actually looking forward to this happening. Also, I think focusing on what was demoed on .NET Conf is a bit of a red herring, as the main problem lies with users who are using this packages for a longer time and rely on monitoring and alerting workflows that were built around it.

The one thing I'd suggest to make users aware that they'll get breaking changes is to change the main version number, and maybe release a 2.0 of those libraries. Although it's not strictly necessary according to versioning guidelines, it might help to warn users that something major is changing. I'm not sure though about all the implications, likely one would also need to increment the SDK version to keep things in sync.

cijothomas commented 6 months ago

The one thing I'd suggest to make users aware that they'll get breaking changes is to change the main version number, and maybe release a 2.0 of those libraries

Yes! We should be following this guidance, if we ever make breaking changes post the stable release. https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/VERSIONING.md#versioning-details

cijothomas commented 6 months ago

@pyohannes On reading again.. were you referring to releasing the 1st stable version of instrumentations as 2.0, instead of the current plan of 1.6.0? That'd look awkward, especially since we have decided some time ago to synchronize version of the instrumentations with the api/sdk, and making api/sdk to 2.0 will be confusing for users as they have no breaking change warranting major version bump!

pyohannes commented 6 months ago

That'd look awkward, especially since we have decided some time ago to synchronize version of the instrumentations with the api/sdk, and making api/sdk to 2.0 will be confusing for users as they have no breaking change warranting major version bump!

As this is the first stable release, going with 2.0 isn't a hard requirement for me, I just wanted to put out a proposal to possibly mitigate issues brought up here.

However, the logic of strict coupling versions of the SDK and of instrumentations seems a bit problematic, as the question will arise again once there are breaking changes in a stable instrumentation library that don't affect the SDK (hopefully not anytime soon) or vice versa. You'll be forced to either have an awkward main version increase without compatibility break, or you'll have decoupled versions.