open-telemetry / opentelemetry-go-instrumentation

OpenTelemetry Auto Instrumentation using eBPF
https://opentelemetry.io
Apache License 2.0
421 stars 65 forks source link

Add an opt-in environment variable for experimental instrumentation #785

Open MrAlias opened 2 months ago

MrAlias commented 2 months ago

Instrumentation for packages that do not have a stable output format yet (i.e. semantic conventions are not stable) should not be on by default. It leads users into a false sense of stability.

Instead, this instrumentation should be enabled only if a user sets an environment variable to explicitly opt-in. The environment variable should be prefixed with something like OTEL_GO_EXPERIMENTAL*, OTEL_GO_EXP*, or OTEL_GO_X*. This will better communicate to users the experimental and unstable nature of the instrumentation they will use.

This needs to be accompanied by documentation so a user, when their interest is piqued by the environment variable, can understand the implications of using these experimental instrumentation.

Alternatives considered

MrAlias commented 2 months ago

This will include instrumentation like kafka and database.

damemi commented 2 months ago

Would add that a clear definition of what falls under the experimental flag to help us and users. I think that includes:

Basically if any of these aren't stable, the library would fall under our experimental definition. Or we could discuss which of these need to be met if not all 3

MrAlias commented 2 months ago

I agree that the semantic conventions stability is a hard requirement for something to not fall under the experimental flag.

As for the API and package stability that may be a bit harder. We already depend on non-exported types to instrument packages. Including instrumentation for net/http. There are not guarantees, even for stable packages, that these will not change. If we require stable APIs for our stability we will need to re-evaluate how we instrument things or have very little stability.

That said, I can see an argument for requiring stable packages (instead of APIs). This would help our developer burden with minimized churn as the package evolves.

damemi commented 2 months ago

@MrAlias totally agree, I was just proposing those 3 requirements. I think ultimately if it's not user facing, then it doesn't have to be a requirement for our stability level. If the underlying library changes and breaks us on main, we'll still have compatibility versioning for our published releases.

edeNFed commented 2 months ago

Is this feature implemented in other auto instrumentations as well? If I remember correctly, Java/NodeJS auto instrumentations generate spans for Kafka/databases by default even though the semantic conventions are not stable yet.

pellared commented 2 months ago

If I remember correctly, Java/NodeJS auto instrumentations generate spans for Kafka/databases by default even though the semantic conventions are not stable yet.

Same for .NET. The feedback we got is that all of the automatic instrumentation users prefer experimental features enabled by default. We only mark that something is experimental via docs. See: https://opentelemetry.io/docs/languages/net/automatic/config/. We have an issue to allow changing the default behavior, but even after one year no user asked for it https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/issues/2439.

damemi commented 2 months ago

That's a good point @edeNFed @pellared, I didn't think to consider how other auto-instrumentation agents handle this. I think not deviating from the precedent set by other languages (along with the demonstrated user preference) makes a compelling case to not do this by default.

@MrAlias what do you think? I don't think we talked about how other languages handle this in the sig call last week

MrAlias commented 2 months ago

Is there data showing how users feel when things break? I imagine there are many users that want it on by default, but is there any data showing how do user's feel when those instrumentation libraries change their output and break their dashboards/alerts?

I would like examples of users not minding that their observability systems break due those kinds of changes. I have many examples of the opposite where users are not happy with these kinds of changes.

There are also observability providers who have voiced not wanting to impact customers with breaking changes once adoption becomes large enough. Whether it is stable or not. Which means things become de-facto stable even though they are on unstable semantic conventions.

MrAlias commented 2 months ago

I'm not opposed to following other languages. But I'm skeptical that they have seen the fallout from the negative impact making this choice has caused yet.

pellared commented 2 months ago

I imagine there are many users that want it on by default, but is there any data showing how do user's feel when those instrumentation libraries change their output and break their dashboards/alerts?

I was expecting the same in .NET thus I proposed https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/issues/2439. Surprisingly so far no user really asked for it.

MrAlias commented 2 months ago

I imagine there are many users that want it on by default, but is there any data showing how do user's feel when those instrumentation libraries change their output and break their dashboards/alerts?

I was expecting the same in .NET thus I proposed open-telemetry/opentelemetry-dotnet-instrumentation#2439. Surprisingly so far no user really asked for it.

@pellared have you introduced breaking changes to the telemetry output in any instrumentation?

pellared commented 2 months ago

I see that Java released 2.0 when they changed HTTP sem conv: https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.0.0

MrAlias commented 2 months ago

I see that Java released 2.0 when they changed HTTP sem conv: https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/tag/v2.0.0

Right, they went through the migration for that though. That may not be relevant data given that was not a hard breaking change like the ones we will implement.

As mentioned in the description, migration is also something we discussed. But we also understood that will be a lot of extra development work to migrate all changes while the semantic conventions are not stable.

pellared commented 2 months ago

@pellared have you introduced breaking changes to the telemetry output in any instrumentation?

As far as I remember we the HTTP instrumentation had breaking changes in https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/releases/tag/v1.3.0

The upcoming release is going to have another one because of https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Instrumentation.Http/CHANGELOG.md#181

EDIT: I do not say that we should do the same for Go Automatic Instrumentation. Just feeding us with input.

MrAlias commented 2 months ago

@edeNFed would it be helpful to expose this as an option as well? Something like WithAllInstrumentation?

That way any distribution of this project could make all the experimental instrumentation enabled.

edeNFed commented 2 months ago

@edeNFed would it be helpful to expose this as an option as well? Something like WithAllInstrumentation?

That way any distribution of this project could make all the experimental instrumentation enabled.

Sounds good to me

pellared commented 2 months ago

@edeNFed would it be helpful to expose this as an option as well? Something like WithAllInstrumentation?

That way any distribution of this project could make all the experimental instrumentation enabled.

Do we want a single option to enable all potential experimental features such us experimental detectors, experimental exporters, OTel Profiling and not just instrumentations?

MrAlias commented 2 months ago

@edeNFed would it be helpful to expose this as an option as well? Something like WithAllInstrumentation? That way any distribution of this project could make all the experimental instrumentation enabled.

Do we want a single option to enable all potential experimental features such us experimental detectors, experimental exporters, OTel Profiling and not just instrumentations?

I'm open to other ideas. It was more of a general proposal.

What design do you think would work best?

pellared commented 2 months ago

My personal preference is still to have experimental features as opt-in, but I think users should be able to enable all of them in a easy way.

Maybe WithExperimental option and OTEL_GO_AUTO_EXPERIMENTAL env var (defaults to false)?

pellared commented 2 months ago

I have talked with our Product Manager and I am putting the summary below.

We SHOULD have experimental features/instrumentation disabled by default.

The documentation MUST say which instrumentations produce stable telemetry and which are experimental.

We MUST have env var toggles for each feature/instrumentation so that users can disable them. E.g. https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/main/docs/config.md#instrumentations

We SHOULD have an env var toggle (we can add it even post-GA or someone explicitly asks for it) to control whether by default experimental features should be enabled or not. E.g. OTEL_GO_AUTO_EXPERIMENTAL which defaults to false.

Modified after https://github.com/open-telemetry/opentelemetry-go-instrumentation/issues/785#issuecomment-2064075838

MrAlias commented 2 months ago

I still think the stability guarantees we are going to have to provide on unstable instrumentation out-weigh the benefit to the uneducated user wanting to see lights light up.

We can always change the opt-in prior to the GA if we find users desire different behavior. Going the other way will cause frustration.

I think we should move ahead with the opt-in approach and have it so distros are enable to make a different decision.

akubik-splunk commented 2 months ago

Observations from introducing experimental features to commercial portfolio

  1. If it is on by default customers will expect it should be tested, relatively safe and supported - if it isn't they escalate and not always want to listen to reason
  2. If it is on by default and works for sometime customers become dependent on it and when it changes later priori GA they escalate and claim impact on their experience

In my opinion it should be opt-in with ability to enable all instrumentations or experimental items. It is clear then that this was a decision of the user (conscious or not). Even if that approach does not give them all the fireworks it eventually gives better experience in adoption and business outcomes