Closed: jack-berg closed this 7 months ago
edit -- my reference to merged config files is in the sense that tooling exists to merge yaml config together, I presume that something to accomplish this will be part of the tool chain for managed environments
It would be good to hear from platform providers as to how feasible it is to insert files and tooling into user environments. I suspect it is too difficult or restrictive for some environments, but if that solves the problem, I would agree that it would be a fine solution.
Either way, as long as we know that this feature is technically compatible with the design being proposed, I don't think we need to hold up the config SIG while we investigate and settle on a solution for this problem.
As to choosing the env vars, if the explicit goal is platform support, that provides a clear way to make a decision, so I think it could avoid most of the bikeshedding.
More details here
That's option 2d by the way. And overriding YAML files with another YAML file is optional, not mandatory to implement it.
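Overriding one YAML file with another, as mentioned above, amounts to a recursive merge of two parsed documents. The following is only a minimal sketch under stated assumptions: the documents are already parsed into plain dicts, the key names (`resource`, `attributes`, `tracer_provider`, `processors`) are illustrative and not taken from the actual schema, and lists are replaced rather than merged because the thread does not settle how list merging should behave.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` is layered on top of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (an assumption of this sketch).
            merged[key] = deepcopy(value)
    return merged


# Hypothetical application config and platform-provided overlay.
app_config = {
    "resource": {"attributes": {"service.name": "checkout"}},
    "tracer_provider": {"processors": ["batch"]},
}
platform_overlay = {
    "resource": {"attributes": {"cloud.region": "eu-west-1"}},
}

merged = merge_config(app_config, platform_overlay)
```

Here the platform overlay adds a resource attribute without touching the application's tracer settings, and neither input is mutated.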
Thinking more on this. The issue with supporting the old env vars is that how they should be applied is ambiguous. Except, for env vars that set resources it is actually unambiguous. The resource section is always at the same path in the config file, correct? So there's no potential confusion as to how they would be applied.
If that is the case, there would be no reason why we could not continue to support all of the env vars that set resources. This also would provide a clear definition as to which env vars are supported, so there's no confusion there either. And allowing platforms to set resources via env vars makes a lot of sense: developers do not necessarily know which resources a platform has available to it. Allowing operators to add additional resources without needing to request that application developers push a new config file is also very helpful.
@jack-berg @codeboten @lmolkova what do you think of this proposal? Resource env vars are supported directly, but everything else requires a config variable.
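To make the "resource env vars apply unambiguously" idea concrete, here is a rough sketch of how `OTEL_RESOURCE_ATTRIBUTES` and `OTEL_SERVICE_NAME` could be layered onto the resource section of an already-parsed config. It is an illustration, not a proposed implementation: the `resource.attributes` path as a plain mapping, the function names, and the choice that env vars win over file values are all assumptions made for this example.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict:
    """Parse a comma-separated list of key=value pairs (values percent-decoded)."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        attrs[key.strip()] = unquote(value.strip())
    return attrs


def apply_resource_env(config: dict, environ: dict) -> dict:
    """Return a copy of `config` with resource env vars applied to resource.attributes."""
    result = dict(config)
    resource = dict(result.get("resource", {}))
    attributes = dict(resource.get("attributes", {}))
    # Assumption of this sketch: env var values override values from the file.
    attributes.update(parse_resource_attributes(environ.get("OTEL_RESOURCE_ATTRIBUTES", "")))
    service_name = environ.get("OTEL_SERVICE_NAME")
    if service_name:
        attributes["service.name"] = service_name
    resource["attributes"] = attributes
    result["resource"] = resource
    return result
```

Because the target path is fixed, there is a single place where the values can land, which is the property that makes this case unambiguous compared with exporter or processor env vars.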
@ocelotl thank you for your investigation! But, I think the goal is to support configuration via config file, not just pipeline development. We do not want to keep adding additional env vars, and in general it is not clear how env vars would be applied to complex pipelines. That's part of the reason why we want to move to file-based configuration.
I feel like creating a mix of 'these variables work all the time' and 'these variables work only some of the time' is worse, though?
An option could be to deprecate the env vars that don't work all of the time.
And allowing platforms to set resources via env vars makes a lot of sense: developers do not necessarily know which resources a platform has available to it.
Platforms should set platform specific resource attributes via resource detectors. OTEL_RESOURCE_ATTRIBUTES is the wrong tool for the job.
And allowing platforms to set resources via env vars makes a lot of sense: developers do not necessarily know which resources a platform has available to it.
Platforms should set platform specific resource attributes via resource detectors. OTEL_RESOURCE_ATTRIBUTES is the wrong tool for the job.
It would mean that every platform would need to implement a detector in every language and ship a package containing this detector. I suspect they won't, at least the smaller ones (from the Azure perspective, I doubt that we will be able to do it for every language). It would make the user experience worse: there won't be a consistent way for cloud providers etc., no consistent way for users across languages, and there would be more packages, version conflicts, dependencies, etc.
If we expect everyone who wants to integrate with otel to do something, we should provide them a convenience. Existing resource env vars do and are already used for this purpose, so why is it a wrong tool?
Thinking more on this. The issue with supporting the old env vars is that how they should be applied is ambiguous. Except, for env vars that set resources it is actually unambiguous. The resource section is always at the same path in the config file, correct? So there's no potential confusion as to how they would be applied.
I do support this proposal. I still think we might need a few extra env vars for exporters and propagators.
We can either investigate whether more env vars should interop with the config, or start with the resource ones and leave room for more env vars to be added to the 'interop' list in the future if we find them necessary.
I also support @trask's proposal to deprecate some env vars. I suggest starting with those which don't have a good interop story with the config.
Platforms should set platform specific resource attributes via resource detectors. OTEL_RESOURCE_ATTRIBUTES is the wrong tool for the job.
Resource detectors are for handling platforms which are not opentelemetry-aware, as we don't have any other choice. It is much better for platforms to set their resources directly in a language-independent manner, free of maintenance overhead and version mismatches.
Existing resource env vars do and are already used for this purpose, so why is it a wrong tool?
I regret bringing this up because I think it's beside the point of this conversation. With so much debate on this particular issue, it's especially important to not get sidetracked.
The ideas that have been discussed most recently in this thread are NOT compatible with the configuration working group's recommendation. The main idea seems to be to have the existing env vars override the contents of a config file where a clear mapping is possible, and to deprecate the existing env vars which do not have a clear mapping. If this were the direction, we would not want to have starter templates which reference all the existing env vars and their defaults using the substitution syntax, since doing so would muddy the waters even more. And without this requirement, we probably would think twice about introducing the env var substitution default syntax. Additionally, the config working group recommendation only needs to define the behavior of ignoring existing env vars when OTEL_CONFIG_FILE is specified, but this idea would need to define whether existing env vars take precedence whenever parse and / or create are called, or only when OTEL_CONFIG_FILE is specified.
This is to say that the recent ideas in the thread are an entirely new proposal, rather than a small modification to the recommendation. As I've mentioned several times, I'm unsure how to move on from here. There were 6 proposals considered by the SIG. This set of ideas represents a 7th. It would be incorrect for us to go and change the recommendation after concluding the process, and furthermore I personally wouldn't want to. The courses to take from here appear to be:
@jack-berg I think the issue with the evaluation document is that it did not provide an evaluation framework, and thus the recommendation is difficult to justify (see my rant on "pros & cons"). There is rarely an option that beats all other options on all decision dimensions, but when those criteria are not even defined clearly, the comparison will always look unconvincing. On the other hand, if the decision was documented with a decision framework, then adding proposal #7 to it would be relatively straightforward and a new (or same) recommendation can be issued.
One very useful decision framework is Traffic Lights (based on quick googling this blog post does a decent job explaining it). It's not difficult to transform the existing pros & cons already collected in the doc and in this thread into a traffic lights matrix. Some of the criteria that I would consider important are:
The important aspect of traffic lights method is to make sure that people's concerns are heard and reflected in the evaluation. For example, @lmolkova is worried about existing documentation and backwards compatibility - so we should add that to the matrix and compare different options on this axis (e.g. 2b would be yellow, Ted's proposal of resource-only env vars would also be yellow).
The document summarized the options, but was paired with a comment which did describe an evaluation framework. It was not a traffic lights framework, but included evaluation criteria with an emphasis on the high priority criteria, a defense of the recommendation including explanation of weaknesses, and summary of why the others were rejected.
Maybe otel could / should adopt a decision making framework like traffic lights, but no such precedent exists. One problem I can think of, without having used traffic lights in anger, is disagreement over the color of the light, and over the relative significance / weight of each of the evaluation criteria. In the process for making this recommendation, the configuration-maintainers voted repeatedly until a winner was selected - similar to what the TC might do. How do you describe each individual's process for deciding how to vote in a summary of a recommendation?
There's an additional meta-issue with this: We described a process to collect and summarize all the different points of view, and make a decision. We advertised that process and executed it in good faith. It was a significant investment in time. We drew a line in the sand, and the process wasn't perfect but it was pretty good and organized compared to what I've encountered elsewhere in this project. Re-initializing the conversation sends a signal that processes like this have no teeth, and hurts decision making since there's no penalty for not participating (i.e. the conversation is never really over). (Note: This is not targeted at anyone, or even at this particular issue. I've noticed this with a number of issues over the years and it seems relevant now.)
I find the process you outline here https://github.com/open-telemetry/opentelemetry-specification/issues/3752#issuecomment-1995582317 to be awesome.
I agree with @yurishkuro that there could be more evaluation criteria. I'd add something along these lines:
I think user experience is more important than "how much extra work on maintainers a proposal creates", but both have a good place in the evaluation criteria list.
If we had something like this as prereqs for the significant changes (does not degrade existing features, prototyped and evaluated, not too complex), there would be less last-minute feedback.
If we had something like this as prereqs for the significant changes (does not degrade existing features, prototyped and evaluated, not too complex), there would be less last-minute feedback.
The equivalent of this was made clear in the discussions; the higher support cost from this and equivalent proposals was highlighted in the discussion agenda. The TC decided to not prioritize that sufficiently, and we have to accept that. I thought my subsequent proposal to make the default support env vars using a merge of yaml was consistent with the chosen proposal, but @jack-berg points out it isn't because the proposal wording includes templates for these. We have to accept that too (though I think this is a good compromise and we should be a little flexible about process vs best outcomes). So it's clear that to add this we need to escalate for a TC decision. For me, I feel like the people who would decide this have viewed the points here in the discussion and are not supportive of adding these defaults, so I don't see the point in such an escalation, though I'll happily support anyone who decides the effort is worthwhile.
I want to emphasize, if a new feature degrades/breaks existing stable experience, it should not happen. No amount of "keep things simple" or "decision has already been made" can justify breaking/degrading user experience.
@jack-berg having been on both sides of this process multiple times, I agree that it feels broken when a SIG ends up having to rehash everything when a proposal is brought to the community, and essentially having the entire debate over again so that the proposal can be approved.
Your (very helpful) design breakdown does a good job of listing the criteria that the design proposals are evaluated against. Because of the work done by you and the SIG, I actually think the debate happening now is helpful, but could be structured better. Let me explain why.
I've been involved in many of the major design decisions in Otel (tracing, context propagation, error handling, etc, etc). For major design decisions, it is often the case that when the proposals go public, requirements that the designers miss are brought in by community members who were not part of the internal design process. This is normal; ensuring that our designs meet all requirements is one of the reasons we have a public review process. OTel must work across many languages, runtimes, and platforms, and it must be careful about breaking compatibility. Metrics is a major example of a design that took three complete rewrites before meeting all requirements. That was unfortunate, but if we had refused to honor those late-breaking requirements, our metrics solution would have been a failure. Honestly, with something as major as a new configuration model, requirements and feedback from the wider community should be expected. Especially if the design proposes a perceived break in compatibility!
Anyways, my point is that I don't actually think that the debate we are having here is unnecessary. Right now a new requirement is being proposed – namely, that our definition of compatibility should be stricter than the definition that was used to drive the current design. Compatibility is very important, so it's reasonable that we would spend more time gathering real-world examples and finding clarity on what we actually need here.
What is perhaps making this conversation difficult is that we are mixing requirement gathering with designing. Proposed requirements are getting glued together with particular solutions. That's crazy making. I agree with @yurishkuro's comments, and I recommend that we back off from talking about solutions for a bit. Let's first get agreement on what our compatibility requirements actually are. Once we actually agree on requirements, I suspect that the design solution will be fairly obvious.
If we want to elevate this decision to the TC, that's fine. But we need the TC to decide on the requirements, not the solution. Let's spend this next week getting our requirements gathering into a clean document, so that we can actually see what we are talking about. For debated requirements, we can record the different opinions on what the requirement actually is, along with real world examples. No mention of solutions until we finish this work. Perhaps the result of this process will improve how we make proposals in the future.
Let's first get agreement on what our compatibility requirements actually are.
I am going to look into this and if necessary will take to the TC for clarification. I see confusion and variety in opinions, which need to be clarified regardless of what we decide for config. I will post back when I have an update.
I think the approach proposed by the workgroup is a reasonable choice. It brings a welcomed conceptual simplification, and sets us up well for things like remote SDK configuration.
Below I collected some points about features that the proposed solution will not support, as compared to our current solution. I don't see any of these as a blocker, but I want to list them here as part of this discussion, as it seems some people consider them as "implicit requirements" in the context of backward compatibility.
It's not possible anymore to override implicit defaults.
The loss of this ability surprised some users to whom I talked about the proposed configuration approach. Currently, in .NET for example, users can add exporters with default values, which then are overwritten by environment variables. For example, .AddOtlpExporter() is called on the tracer provider, and then OTEL_EXPORTER_OTLP_ENDPOINT is set. As far as I can see, this is a use case that will not be supported by the proposed model: all defaults that should be overwritten must be specified explicitly. While I don't think this is necessarily a bad change, it is a behavioral change that will surprise some users.
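The behavioral difference described above can be illustrated with a small sketch. This is not .NET code and not any SDK's actual implementation; the function names and the `endpoint` key are hypothetical, and `http://localhost:4317` is used as the built-in OTLP/gRPC default.

```python
import os

DEFAULT_OTLP_ENDPOINT = "http://localhost:4317"


def resolve_endpoint_env_model(environ=os.environ) -> str:
    """Env var model: the implicit default is silently overridden by the env var."""
    return environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_ENDPOINT)


def resolve_endpoint_file_model(exporter_config: dict) -> str:
    """File config model as described above: only the file's own content is consulted.

    OTEL_EXPORTER_OTLP_ENDPOINT is deliberately not read here. To make the value
    overridable, the file itself would have to reference an env var explicitly.
    """
    return exporter_config.get("endpoint", DEFAULT_OTLP_ENDPOINT)
```

With the env var set to `http://collector:4317`, the first function returns the collector address while the second still returns whatever the file states (or the default), which is the surprise some users reported.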
I'm not sure if the template approach can cover all currently supported environment variables.
Is it possible to have a template that honors all existing environment variables? Or will this be a best-effort approach? It seems to me that environment variables using key-value pairs and lists cannot be used with this approach. I'm also not sure how variables like OTEL_TRACES_EXPORTER can be supported.
We don't have guaranteed consistency anymore in environment variable overrides.
Although we'll provide templates, nothing will stop users from using ${OTLP_ENDPOINT} in one place, and ${OTEL_OTLP_ENDPOINT} in another. The consistent set of environment variables we have now is hard to maintain, however, it's a nice-to-have from a user's point of view: it gives a consistent experience across languages, and it avoids unpleasant surprises.
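The consistency concern can be seen in a small sketch of env var substitution over a config template. The `${NAME:-default}` form is assumed here as the substitution-with-default syntax discussed in this thread; the regular expression, the function names, and the two template keys are hypothetical and only serve the illustration.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}\n]*))?\}")


def substitute(text: str, environ: dict) -> str:
    """Replace each reference with the env var's value, or its default if unset."""
    def replace(match):
        default = match.group("default") or ""
        return environ.get(match.group("name"), default)
    return _PATTERN.sub(replace, text)


def referenced_variables(text: str) -> list:
    """List the env var names a template depends on, sorted alphabetically."""
    return sorted({match.group("name") for match in _PATTERN.finditer(text)})


# Hypothetical template in which two authors picked different variable names.
template = """\
trace_endpoint: ${OTLP_ENDPOINT:-http://localhost:4317}
metric_endpoint: ${OTEL_OTLP_ENDPOINT:-http://localhost:4317}
"""

rendered = substitute(template, {"OTLP_ENDPOINT": "http://collector:4317"})
```

Setting only `OTLP_ENDPOINT` changes the first line and leaves the second at its default, since nothing ties the two names together. A helper like `referenced_variables` could at least surface which names a given file expects.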
The loss of this ability surprised some users to whom I talked about the proposed configuration approach. Currently, in .NET for example, users can add exporters with default values, which then are overwritten by environment variables. For example, .AddOtlpExporter() is called on the tracer provider, and then OTEL_EXPORTER_OTLP_ENDPOINT is set. As far as I can see, this is a use case that will not be supported by the proposed model: all defaults that should be overwritten must be specified explicitly.
I'm not sure how this maps to the normal .NET programmatic config process. I know in .NET the pattern involves a combination of programmatic config with elements that implicitly read from env vars. File config implies more of a one-liner config process, where the user calls something like OpenTelemetrySdk.initialize() (or equivalent) which detects that OTEL_CONFIG_FILE is set, parses it, creates SDK components from the model, and returns those SDK components to the caller. I'm sure OpenTelemetry .NET will find a way to balance the API so that a user can start with a config file and optionally layer on additional config programmatically, but it's really a different paradigm. Where the current env var based config almost necessitates some programmatic config because so many options are missing, the file config model aims to be a near exhaustive representation of what can be done programmatically. Maybe some users will still want to combine programmatic config with file config, but that's kind of missing the point.
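The one-liner flow described above (detect `OTEL_CONFIG_FILE`, parse, create, return) might look roughly like the sketch below. Everything here is a stand-in: `Sdk`, `parse`, `create`, and `initialize` are hypothetical names, the model keys are invented for the example, and JSON is used in place of YAML only so the sketch needs nothing beyond the standard library.

```python
import json
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sdk:
    """Hypothetical stand-in for the SDK components returned to the caller."""
    service_name: str
    exporter_endpoint: str


def parse(text: str) -> dict:
    # A real implementation would parse YAML; JSON keeps this sketch stdlib-only.
    return json.loads(text)


def create(model: dict) -> Sdk:
    # Build SDK components from the in-memory model (key names are assumptions).
    return Sdk(
        service_name=model["resource"]["attributes"]["service.name"],
        exporter_endpoint=model["exporter"]["endpoint"],
    )


def initialize(environ=os.environ) -> Optional[Sdk]:
    """Return configured components if OTEL_CONFIG_FILE is set, otherwise None."""
    path = environ.get("OTEL_CONFIG_FILE")
    if not path:
        return None  # the caller falls back to env var or programmatic configuration
    with open(path, encoding="utf-8") as handle:
        return create(parse(handle.read()))
```

Keeping `parse` and `create` as separate steps leaves room for a caller to adjust the model programmatically between the two, which is one way the layering mentioned above could work.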
Is it possible to have a template that honors all existing environment variables?
No, it would be a best effort approach. #3948 represents the recommendation, and has an associated PR in opentelemetry-configuration: https://github.com/open-telemetry/opentelemetry-configuration/pull/76/files. See the proposed starter template comments for a list of env vars which don't map well and would be ignored.
We don't have guaranteed consistency anymore in environment variable overrides.
Yes, this is correct.
A point I haven't heard made yet with respect to compatibility and the env vars:
The goal of the env var spec is to standardize names of env vars where there is commonality between implementations. SDKs are not required to implement the env vars, and how they are implemented is left intentionally open ended:
The goal of this specification is to unify the environment variable names between different OpenTelemetry implementations.
Implementations MAY choose to allow configuration via the environment variables in this specification, but are not required to.
Environment variables MAY be handled (implemented) directly by a component, in the SDK, or in a separate component (e.g. environment-based autoconfiguration component).
This certainly leaves room for the introduction of a more prescriptive (and possibly net-new) component for handling file based configuration. I'm not sure how to read the intentionally loose language and conclude that we are restricted from introducing something new, quite different, and opinionated.
Environment variables are marked as stable. SDKs that implement them (virtually all) are mostly marked stable. SDKs can't stop supporting env vars without major version update.
Users that set env vars should be able to keep setting them and if we deprecate some, we'd need to provide at least some back-compat support for them anyway.
Please let's stop prioritizing new non-existent features over existing and popular ones.
SDKs can't stop supporting env vars without major version update.
I'm not advocating for stopping supporting env vars. Just providing an alternative door (door b). I see no reason to not support door a indefinitely - at least in opentelemetry-java where we strongly oppose revving the major version.
Please let's stop prioritizing new non-existent features over existing and popular ones.
Popular due to lack of alternative. The single most common type of issue I respond to in opentelemetry-java is asking for things not supported by the env vars. My response is always that new env vars need to be added to the spec, but that there is a moratorium in place making it difficult to add new ones. Just look how many times #2891 has been linked to. The user story is painful when env vars don't exist to express what you want, which occurs quite often. The single most popular issue (61 up votes at time of writing) in java instrumentation is about a lack of expressiveness in the env var syntax for describing non-trivial sampling situations. We wrote a dedicated view file config tool in opentelemetry-java to stop the bleeding and provide desperately needed configuration of things like explicit bucket boundaries, reducing cardinality, and dropping unneeded metrics. Based on the number of times I refer people to it when answering issues, it's quite popular.
If file config turns out to not be popular / useful, then people won't set OTEL_CONFIG_FILE and there's no problem with conflicting env vars to worry about 😁.
If we stabilized something imperfect in the past, we can't just say we do something else now. We have to find a way to update gracefully and keep all the good things that we had.
So I'm suggesting to keep both doors open at the same time in https://github.com/open-telemetry/opentelemetry-specification/pull/3948#issuecomment-2015572791.
We can and should review existing env vars and fix/deprecate some of them. In the current proposal there is no attempt to fix env vars; there is an implicit attempt to drop them eventually (or make sure nobody uses them) in an ungraceful manner.
Without taking a position on whether the existing env vars should or should not interop:
I think we should deprecate any env vars that do not interop with yaml config.
Why?
I think OpenTelemetry should have a clear recommendation for users on how to configure SDK + Instrumentation.
We know that yaml config is required to support some popular user requests, namely metric views and attribute-based sampling (the 4th most upvoted issue across all of OpenTelemetry).
So, if our recommendation is to "start with env vars", then we know that we are steering a lot of users down a one-way path.
I think it will be a much better user experience for everyone to start directly with yaml and not need to rewind and go down a different path later.
SDKs of course will have to support the deprecated env vars (without yaml interop) at least until their next major bump.
But by deprecating the env vars that do not interop with yaml config, we give a clear signal to users about the path they should take when onboarding to OpenTelemetry.
I think we should deprecate any env vars that do not interop with yaml config.
I'm not opposed to that, but we should consider the timing:
File config is still a very experimental idea. This is in large part because of how contentious the PRs have been (note that these PRs were for the most part just restating things that had already been approved in the original otep):
There's still a lot of work to be done, especially:
Based on the history of getting things done for file config, and of getting similar schemas like opentelemetry-proto from experimental to stable, I'm disappointed to say that I don't see a stable file config spec as realistic in the short / medium term future.
Deprecating env vars significantly before file config is set to be stable sends a bad signal to the user: env vars are stable but deprecated, but the replacement is experimental without a target date for stability. I'd like to see env vars and file config coexist without deprecating env vars until there's a realistic prospect of marking file config stable.
+1 to eventually deprecating some or all old env vars, ideally to have only one way of configuring the SDK. This of course can only happen sometime after the new way of configuring has a stable spec and is widely implemented by SDKs.
Hi All - Please see this comment updating the status of the issue:
Per @tedsuo’s request, we discussed this issue in the 3/24/27 TC meeting and have made a decision: Generally, we will follow @trask's comment, proceeding with this PR with a few changes:
- Rename OTEL_CONFIG_FILE to OTEL_EXPERIMENTAL_CONFIG_FILE, reflecting the fact that the semantics around how the value of the env var is interpreted are subject to breaking changes as the file configuration spec and schema continue to evolve.
- Ensure that env vars which don’t interop with file config are deprecated when file config is ready for stabilization, reflecting that we do not want to recommend multiple competing configuration stories. This could be ensured via an explicit note in the markdown, or a blocking issue - both achieve the same effect. https://github.com/open-telemetry/opentelemetry-specification/issues/3967
- Ensure that file config has an interop where platforms (i.e. Azure functions, otel operator, etc) contribute to config. We should proceed with #3948 without being prescriptive about how that mechanism works. In the TC meeting, 4 distinct solutions were discussed which had different tradeoffs and limitations. It is clear that we still need to learn more about the requirements and constraints of this use case and let the findings inform the solution. The config working group should prioritize this discussion, but an answer shouldn’t block this PR. We should open a new issue to track the requirements and discuss solutions, and ensure that we treat that issue as blocking for any sort of stabilization effort (although it should ideally be solved much sooner). https://github.com/open-telemetry/opentelemetry-specification/issues/3966
The conversation about whether file configuration should completely ignore the sdk environment variable scheme came up in #3744, but that PR doesn't actually contain any language related to this.
The original file configuration OTEP stated:
As mentioned here, file configuration doesn't actually contain language describing this behavior. It was originally included in #3437 but was lost in the PR review shuffle - accidentally, not in response to feedback.
@tedsuo argues in favor of file configuration respecting env vars with:
@MrAlias argues in favor of ignoring env vars with:
@trask supports the feeling of users expecting env vars to override file configuration, but also says merging configuration from multiple sources is hard:
This topic came up several times during the lengthy review of the file configuration OTEP. Below are links to a number of relevant points:
https://github.com/open-telemetry/oteps/pull/225#discussion_r1116269308
https://github.com/open-telemetry/oteps/pull/225#discussion_r1119068865
https://github.com/open-telemetry/oteps/pull/225#discussion_r1142380977
Update 3/15/2024
The current state of this issue is:
Update 3/28/2024
Please see this comment updating the status of the issue: