Should file configuration merge environment variable configuration?

jack-berg commented 11 months ago

The conversation about whether file configuration should completely ignore the sdk environment variable scheme came up in #3744, but that PR doesn't actually contain any language related to this.

The original file configuration OTEP stated:

Interpret the configuration model and return SDK TracerProvider, MeterProvider, LoggerProvider which strictly reflect the configuration object's details and ignores the opentelemetry environment variable configuration scheme.

As mentioned here, file configuration doesn't actually contain language describing this behavior. It was included originally included in #3437 but was lost in the PR review shuffle - accidentally, not in response to feedback.

@tedsuo argues in favor file configuration respecting env vars with:

The common expectation among developers is that env vars will automatically overwrite config parameters. If we do it the other way, I am concerned that until the end of time we will have a steady stream of users lodging issues about this and then becoming very frustrated when we explain that they need to modify the config file to use an env var.

The reason that these users will be upset is that they are in a situation where having to define the env vars in the config file is a non-starter. Their use case is that for some reason they either can't get at or are not allowed to modify the config file template, but they really need to override a parameter.

For reference, this is an issue that has come up on many OSS projects I have been involved with where it is common to have both operators and application developers wanting to configure the same thing. Often it's some kind of emergency situation where the person with the rights to change the config file is unavailable.

I should note that of course the opposite situation could be true, where for some reason you want to disable an env var but don't have access to it. But it's probably easier to give users the ability to disable an env var via the config file than it is to give them the ability to disable a config parameter via an env var.

@MrAlias argues in favor of ignoring env vars with:

To be fair, from experience, you're going to get people complaining either way. If you choose environment variable priority over a configuration users will complain that their deployment was altered and failed when an environment variable was set that took precedence.

However, if you make environment variables take precedence you will also need to make some pretty sever and subjective choices on how they are mapped to a config. Do the BSP environment variables apply to all batch span processors or just one? Is the sampler environment variable use for all tracer providers in the config, even ones that specify alternate samplers? Should propagators be merged or overridden? If an exporter is defined by environment, does that stop the console logging exporter used for debugging as well as all the other exporters defined in configuration?

Ultimately, I think the current changes are going to be the most appropriate. They allow users to make their own choice in precedence without making subjective choices for them on how to map things. If a user want environment variables to take precedence, all they need to do is use the OTel environment keys in the related parts of their configuration. In doing so they will answer each of the above questions their own way.

@trask supports the feeling of users expecting env vars to override file configuration, but also says merging configuration from multiple sources is hard:

This is my feeling as well, especially for things like:

OTEL_SDK_DISABLED OTEL_RESOURCE_ATTRIBUTES OTEL_SERVICE_NAME OTEL_LOG_LEVEL OTEL_PROPAGATORS OTEL_TRACES_SAMPLER_ARG I totally understand the nightmare that is merging configuration from multiple sources though.

I wonder if we would have created many of the other env vars (e.g. OTELBSP, OTELBLRP) if we had configuration file support from the beginning? And if so, maybe we can deprecate those other env vars in favor of configuration file?

This topic came up several times during the lengthly review of the file configuration OTEP. Below are links to a number of and relevant points:

https://github.com/open-telemetry/oteps/pull/225#discussion_r1116269308

Layering of config as described below would make it more difficult to reason about what my program is actually being run with.

Additionally, perhaps give them a helper config file that uses env var substitution in the right places so that they can migrate easily (and still get a warning until they get rid of env variables and move everything to the config file).

https://github.com/open-telemetry/oteps/pull/225#discussion_r1119068865

What about 'Solution 3: fail when both environment configuration and file configuration are present'?

I think we could log a warning when we detect this, but failing is too strong. Consider the implications if a user is operating in an environment they don't fully control (i.e. where an ops team configures environment variables by default which they extend / layer on top of).

https://github.com/open-telemetry/oteps/pull/225#discussion_r1142380977

I have a (rather strong) opinion that setting set via env vars has higher priority (takes precedence) over a setting set via configiuration file and it should be marked as a goal.

Any scheme where environment variables have priority over a config file will require some sort of standard mapping between the environment variable schema and file config scheme. IMO, its impossible to define such a mapping which is intuitive in all cases, so better not to try.

Nothing is forcing users to use file based config - its opt in. If they do opt in, they're opting into the documented behavior in which the config file represents the source of truth for configuration. If they wish to customize the experience with additional layers / overrides, they have a couple of tools:

They can use the fact the Configure(config) API accepts a config model as an argument, and provide their own customizations to the model after the initial parsing of the file via Parse(file). An example of such a customization would be to interpret environment variables and apply them to the model in a way they decide makes sense.

They can use environment variable substitution to reference environment variables directly in a configuration file.

Update 3/15/2024

The current state of this issue is:

The configuration working group proposed and executed a process to move forward after several months of failing to reach consensus
As a part of that process, the configuration maintainers reviewed proposals and have recommended moving forward with option 2.b as described in this comment
As a part of that process, people who disagree with the recommendation can / should escalate to a TC decision

Update 3/28/2024

Please see this comment updating the status of the issue:

Per @tedsuo’s request, we discussed this issue in the 3/24/27 TC meeting and have made a decision: Generally, we will follow @trask's comment, proceeding with this PR with a few changes:

Rename OTEL_CONFIG_FILE to OTEL_EXPERIMENTAL_CONFIG_FILE, reflecting the fact that the semantics around how the value of env var are subject to breaking changes as the file configuration spec and schema continue to evolve.

Ensure that env vars which don’t interop with file config are deprecated when file config is ready for stabilization, reflecting that we do not want to recommend multiple competing configuration stories. This could be ensured via an explicit note in the markdown, or a blocking issue - both achieve the same effect. https://github.com/open-telemetry/opentelemetry-specification/issues/3967

Ensure that file config has an interop where platforms (i.e. Azure functions, otel operator, etc) contribute to config. We should proceed with #3948 without being prescriptive about how that mechanism works. In the TC meeting, 4 distinct solutions were discussed which had different tradeoffs and limitations. It is clear that we still need to learn more about the requirements and constraints of this use case and let the findings inform the solution. The config working group should prioritize this discussion, but an answer shouldn’t block this PR. We should open a new issue to track the requirements and discuss solutions, and ensure that we treat that issue as blocking for any sort of stabilization effort (although it should ideally be solved much sooner). https://github.com/open-telemetry/opentelemetry-specification/issues/3966

jack-berg commented 7 months ago

Hi all - we’ve been discussing this issue for a while now. Many proposals have been put out there, and we’ve failed to reach consensus. I talked with the other @open-telemetry/configuration-maintainers and have worked out a path forward. We (the configuration-maintainers) are going to vote. Before we do, we’d like each proposal that folks want considered to be summarized and discussed in the upcoming 3/4/24 SIG meeting. Afterwards, the configuration maintainers will vote, and will recommend to the TC that we go with that option. Obviously there have been many participants in this conversation who are no configuration-maintainers - if you strongly disagree with the recommendation, you can escalate and request a TC decision. More details in 3/4/24 agenda entry in the configuration working group meeting notes.

The action item for folks is to go and add proposal summary to the meeting notes if you feel strongly about a particular proposal. Please attend the 3/4/24 SIG and discuss the proposal. If you support a proposal put forth by someone else, you can express that support in the doc.

Thanks for all your participating in this issue up till now!

brunobat commented 7 months ago

Sorry @jack-berg, isn't the next meeting on March 4? I have no meeting set for March 11 in the calendar.

jack-berg commented 7 months ago

Thanks @brunobat - yes that was a typo. March 4th is the correct date. Updated.

jack-berg commented 6 months ago

We discussed the proposals at the 3/4/24 SIG meeting, but didn't finish. We decided to add a supplementary meeting (normally we meet bi-weekly) on 3/11/24 at 8am PT to continue. Please attend if you want to participate. Thanks!

jack-berg commented 6 months ago

The configuration maintainers considered various proposals, and recommend proceeding with option 2.b as summarized in this document. TL;DR:

File configuration ignores the existing env var scheme when evaluating
Lean on / encourage users to use env var substiuttion
Introduce env var substitution default syntax (i.e. ${ENVVAR:defaultValue})
Introduce file configuration templates as a starting point for users, which use the existing env var scheme and their default values
Users can choose to remove env var references and lose the ability to override

A PR will be opened soon reflecting this recommendation. As mentioned, folks who disagree can escalate to a TC decision.

The proposals were evaluated against the following criteria. Our recommendation is based on a prioritization of these criteria combined with an evaluation of the degree to which each proposal aligns with each criteria. The criteria are in no particular order - all are important - but the bolded entries are particularly high priority.

File configuration users should be able to change SDK configuration without rebuilding an application.
File configuration users should have an intuitive experience based on industry standards and expectations.
Users should have a graceful migration path from the existing env vars scheme to file configuration.
File configuration should have the support of OpenTelemetry maintainers.
File configuration should be expressive enough to satisfy almost all SDK configuration use cases.
File configuration schema should mirror SDK programmatic configuration.
File configuration should be reasonably complex to implement.

We recommend option 2.b. Arguments are as follows:

We’re moving to YAML based file configuration because of frustration trying to express complex configuration in a flat env var scheme. Trying to have an env var override scheme for YAML config that is both exhaustive and ergonomic is circular logic: if it were possible / reasonable to express this kind of information in a flat environment scheme, we wouldn’t be here. And so we’re stuck choosing whether to prioritize an expressive YAML schema or ergonomic env var overrides. We believe we should look forward, not back, and prioritize modeling the YAML schema in the most natural way possible for each problem domain.

Option 2.b accomplishes this while being accommodating to other criteria:

Users wishing to change SDK configuration without rebuilding an application can do so. If the config was bundled with the app, they can copy its contents to a new file, make the changes, and point OTEL_CONFIG_FILE to that new location.
Users looking for env var overrides of YAML config shouldn’t be too surprised. Kubernetes users are familiar with manifest files to express config. And while there’s a syntax for patching resources, in my experience (Jack Berg) it’s unintuitive and I haven’t heard people using it in their workflows. The collector pretty exclusively uses file based configuration. There is a syntax for overriding select properties, but I haven’t heard of people using it and had to dig into source code to find out how it works, so I assume it’s not popular. Modern users of systems like kubernetes are quite familiar with using ConfigMaps to define configuration and mount the content in a file in a deployment. Modifying a ConfigMap which is maintained alongside your kubernetes env vars, and redeploying is a hardly different workflow than modifying env vars. And with option 2.b, there will be much less surprise since there is no error prone merge or new complex env var override syntax. What you see is what you get.
Users should have a graceful migration path from existing env var scheme to file configuration, and with templates and a “how to” guide, they will. Nothing is forcing a user to switch to file configuration, but when they do, their starting point will be a template referencing all the well known env vars. You might say that users are forced to migrate to file configuration by the existence of features (i.e. views) that do not exist in the existing env var scheme. But options like 1 and 3 don’t make those new features easier to specify with env var overrides. Those features don’t exist in the current env var scheme because it’s hard to express the information in a flat scheme, and that will continue to be true. With or without env var overrides, a user migrating to file config will be better off specifying their views in YAML than trying to override with env vars.
It leaves the door open for adding env var override scheme later, if user demand supports it. However, it will be difficult to justify later adding an env var override scheme which includes the existing env var scheme. It would be a bad user experience if today OTEL_TRACES_EXPORTER does nothing to file configuration and in a future version it produces a different result. A future env var override scheme would likely need to involve new env vars, like proposed in option 1.

The following summarizes why option 2.b was preferred to the other proposals:

Option 1: Refactoring the schema to ensure an ergonomic env var scheme is a hard pill to swallow. We can always add this type of scheme later if there is strong user demand.
Option 2.a: This is the same as the selected 2.b but without env var substitution defaults and without starter templates which reference the existing env var scheme. Adding these can only help adoption.
Option 2.c: The merge syntax will be a bit hard for users to grasp, add to implementation complexity, and most users will likely prefer to simply edit the config file directly.
Option 2.d: Existing tools exist for merging YAML files. We should encourage users to use these instead of burdening implementations with additional complexity.
Option 3: This options doesn’t align with the goals of the configuration file feature. We hope to provide a rich and expressive format to configure both simple and complex OTel setups. This option would limit that expressiveness from the start.

marcalff commented 6 months ago

We recommend option 2.b. Arguments are as follows:

I support this option, this is a good solution.

tedsuo commented 6 months ago

Just to clarify: with 2b, we are saying that existing env vars will continue to work, because the SDK will use them to supply values to any parameter not set in the config file, without the user having to first specify those env vars? They only have to specify the env vars if they want them to act as an override. Is that correct?

trask commented 6 months ago

Just to clarify: with 2b, we are saying that existing env vars will continue to work, because the SDK will use them to supply values to any parameter not set in the config file

no, as soon as you add a (say empty) config file, all of the existing (standard) env vars will stop working

MrAlias commented 6 months ago

Just to clarify: with 2b, we are saying that existing env vars will continue to work, because the SDK will use them to supply values to any parameter not set in the config file

no, as soon as you add a (say empty) config file, all of the existing (standard) env vars will stop working

True. We will likely need to publish a default config users can build from that includes all the existing environment variables.

tedsuo commented 6 months ago

Ok. I'll try to summarize my thoughts and to be as much in alignment with what is being proposed.

Defining a comprehensive env var system is both difficult and unnecessary for implementing file-based configuration. There is no need for us to tackle this right now.
Allowing env vars to automatically override config parameters is good industry practice, but it's exceedingly difficult for us to implement. Allowing users to define env var substitution is a good middle ground.

My only concern is that having a "switch" where existing env vars are all automatically disabled in the presence of a config file, and all automatically enabled when the config file is missing. I am concerned that such a switch will create a long tail of user issues. Imho, it would be better if there was no switch, even if the resulting solution was incomplete.

If, by default, existing env vars could be applied when they are valid, and ignored when they are not, it would be better than a switch in behavior. For example, OTLP_EXPORTER_ENDPOINT configures the OTLP endpoint, unless the config file has a list of OTLP exporters, or is otherwise setup in such a way that this would be an invalid parameter. In which case, a warning is printed and the env var ignored. In other words, we include a default mapping of env vars to config parameters, which would ignore any parameters which cannot be applied to the config file.

Since there isn't a full PR to look at, it's not clear to me how this relates to env var substitution, defaultValues, templates, and the rest of the 2.b proposal. Maybe some form of this is already present? My point is, a best effort to keep existing env vars working as expected would be very valuable, even if it does not represent a complete solution to every possible config situation.

trask commented 6 months ago

If, by default, existing env vars could be applied when they are valid, and ignored when they are not, it would be better than a switch in behavior. For example, OTLP_EXPORTER_ENDPOINT configures the OTLP endpoint, unless the config file has a list of OTLP exporters, or is otherwise setup in such a way that this would be an invalid parameter. In which case, a warning is printed and the env var ignored. In other words, we include a default mapping of env vars to config parameters, which would ignore any parameters which cannot be applied to the config file.

+1

jack-berg commented 6 months ago

Since there isn't a full PR to look at, it's not clear to me how this relates to env var substitution, defaultValues, templates, and the rest of the 2.b proposal. But a best effort to keep existing env vars working as expected would be very valuable, even if it does not represent a complete solution to every config situation.

The idea is to ignore all other env variables if OTEL_CONFIG_FILE is present, regardless of whether or not they can be applied without some sort of conflict. We would have starter config templates, which we encourage users to adopt, and which reference existing env vars and their defaults using the env var substitution syntax. For example, an abridged version of a starter template focusing on the default config for exporting in tracer provider might look like:

tracer_provider:
  processors:
    - batch:
         exporter:
           endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4318}
           protocol: ${OTEL_EXPORTER_OTLP_PROTOCOL:http/protobuf}
           headers: ${OTEL_EXPORTER_OTLP_HEADER}
           // and so on

You can wait for the full PR, but what you're describing is different than the recommendation we came to. We've been talking about this for months, and solicited a final set of proposals to consider, which we discussed over the course of the last two weeks. I'm not sure how exactly to handle additional proposals when we already went through this process and made a recommendation.

Consider escalating for a TC decision. Although, I will say that I think that the person who escalates to a TC decision should be responsible for thoroughly understanding the issue and discussions that have taken place up to this point, and organizing / presenting supporting materials to the TC. For this particular issue, there is quite a lot of context and discussion.

austinlparker commented 6 months ago

What was the rationale behind not going for option 1, just the overhead of having to define new env vars to match the new schema?

austinlparker commented 6 months ago

For what it's worth I'm generally predisposed to trust the process here, and I don't think 2b is an unreasonable process. I also think there's less of an issue with overrides here, because the config file is effectively a switch into a different way of doing it - if you opt-in to the new option, you can still use env var substitution, you just need to do those mappings in the file itself, right?

As an example of where this would simplify things a lot, the demo passes around a whole lot of env vars for config that could be consolidated into a single config file with per-service overrides.

codeboten commented 6 months ago

What was the rationale behind not going for option 1, just the overhead of having to define new env vars to match the new schema?

Ultimately there were two reasons for not going with number 1:

updating the config schema to make it easily compatible with flat environment variables was not trivial (ie removing complex objects that aren't easy to describe using env vars)
what you mentioned, the overhead of defining and entire new set of env vars would be cumbersome, but additionally this new env var wouldn't help reduce the confusion for end users. In the sense that users of the existing variables would still encounter issues that would cause them to have to switch to the new variables. Likely we would end up having to support both for a long period of time, and which is likely to cause confusion for users.

lmolkova commented 6 months ago

The idea is to ignore all other env variables if OTEL_CONFIG_FILE is present

I still perceive it as a breaking change. My OTEL_SERVICE_NAME worked and now it doesn't because I specified an extra env var (with file content I copied from ChatGPT or Stack Overflow).

While technically user does something incorrect here, no amount of documentation can prevent users shooting themselves in the foot. And if users feel that trust is broken, it'll be hard to repair it with RTFM.

I think it's reasonable that something provided in file explicitly overrides legacy env var (or vice versa) especially when it's accompanied with warning message. Is there a technical difficulty preventing us from supporting existing env vars for the foreseeable future? Seems like a small effort that mitigates all the possible risks, so why not?

jack-berg commented 6 months ago

I still perceive it as a breaking change. My OTEL_SERVICE_NAME worked and now it doesn't because I specified an extra env var (with file content I copied from ChatGPT or Stack Overflow).

Hmm.. this is a gray area, but I don't think this is a breaking change. Generally, I think we're obligated to have backwards compatibility with the behavior of env vars as long as the set remains consistent. I.e. if you define FOO and BAR, the behavior should be consistent across releases so long as you keep the same env vars. If we (i.e. the otel spec) later define a new behavior for BAZ, and you don't include it in your environment, the behavior should be consistent. But setting BAZ in your environment might change the behavior such that the vars FOO and BAR no longer do what they did before.

For a concrete example, consider OTEL_SDK_DISABLED, which is new compared to the other env vars. Previously, if you set OTEL_EXPORTER_TRACES=otlp, you'd expect traces to be exported to via OTLP. After OTEL_SDK_DISABLED was introduced, you still expect traces to be exported via OTLP, as long as you don't change your env vars. But as soon as you set OTEL_SDK_DISABLED=false, you override the behavior of other environment variables, and you no longer get traces exported over OTLP.

The introduction of OTEL_CONFIG_FILE is analogous to the introduction of OTEL_SDK_DISABLED.

Is there a technical difficulty preventing us from supporting existing env vars for the foreseeable future? Seems like a small effort that mitigates all the possible risks, so why not?

I disagree with this framing. We're providing an alternative, more expressive, and more exhaustive config mechanism. We're not removing support for env vars. A user should still be able to use them for the foreseeable future.

The difficulty is that its not possible to define merge rules that make sense all the time. #3840 attempted to define what they merge rules might look like, and there are examples earlier in this issue and in that PR about how merging breaks down. Actually, nobody proposed / defended this solution when solicited proposals. I realize that the idea of falling back to env vars when a thing is not present in file config is slightly different, but the problems with #3840 are overlapping.

lmolkova commented 6 months ago

When I set OTEL_SDK_DISABLED: true it's obvious that I'm disabling the SDK, but when I set OTEL_CONFIG_FILE it's not obvious at all that my OTEL_SERVICE_NAME will stop working.

I can also give you an example where service name is set by one party and config is provided by another - take Azure Functions where service name is the Function App name provided by the vendor and the config is a user concern. User might have no idea where the service name comes from and would be surprised when they decide to add the config.

When I read #3840 I see that merge rules are messy, but I don't understand how it justifies breaking users (even if users didn't read the doc). Can we try to come up with dead-simple rules that work in 80% of most important cases?

jack-berg commented 6 months ago

The obvious part is subjective. Consider that when you choose to set OTEL_CONFIG_FILE, you'll need to look for some resource to tell you what the contents of the file will look like. You'll be pointed to starter config templates that look something like:

resource:
  attributes:
    service.name: ${OTEL_SERVICE_NAME:unknown_service}
tracer_provider: ..

You'll copy this into your environment and point OTEL_CONFIG_FILE to the path. It shouldn't be entirely surprising that if you update it to replace the env var substitution syntax to a hardcoded value, that the env var reference you deleted is now ignored:

resource:
  attributes:
    service.name: foo
tracer_provider: ...

I believe it will be extremely common to use one of these starter config templates. Without one, users won't have any idea of what the structure looks like. The templates should contain references to the existing env vars, and comments describing the behavior that make it hard to misinterpret.

Can we try to come up with dead-simple rules that work in 80% of most important cases?

As mentioned here, I'm not sure how exactly to handle additional proposals when we already went through this process and made a recommendation. Consider escalating to a TC decision.

lmolkova commented 6 months ago

I think when this decision was made it did not take backward compatibility and mixed configuration case (above) into consideration fully.

You'll be pointed to starter config templates that look something like:

this is an assumption I don't share.

[Update] More examples of mixed configuration where OTEL_SERIVE_NAME, resource attributes and other config defaults are set by the environment in which user code runs.

otel-operator
AWS Lambda layers and in the official repo

So I believe env vars and config have to work together and cannot be mutually exclusive to allow cloud providers, otel operator and other environments provide convenience in a consistent manner.

trask commented 6 months ago

As mentioned https://github.com/open-telemetry/opentelemetry-specification/issues/3752#issuecomment-1998560172, I'm not sure how exactly to handle additional proposals when we already went through this process and made a recommendation.

I'd like to suggest that we bring this to the next Specification SIG meeting (March 26).

I think the Configuration SIG has done a great job of driving this and making a decision based on the input and proposals discussed. I also think that given the potential impact the decision will have, it would be worth presenting it to the broader Specification SIG and soliciting any additional feedback in case we find a different consensus or new perspectives on it.

I think that the person who escalates to a TC decision should be responsible for thoroughly understanding the issue and discussions that have taken place up to this point, and organizing / presenting supporting materials to the TC. For this particular issue, there is quite a lot of context and discussion.

I've been following the issue and discussions and can present and lead the discussion at the Specification SIG meeting.

In the likely event that the issue is still contentious after the Specification SIG meeting, it may be worth requesting a TC vote, if for no other reason than to put the issue to rest, and so that the Configuration SIG can move forward with confidence.

tedsuo commented 6 months ago

Thanks Trask. I also want to emphasize that I am not advocating to undo a bunch of work, and in general I trust the judgment of this SIG. I respect that you have really been putting a lot of effort into threading the needle on this design.

However, I do feel that this is really important, and the SIG should expect a public review. It does seem like the feedback has been fairly restricted to a particular point regarding critical env vars, so please don't consider this to be a call for a completely new approach.

If we dig deeper and refine this requirement further, it may be a smaller problem. Possibly all of the critical env vars are resources. That would be a lot simpler to deal with. I appreciate @lmolkova coming in with concrete examples.

austinlparker commented 6 months ago

Couldn't the 'breaking change' be mitigated by having a 'use new config' env var that made it explicit? Instead of just setting OTEL_CONFIG_FILE, you set OTEL_CONFIG_STYLE=file or something as well? Then it's unambiguous what behavior will take precedence.

jackshirazi commented 6 months ago

The chosen solution would work well with existing env vars if instead of saying that there is no default config, we provide a default config which includes all those existing env vars with their dafaults, and any config file specified is merged with the default. ie instead of the proposed template being optional, it is the default and any supplied config is merged to that default. I've seen other yaml config merges, it doesn't sound like it's a difficult thing to do but I don't have specific experience there. This setup would allow the full new config support while still maintaining almost complete backward compatibility

jack-berg commented 6 months ago

In the likely event that the issue is still contentious after the Specification SIG meeting,

I don't see how there wouldn't be. The opposing camps on this discussion remained opposed after months of discussion. There's no single insight to be uncovered that is going to make everyone agree. Given the time constraints of the spec sig and the complexity / nuance of this issue, I highly doubt it can be discussed in a way that won't interpreted as reductive by some. At least one camp is going to come away from this disappointed - there's no way around that.

The reasons not to raise in the spec SIG, and instead escalate to TC decision directly are:

This is the process we laid out previously. Changing the process after a recommendation is strange.
Extends the time to reach a decision, while not meaningfully increasing the probability of avoiding a TC decision.
Diminishes the work of the SIG, consumes cycles repeating same conversations.

codeboten commented 6 months ago

The chosen solution would work well with existing env vars if instead of saying that there is no default config, we provide a default config which includes all those existing env vars with their dafaults

@jackshirazi I could see this work in some of the cases, but it would get more complex in others. For example in the case of the OTEL_SERVICE_NAME, the following default would make sense:

resource:
  attributes:
    service.name: ${OTEL_SERVICE_NAME:unknown_service}

But in other cases (i.e. OTEL_LOGS_EXPORTER) it's no so obvious to me what the right thing to do is. The following could be the default config to support OTEL_LOGS_EXPORTER=otlp:

logger_provider:
  processors:
    - batch:
        exporter:
          otlp:

But as a user, how do I go about removing this default in my config? Or is the configuration provider going to only produce the default configuration based on the environment variables available? I feel like this is getting back into the same problem that @jack-berg's merge proposal ran into, but I could be wrong.

lmolkova commented 6 months ago

Couldn't the 'breaking change' be mitigated by having a 'use new config' env var that made it explicit? Instead of just setting OTEL_CONFIG_FILE, you set OTEL_CONFIG_STYLE=file or something as well? Then it's unambiguous what behavior will take precedence.

I have some examples (including otel operator and AWS lambda) here https://github.com/open-telemetry/opentelemetry-specification/issues/3752#issuecomment-1998845312 where I show that the configuration comes from multiple places and file vs env are not mutually exclusive. As long as they are mutually exclusive:

we're asking users to provide the fully resolved configuration
know all the env vars their environment sets by default and apply them
removes the standard env vars environments could set before and now they would likely not provide them as all since it's so easy for users to break it

If the env (such as AWS lambda) will provide the file, we're still asking AWS instrumentation and user app to resolve all the conflicts themselves.

I think it also addresses @jackshirazi comment. "we provide a default config which includes all those existing env vars " we provide does not mean that users use it and does not mean a great user experience or great integrations with the ecosystem.

lmolkova commented 6 months ago

To clarify, I do think that configuration SIG did an awesome job and yaml configuration is highly valuable. All I'm asking for is to experiment and see how the decision to drop env vars in presence of config will play out in real life e.g. with AWS lambda.

Given that OTEL_PROPAGATORS is literally part of FaaS semconv and the resource attributes in AWS lambda and otel operator are part of otel own's repos I think it's a fair ask.

danielgblanco commented 6 months ago

I think any of the options discussed previously would need opt-in from users and for them understand the side-effects, so anything that can be done in that respect via templates/documentation or runtime warnings (e.g. identifying supported vars present in env while parsing config file that are not present in config file for substitution) would go a long way in reducing end-user issues.

Without adding any new proposal to what's already been agreed, I'd like to add an extra data point for the recommended option 2.b (apologies if this has been discussed before). This option would easily enable use cases where the engineer in charge of defining a config file is not the owner of the service using that config file (e.g. platform engineers providing a default config to be used across their org) and may not want to expose all config options to their end-users, or may need to hardcode certain values for standardisation. It is maybe a niche case, and there are other ways of achieving this, but it would certainly help.

austinlparker commented 6 months ago

To clarify, I do think that configuration SIG did an awesome job and yaml configuration is highly valuable. All I'm asking for is to experiment and see how the decision to drop env vars in presence of config will play out in real life e.g. with AWS lambda.

Given that OTEL_PROPAGATORS is literally part of FaaS semconv and the resource attributes in AWS lambda and otel operator are part of otel own's repos I think it's a fair ask.

I feel like this isn't really accurate, though? It's not dropping env vars, the switch is literally just "do you want to use the old way or the new way". We could provide a default config that keeps the explicit existing mappings, no?

yurishkuro commented 6 months ago

@jack-berg I see comments referring to 2(b) but I don't see such label in the issue description. I would recommend updating the issue description with the "latest" view & recommendation so that people don't have to search through 90+ comments.

lmolkova commented 6 months ago

I feel like this isn't really accurate, though? It's not dropping env vars, the switch is literally just "do you want to use the old way or the new way". We could provide a default config that keeps the explicit existing mappings, no?

the point I'm trying to make that it breaks existing otel integrations. The moment user chooses OTEL_CONFIG_FILE in one way or another, env vars are ignored. Things like AWS lambda integrations set env vars. Can you please describe how they will work in the new world?

If the only answer is that every config users enable MUST have service.name: ${OTEL_SERVICE_NAME:unknown_service}, that's a problem:

It is a breaking change for all the existing envs that set env vars to provide users a default. E.g. all existing docs for AWS Lambda mention that there is a default (AWS layers configure using OTEL_SERVICE_NAME) and you can override it. OTEL_SERVICE_NAME used to work all the time and now it does not.
What if user removes or changes this line? Existing docs tell that AWS lambda sets something by default, why should I explicitly mention it in my config?

You can always argue with RTFM, but it feels like we could not figure out how to deal with merge conflicts and degraded existing user experience. I don't support it.

lmolkova commented 6 months ago

While I assume there are problems unique to otel, existing configuration frameworks all recognize that config comes from multiple sources and have conflict resolution/ordering as a part of them (MicroProfile, ASP.NET Core configuration).

So if otel is going to have first-class configuration story, it will eventually have to solve multi-source configuration merge.

yurishkuro commented 6 months ago

@lmolkova your definition of breaking is odd to me. It's like saying I want to drive Tesla but I want everything to remain as it was in Honda. Nobody is forcing the user to use the new mechanism. If they choose to use it, there's still a config file template that supports legacy env vars.

I prefer OTEL to provide a solution that is simple to maintain and debug but which is extensible. The config with substitution satisfies that. If someone wants to develop and overlay solution, it can be done as an add-on (library or utility), it does not need to block this simple/general solution.

tedsuo commented 6 months ago

How is respecting OTEL_SERVICE_NAME not simple? That seems incredibly simple. I'm confused at how this is being presented as an "us or them" thing. Why is it not possible to add this to the current proposal?

lmolkova commented 6 months ago

@yurishkuro can you explain how you can keep AWS Lambda or otel operator defaults working with the new config and approach updating all documents in the world that refer to OTEL_SERVICE_NAME?

My definition of breaking closely matches my company's one. And we are famous for providing back-compat to our users.

tedsuo commented 6 months ago

@lmolkova are there other critical env vars that you are aware of, or is it mostly service name?

austinlparker commented 6 months ago

the point I'm trying to make that it breaks existing otel integrations. The moment user chooses OTEL_CONFIG_FILE in one way or another, env vars are ignored.

This is the part that's confusing to me. They are not ignored, they just look to the config file for mapping rather than the old style behavior. It is a binary switch between "use the new format or use the old format". I would agree that it is a breaking change, but afaik it is not dissimilar to how .NET has done such things. For example, old versions of .NET Core used project.json, this was eventually changed to using .csproj. You could not use both at the same time, you had to switch, and there was a migration path.

lmolkova commented 6 months ago

@lmolkova are there other critical env vars that you are aware of, or is it mostly service name?

I think OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES are two most critical ones. We also refer to OTEL_PROPAGATORS in semconvs which I believe is critical for AWS Lambda and XRay - they set it. I wonder if @tylerbenson (who I believe is much more familiar with it than me) could comment on how critical it is to let AWS set it and how much will be broken if user accidentally adds incompatible config.

I'm not aware of other critical attributes. But I'd be interested to learn from owners of other environments known to configure OTel defaults such as opentelemetry-operator.

We also have friends in opentelemetry-dotnet who have mixed (file and env var) config since the beginning and can probably share their learnings.

tedsuo commented 6 months ago

Most users have a very simple SDK setup. A stock sampler, the batch processor, one exporter, and a handful of common resources. Given how common this setup is, continuing to respect common env vars that work with these scenarios would eliminate a lot of the need for boilerplate. It would also eliminate a lot of footgun situations, especially the ones where the platform is setting these attributes, not the end user. So there's value in providing some form of this even if it is a simple and incomplete solution.

@codeboten do you mind explaining what it is difficult about supporting a handful of these env vars as part of the proposed solution? I understand how a generalized merge solution is out of the question, but not why a couple of really simple rules triggered by the presence of certain env vars would be problematic.

lmolkova commented 6 months ago

the point I'm trying to make that it breaks existing otel integrations. The moment user chooses OTEL_CONFIG_FILE in one way or another, env vars are ignored.

This is the part that's confusing to me. They are not ignored, they just look to the config file for mapping rather than the old style behavior. It is a binary switch between "use the new format or use the old format". I would agree that it is a breaking change, but afaik it is not dissimilar to how .NET has done such things. For example, old versions of .NET Core used project.json, this was eventually changed to using .csproj. You could not use both at the same time, you had to switch, and there was a migration path.

For the sake of simplicity, let's focus on functional limitations rather than breaking aspect.

I'm AWS Lambda and otel user.
I followed AWS lambda docs to enable otel. I was happy with the defaults: service name, propagator, exporer (to xray or default otlp)
I'm a happy user and everything works.
Now I update otel to the new version. I see that I can specify the config now and customize more
I add the config.
- I'm a user who doesn't read docs - I copy one config property that sets otlp exporter timeout. Everything falls apart. Surprising.
- I'm user who reads official docs. I copy the whole sample config with tons of boilerplate service.name: ${OTEL_SERVICE_NAME:unknown_service} and wonder if it's all necessary. I try deleing it and everything falls apart. Another surprise.

Now I'm a cloud provider.

I notice otel added file config option and realize that my way to give users defaults no longer works if they provide the file
It's likely that I got a report or two from users who tell me that default config no longer works
I don't know how to give users the defaults that's not fragile. Do I give them a file with defaults to modify or they give me a file?
Eventually I understand that I can use X-MS-FOO-BAR instead of OTEL_SERVICE_NAME and will tell users to just set it in their configs service.name: ${X-MS-FOO-BAR}.
A year from now nobody uses OTEL_SERVICE_NAME because it's fragile

Let me know if it clarifies my point.

codeboten commented 6 months ago

do you mind explaining what it is difficult about supporting a handful of these env vars as part of the proposed solution

@tedsuo i don't think support some of the variables is hard, the complexity really only comes in if we're talking about supporting all the variables that exist today as they don't currently map well to the structure of the configuration schema today. The question then becomes is it ok to only support a few variables, or will end users expect that all the existing variables continue to work as they did before, even when using a configuration file.

I'm not sure that we've explored this idea that only a subset of existing variables need to be supported, my impression was that it was important to support all of them.

austinlparker commented 6 months ago

I feel like creating a mix of 'these variables work all the time' and 'these variables work only some of the time' is worse, though?

tylerbenson commented 6 months ago

Maybe it's worth considering a small list of non-complex environment variables that must be honored with priority over the file config?

codeboten commented 6 months ago

I feel like creating a mix of 'these variables work all the time' and 'these variables work only some of the time' is worse, though?

This is largely why I personally had not considered this option, because then I have to remember what works when. This also led me to not voting in favour of option 1 (new env var schema) because similarly, we would have ended up with a bunch of new variables that work and a bunch of old variables that do not.

Knowing which is which becomes painful. Although maybe not more painful than not support any variables without templating out of the box.

austinlparker commented 6 months ago

This is largely why I personally had not considered this option, because then I have to remember what works when. This also led me to not voting in favour of option 1 (new env var schema) because similarly, we would have ended up with a bunch of new variables that work and a bunch of old variables that do not.

Well, the advantage of option 1 is an explicit v1/v2 config. If v1 options are passed, v2 options are ignored. /shrug

tedsuo commented 6 months ago

Thanks @codeboten. I think we've narrowed it down to the main issue being just the env vars that platforms set. I think it's fine for end users to be instructed to explicitly add every env var they want to use directly into their config file, and to mention that these specific env vars can also be applied automatically in order to allow for platform support. Framing it that way would limit expectations.

Providing copy/paste templates is a very nice and helpful thing to do. I'm glad we're doing that. But it would be a bummer if we turned that convenience around on our users, and instead said you must always have a bunch of boilerplate just in case a platform needs it. If a handful of env vars gets rid of that problem for 90% of users, I think it is worth the small amount of confusion it might cause for a small number of users.

austinlparker commented 6 months ago

Providing copy/paste templates is a very nice and helpful thing to do. I'm glad we're doing that. But it would be a bummer if we turned that convenience around on our users, and instead said you must always have a bunch of boilerplate just in case a platform needs it. If a handful of env vars gets rid of that problem for 90% of users, I think it is worth the small amount of confusion it might cause for a small number of users.

Config files can be merged, right? Users wouldn't need to supply boilerplate, if someone's using a managed platform then they could document the existing config options.

Again, I think it would be an extremely poor user experience to have some "magic" env vars that continue to work the way they always have, regardless of how few there are. If you're using the new config, everything should follow the new model. Special casing a handful of variables is going to inevitably result in bikeshedding over which env vars are special enough to get grandfathered in.

edit -- my reference to merged config files is in the sense that tooling exists to merge yaml config together, I presume that something to accomplish this will be part of the tool chain for managed environments

ocelotl commented 6 months ago

haven't read all the comments here, but do we need to make environment variables stop working if a config file is being used? We can have a config file and keep environment variables working, in fact, it is what would happen if we used a configuration file now in the Python prototype.

ocelotl commented 6 months ago

More details here

open-telemetry / opentelemetry-specification

Should file configuration merge environment variable configuration? #3752