Configuration reloading needs rethinking from error handling perspective

tigrannajaryan commented 1 year ago

This was previously raised in comments here.

Currently if the resolved config is invalid we shutdown the Service and exit the process. This works fine for Collectors with local configuration sources. It is an expected behavior to fail fast and show where the problem is when you try to start the Collector.

On the contrary the current behavior is ill-suited for remote configuration management implemented via confmap.Provider. The config changes in this case are often unattended (config is rolled out at scale and no-one is typically baby-sitting individual Collectors) and if anything goes wrong during the config update the Collector is seen as suddenly crashing and losing the connection to the remote config source and losing any chance to recover.

Instead of this we may want to have a more desirable behavior:

When as a result of a triggered watch func the config is reloaded and the resulting config is invalid we may want to avoid restarting the Service at all. Instead we may want to be able to report the fact of the invalid config to the confmap.Provider so that it can report back to the remote config source. This requires validating the config first and only after that trying to restart the Service.
If config is valid but Service.Start() function fails we may want to report the failure to confmap.Provider.
If new config loading fails for whatever reasons we may want to do one of these things:
- Retry loading the new config a few times, with exponential backoff. This can be useful if a transient condition prevents the Start().
- Roll back to the last known good config (the last one which resulted in a successful Service.Start()). If Start() fails with the last known good config as well (which may happen if some external condition is different) retry it with exponential backoff until it succeeds.

An alternate to this is to say confmap.Provider is not recommended to be used with watchable remote config sources that can change the config while the Collector is unattended. Instead we can recommend that such watchable remote config source should only be used with external Supervisor processes which take care of restarting failed Collector processes. This is a valid approach if we want to go with this, but we need to be explicit that the current confmap.Provider design is not suited for such uses cases.

tigrannajaryan commented 1 year ago

@open-telemetry/collector-approvers @amdprophet @portertech @Aneurysm9 any thoughts on this?

bogdandrutu commented 1 year ago

What is config.Provider? We have 2 terms, but none of them 100% match confmap.Provider and service.ConfigProvider.

tigrannajaryan commented 1 year ago

What is config.Provider? We have 2 terms, but none of them 100% match confmap.Provider and service.ConfigProvider.

Oops, sorry. I mean confmap.Provider. Updated the description.

bogdandrutu commented 1 year ago

There are multiple things that you are touching here, and independent of the opamp and remote control in general:

What to do when (independent of the configuration source/path) we are unable to receive/unmarshal/deploy a new configuration OR the first configuration? I think this is important thing to resolve independent of the source.
How does a remote control protocol get notify in case of an error. The answer to what to do in case of an error I think should be the same as 1.

Should we treat these things separately?

tigrannajaryan commented 1 year ago

There are multiple things that you are touching here, and independent of the opamp and remote control in general

I think this is fair. I was approaching this from the remote configuration perspective but you are right, an argument can be made that these are also related concerns even with a local file config. For example: what happens if the machine is restarted and the Collector cannot start because the port it tries to listen on is not available. We current fail the entire Collector. Should this be the case or we may want a behavior more similar to what I described with retrying, etc.

How does a remote control protocol get notify in case of an error

BTW, this may also be important if there is no remote configuration enabled, but there is status reporting to a remote destination. OpAMP for example allows this, you can choose to have status reporting only even if the agent is not receiving the config from remote sources. OpAMP allows to report the effective config and the health of the agent. I believe this can be still very valuable even when you are using a local file as the config.

bogdandrutu commented 1 year ago

Not saying it is not, but I think they are 2 separate problems that we need to discuss.

codeboten commented 1 year ago

In the retry case, this is mentioned:

Roll back to the last known good config

How would this work in the following case where the machine is restarted? Would the good config be cached somewhere that persists machine restarts?

For example: what happens if the machine is restarted and the Collector cannot start because the port it tries to listen on is not available.

It would be really helpful to have a diagram of what happens in all these various cases, would be great for documentation and to help make sense of all the permutations :)

bogdandrutu commented 1 year ago

@codeboten agree. So we need a design proposal with the state machine (maybe) for the case where loading a config fails or the loaded config is invalid.

/cc @swiatekm-sumo who also talked about this.

I feel we need a champion to think about this problem and propose a state diagrams (transitions). Agree on that, then we can discuss how to achieve that in code.

tigrannajaryan commented 1 year ago

@bogdandrutu @codeboten Here is the proposed transition diagram, assuming we get rid of ReportFatalError:

cc @open-telemetry/collector-approvers

bogdandrutu commented 1 year ago

I think that restoring the "last known good config" is against all SRE principles (you have different configs, kind of random running in production), so I am ok to have that only if we default to not do that and exit even in the case of an error when "last known config" is not empty.

tigrannajaryan commented 1 year ago

I think that restoring the "last known good config" is against all SRE principles (you have different configs, kind of random running in production), so I am ok to have that only if we default to not do that and exit even in the case of an error when "last known config" is not empty.

I agree with not restoring the "last known good config". But do we need to exit? We can keep trying to Start() (maybe it is a transient error) and we can keep waiting for a new config. In the meantime if the confmap.Provider is using OpAMP it can report that something bad happened and the operator can intervene.

Also, independently from the above, one more thing: you said restoring the last config is bad because you can have different configs running in production. But we are going to have some config differences across the fleet no matter what. This can happen if:

The particular agent did not receive its config from the remote source (maybe there is no connection).
The agent received the config but it does not pass the validation (bad yaml or unsupported settings). Are you suggesting that we don't validate the config? Or we validate and knowingly accept it and stop the Collector?

Does this negate the argument against restoring the config or you still think it is a bad idea?

Anyway, I don't mind eliminating the "restore" step if we believe it is harmful. Maybe it can be an option that the user can select?

swiatekm commented 1 year ago

I think that restoring the "last known good config" is against all SRE principles (you have different configs, kind of random running in production), so I am ok to have that only if we default to not do that and exit even in the case of an error when "last known config" is not empty.

It's not exactly the same thing, but it's definitely normal to halt the rollout of a bad config change, and most orchestrators will do so automatically. The end result of that is that we have the old config live even though a new one was rolled out, which is what we're talking about here.

I think the argument for making this optional is right though, as the user will typically want their orchestrator to manage this. But if they want to use OTRM without an orchestrator, they definitely don't want the collector to terminate on a bad config.

open-telemetry / opentelemetry-collector

Configuration reloading needs rethinking from error handling perspective #6226