open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Specify the collector configuration in an external ConfigMap #2871

Closed matthagenbuch closed 5 months ago

matthagenbuch commented 6 months ago

Component(s)

collector

Is your feature request related to a problem? Please describe.

When changing the configuration of a collector, I have no option for a controlled rollout / rollback process in the case of an error. Once the new configuration is applied to my OpenTelemetryCollector resource, the old configuration is overwritten and all running collector pods now see the new configuration.

This makes it difficult to control the rollout process, as any pod that restarts for any reason will load with the new configuration. It also makes it impossible to roll back to the previous configuration without again updating the OpenTelemetryCollector resource.

Describe the solution you'd like

For other services, we "version" the configuration in ConfigMap resources. Each time a configuration is updated, a new ConfigMap with a unique name is created and the deployment/statefulset is updated to reference it. This ensures that running pods see no change in configuration, and rolling back can be done by an external tool such as Argo Rollouts since the old configuration is still on the cluster.
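For illustration, a minimal sketch of that pattern; all names, the hash suffix, and the collector config itself are made up for the example:

```yaml
# Versioned ConfigMap: the name embeds a content hash, so each config
# change creates a new object instead of mutating the existing one.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config-7f9c4b   # "-7f9c4b" is an illustrative hash
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
---
# The workload references the versioned name. A config change means
# creating a new ConfigMap and bumping this reference, which triggers a
# normal rollout; the old ConfigMap stays on the cluster, so rollback
# only has to restore the previous reference.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otc-container
          image: otel/opentelemetry-collector-contrib:0.100.0
          args: ["--config=/conf/collector.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config-7f9c4b   # bump per config version
```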

Ideally, I'd like to be able to specify the OpenTelemetryCollector configuration by passing in a ConfigMap, which I manage myself. This would allow me to use the same versioning / rollback system as for other services I own.
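As a sketch of what the requested API could look like (configRef is a hypothetical field name used purely for illustration; nothing like it exists in the CRD today):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-collector
spec:
  mode: deployment
  # Hypothetical field for this feature request: reference an externally
  # managed (and externally versioned) ConfigMap instead of inlining the
  # configuration under spec.config.
  configRef:
    name: otel-collector-config-7f9c4b
    key: collector.yaml
```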

Describe alternatives you've considered

The OpenTelemetry Operator could automatically create and manage versioned ConfigMaps for the collector configuration. This might be a useful OOTB feature for some users, but in my case I would prefer to manage ConfigMaps external to the operator in order to guarantee consistency with my other services.

Additional context

There is a similar open issue for specifying multiple configurations to be merged. The solutions proposed there which involve specifying an external ConfigMap would also resolve this feature request.

Currently, the operator does have a configmaps field which could be used to mount the collector configuration. However, the operator explicitly doesn't support overriding the --config arg, so we can't use these ConfigMaps for collector configuration.
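For reference, a sketch of that existing field; the exact sub-field names (name, mountpath) are as I recall them from the v1alpha1 CRD and worth double-checking:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-collector
spec:
  # The inline config remains mandatory; the ConfigMap below is only
  # mounted as extra files and cannot be passed to --config.
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
  configmaps:
    - name: extra-config    # existing ConfigMap to mount into the pod
      mountpath: /etc/extra
```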

swiatekm commented 6 months ago

I don't see any obvious reason not to do this. Part of me does want to go the versioned configmap route, as our plans for the operator's future go more in the direction of providing higher-level abstractions on top of the raw collector config. Letting users use an external ConfigMap solves this specific problem, but will also potentially lock them out of distributed configuration and other similar features. But these solutions aren't mutually exclusive. If you'd like to submit a PR to add external ConfigMap support, I'd be willing to accept it. WDYT @jaronoff97 @pavolloffay ?

jaronoff97 commented 6 months ago

I think adding support for allowing the overriding of the config makes sense. I also think this probably gets easier in a world with distributed configuration, though I don't think we need to wait for that world.

CarlosLanderas commented 5 months ago

@matthagenbuch Thanks for this contribution; I think it's a total must.

Using a single shared ConfigMap can have serious consequences for previously working collectors once a faulty config is deployed by mutating the OpenTelemetryCollector resource.

We found that, with the default maxSurge and maxUnavailable of 25%, the pods replaced with the failing configuration put increased pressure on the remaining working pods, which eventually run out of resources, restart, and adopt the new faulty configuration.

This can cause a cascading failure that wipes out all the working collectors.

I was specifically looking for this and found this issue.
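As a hedged mitigation sketch for that failure mode: setting maxUnavailable to 0 means old pods are only terminated after their replacements report Ready, so a config that fails its readiness check (e.g. via the health_check extension) stalls the rollout instead of cascading. This is only a fragment of a Deployment spec, the operator's CRD may not expose these fields directly, and it doesn't protect pods that restart for unrelated reasons, since they still pick up the new config from the shared ConfigMap:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # Kubernetes default: up to 25% extra pods
      maxUnavailable: 0   # never remove a working pod before its
                          # replacement passes its readiness probe
```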

matthagenbuch commented 5 months ago

> We found that, with the default maxSurge and maxUnavailable of 25%, the pods replaced with the failing configuration put increased pressure on the remaining working pods, which eventually run out of resources, restart, and adopt the new faulty configuration.

@CarlosLanderas Yes, this is a particularly nasty failure mode in the current implementation. I'm working on a PR to add support for multiple versioned ConfigMaps, which should prevent such failures from propagating. While investigating, I found that specifying an external ConfigMap (as proposed in this issue) would break other operator functionality that relies on the collector config at provisioning time.

CarlosLanderas commented 5 months ago

Yeah @matthagenbuch, I saw the PR; that's why I left the comment :).

Thanks for the contribution!

PS: Is there any chance to overcome the limitation you mention?

When running the collector in production at scale, this is very problematic.

For now, we are going to rely on Kubernetes integration tests before deploying a new config version.

jaronoff97 commented 5 months ago

@CarlosLanderas would something like this issue be helpful?

CarlosLanderas commented 5 months ago

> @CarlosLanderas would something like this issue be helpful?

@jaronoff97, does this mean the operator will run config validation before patching the ConfigMap and Deployment/DaemonSet?

In that case, it will definitely help :)

pavolloffay commented 5 months ago

Unfortunately the operator does not validate collector config.

rmvangun commented 3 months ago

@jaronoff97 Curious how #2946 resolves this issue. How does one specify a ConfigMap as an alternative to the config: | parameter?

In my use case, I need some ability to dynamically compose the configuration prior to passing it to the OpenTelemetryCollector. I'd like to do so using declarative constructs like Kustomize, which I can use to generate a ConfigMap resource dynamically. If there are other ways of handling this, I'm open to suggestions.
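For what it's worth, the Kustomize half of that workflow is a standard configMapGenerator (names here are illustrative); the open question in this thread is how to wire the generated name into the OpenTelemetryCollector resource:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
  - name: otel-collector-config
    files:
      - collector.yaml    # the collector config, maintained as a plain file
# Kustomize appends a content hash to the generated name
# (e.g. otel-collector-config-5f6kg7h9m2) and rewrites references to it
# in other resources, which is exactly the versioning behavior discussed
# in this thread.
```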

jaronoff97 commented 3 months ago

@rmvangun we have a separate issue for this #1906 which is closer to what you're looking for. We're still evaluating designs for that, so please comment on that issue with ideal use cases for you! :bow: thank you!