open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Operator takes a long time to reacquire lease after previous leader exits #3058

Open swiatekm-sumo opened 1 week ago

swiatekm-sumo commented 1 week ago

Component(s)

No response

Describe the issue you're reporting

When leader election is enabled for the operator, a new leader takes upwards of two minutes to acquire the lease after the previous leader exits. Leader election is enabled by default, so this affects even single-replica Deployments.

A proposed fix is to release the lease on manager exit by enabling LeaderElectionReleaseOnCancel (https://github.com/kubernetes-sigs/controller-runtime/blob/8290d13680ed6066d9789c0ed59ff604387e20da/pkg/manager/manager.go#L202). This can apparently be unsafe in some circumstances, but we exit immediately once the manager exits, so we should be fine.
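As a rough configuration sketch of what enabling this option could look like in a controller-runtime based `main`, assuming a hypothetical election ID and omitting everything else the operator actually sets up (this is not the operator's real `main.go`):

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: true,
		// Hypothetical ID, for illustration only.
		LeaderElectionID: "example-operator-lock",
		// Ask the manager to give up the lease voluntarily when the context
		// passed to Start is cancelled, instead of letting it expire.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		os.Exit(1)
	}

	// Start blocks until the signal handler cancels the context. The process
	// exits right after Start returns, so no work continues once the lease
	// has been released.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The safety caveat comes down to whether the binary keeps doing leader-only work after the manager stops. Returning from `main` directly after `Start`, as in this sketch, is the pattern the option expects.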

fyuan1316 commented 3 days ago

I just had this happen to me. Deleting the operator pod and recreating it should reproduce it, although it does take a bit longer.