Make leader-for-life leader election more integrated with controller-runtime

joelanford commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe.

Yes. It isn't possible to use leader-for-life leader election with controller-runtime's manager when also using liveness and readiness probes.

Using controller-runtime's manager out of the box, the following sequence of events happens when manager.Start() is called:

1. Liveness and readiness probes are started.
2. Leader election is started.
3. Controllers are started.

When using leader-for-life from this repo, it must be called prior to manager.Start() since controller-runtime doesn't support pluggable leader election implementations. The sequence of events in this case is:

1. Leader election is started.
2. Liveness and readiness probes are started.
3. Controllers are started.

Notice that 1) and 2) are swapped. This swap causes deadlocks when upgrading operator deployments that use leader-for-life. When the deployment attempts to roll out a new version, the new pod starts up and first attempts to become the leader, failing indefinitely until the old pod relinquishes ownership. However, the old pod will not relinquish ownership until it disappears, and it won't disappear until the new pod reports that it's healthy. Unfortunately, the new pod will never be able to report that it's healthy, because it needs to be the leader before it starts its liveness and readiness probe servers.
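To make the circular dependency concrete, here is a minimal toy model of the rollout in plain Go. It does not use controller-runtime or operator-lib; the names `StartupOrder`, `ProbesFirst`, `ElectionFirst`, and `rolloutCompletes` are hypothetical and exist only for this sketch. It assumes one old pod holding the lock, one new pod, and a deployment that removes the old pod only after the new pod reports ready.

```go
package main

import "fmt"

// StartupOrder describes when a pod starts serving its health probes
// relative to acquiring leadership. (Hypothetical type for this sketch.)
type StartupOrder int

const (
	// ProbesFirst: probes are served before leader election begins.
	ProbesFirst StartupOrder = iota
	// ElectionFirst: leader election must succeed before probes are served.
	ElectionFirst
)

// rolloutCompletes simulates a rolling update with one old pod (the current
// leader) and one new pod for at most maxTicks steps. It reports whether the
// new pod ends up as leader with the old pod gone.
func rolloutCompletes(order StartupOrder, maxTicks int) bool {
	oldPodRunning := true
	newPodLeader := false
	newPodReady := false

	for tick := 0; tick < maxTicks; tick++ {
		// Leader-for-life: the lock is released only once the old pod is gone.
		if !oldPodRunning {
			newPodLeader = true
		}

		// Probes are served either right away or only after winning the election.
		switch order {
		case ProbesFirst:
			newPodReady = true
		case ElectionFirst:
			newPodReady = newPodLeader
		}

		// Assumption: the deployment removes the old pod only after the
		// new pod reports ready.
		if newPodReady {
			oldPodRunning = false
		}

		if !oldPodRunning && newPodLeader {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("probes first completes:", rolloutCompletes(ProbesFirst, 10))
	fmt.Println("election first completes:", rolloutCompletes(ElectionFirst, 10))
}
```

In this model the probes-first order finishes on the second tick, while the election-first order never makes progress no matter how large `maxTicks` is, which matches the deadlock described above.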

Describe the solution you'd like

To work upstream to make controller-runtime support a pluggable leader election implementation such that leader-for-life can be used by the manager.

estroz commented 3 years ago

I'd like to suggest deprecating this package in favor of controller-runtime/pkg/leaderelection, or at least adding a note that it has this bug until it is fixed, to deter users. client-go's leader-with-lease (and controller-runtime's wrapper) are quite stable and easy to use now (they were not back when this leader-for-life library was originally written), and even though it does not guarantee no overlap between elections, it seems to be the de facto standard upstream.

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

joelanford commented 3 years ago

/lifecycle frozen

erikgb commented 2 years ago

Is anyone working on this? What I would love to see is this leader-for-life feature available in controller-runtime! A pluggable leader election mechanism could be useful on its own, but I think getting leader-for-life into controller-runtime would be more sustainable.

operator-framework / operator-lib

Make leader-for-life leader election more integrated with controller-runtime #48
