Closed mgencur closed 4 years ago
While this is unfortunate, it is a major change in the system's mode of operation, and I guess asking for a restart is reasonable (I think switching directly to HA mode would be a bit hard to implement). Also related to Matt's latest doc. /cc @mattmoor @pmorie
TBH, my preference would be to get enough mileage through the leader election code that we can just enable it by default and remove that moving part.
We could start with a PR that enables it by default, uses just one pod per controller (same as now), and runs all test suites in this mode. For the HA tests it would be enough to add one more pod, which would automatically run in leader-election mode.
Any objections on enabling this by default for all components now?
Linking the other issue that is related: https://github.com/knative/pkg/issues/1319 It will affect implementation significantly.
The first implementation is here: https://github.com/knative/serving/pull/8205. We should consider knative/pkg#1319 before merging it.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
When leader election is enabled at runtime in the config-leader-election ConfigMap, the existing pods do not apply these settings. They continue running in the normal mode, and later, when more pods are started for a controller and a new leader is elected, the original controller keeps reconciling Knative resources together with the new leader. This leads to race conditions. The user has to manually restart the pods after applying those settings.
What version of Knative?
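For context, enabling leader election amounts to editing the config-leader-election ConfigMap along these lines. This is only a sketch: the exact key names (enabledComponents in particular) come from the ConfigMap template of the release in use and may differ between versions, so treat them as assumptions and check the shipped example config.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-leader-election
  namespace: knative-serving
data:
  # Which reconcilers should use leader election (assumed key name;
  # verify against the release's config-leader-election template).
  enabledComponents: "controller"
  # Standard lease tuning knobs for the election.
  leaseDuration: "15s"
  renewDeadline: "10s"
  retryPeriod: "2s"
```

As described above, applying this ConfigMap alone is not enough today: already-running pods keep their old mode, so until this issue is fixed the pods have to be restarted manually (for example with a rollout restart of the controller Deployment) for the settings to take effect.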
Expected Behavior
The leader election mode is reliably enabled after changing the config-leader-election ConfigMap.
Actual Behavior
The existing pod keeps running in non-leader-elected mode and keeps reconciling services together with any new leader controller.
Steps to Reproduce the Problem