kontena / pharos-host-upgrades

Kube DaemonSet for host OS upgrades
Apache License 2.0

Master upgrade reboot interrupts lock waits, causing simultaneously scheduled upgrades on other nodes to be skipped #20

Closed: SpComb closed this issue 6 years ago

SpComb commented 6 years ago

If a kube master host upgrades and reboots, it will interrupt any simultaneously scheduled worker host upgrades, because the kube API watch for the upgrade lock will fail:

2018/05/30 10:06:00 Acquiring kube lock...
2018/05/30 10:06:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/30 10:06:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/30 10:06:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=ubuntu-xenial: locked
2018/05/30 10:06:00 kube/lock kube-system/daemonsets/host-upgrades: watch v1.ListOptions{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, LabelSelector:"", FieldSelector:"metadata.name=host-upgrades", IncludeUninitialized:false, Watch:false, ResourceVersion:"4510", TimeoutSeconds:(*int64)(nil), Limit:0, Continue:""}
2018/05/30 10:07:00 Scheduler is busy, skipping scheduled run
E0530 10:07:21.684926       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=51, ErrCode=NO_ERROR, debug=""
2018/05/30 10:07:21 Failed to acquire kube lock: watch closed before Until timeout

The host-upgrades pod will be restarted, but it will have missed that scheduled upgrade. With daily upgrades scheduled simultaneously on the master and worker nodes, a worker node may thus only install its upgrades the following day, depending on whether the master node(s) happen to acquire the lock first.

The kube locking should probably retry the watch. In an HA setup the retry should succeed immediately, but in a single-master setup it needs to be patient enough to let the master node finish rebooting.
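A minimal sketch of what that patient retry could look like, assuming a hypothetical watchOnce helper that wraps the existing single get+watch cycle; the function name and the 5 minute patience value are illustrative, not the project's actual API:

```go
import (
	"fmt"
	"log"
	"time"
)

// Minimal sketch of a patient watch retry. watchOnce is assumed to run one
// get+watch cycle and return nil once the lock label clears, or an error if
// the watch is closed (e.g. by the apiserver going away during a reboot).
// The 5 minute patience is illustrative; it just needs to cover a
// single-master reboot.
func waitForLock(watchOnce func() error) error {
	deadline := time.Now().Add(5 * time.Minute)
	for {
		err := watchOnce()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("kube lock wait failed: %v", err)
		}
		log.Printf("kube lock watch closed, retrying in 10s: %v", err)
		time.Sleep(10 * time.Second)
	}
}
```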

SpComb commented 6 years ago

This seems to be covered by https://github.com/kubernetes/kubernetes/issues/31345, with a new RetryWatcher in https://github.com/kubernetes/kubernetes/pull/50102... that retries the Watch on any and all errors, but does so by tracking the last seen resourceVersion, so it doesn't need to retry the Get...
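For reference, a RetryWatcher along those lines now lives in client-go under k8s.io/client-go/tools/watch. A rough sketch of how it could be wired up for this lock, using the current client-go API; the namespace and object name are taken from the log above, and the resourceVersion is the one returned by the initial Get, everything else is illustrative:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// Sketch: resume the lock watch from the resourceVersion returned by the
// initial Get, so a dropped connection (GOAWAY during a master reboot) is
// retried without re-running the Get.
func newLockWatcher(client kubernetes.Interface, resourceVersion string) (watch.Interface, error) {
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = "metadata.name=host-upgrades"
			return client.AppsV1().DaemonSets("kube-system").Watch(context.TODO(), options)
		},
	}
	return watchtools.NewRetryWatcher(resourceVersion, lw)
}
```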

Retrying the Get for arbitrary errors doesn't seem like a good idea, because then it will get stuck on persistent errors like missing ClusterRole rules...
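One way around that would be to classify errors before retrying. A rough sketch using apimachinery's error helpers; the predicate and its name are assumptions, not the project's code:

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Sketch: only retry errors that can plausibly clear on their own (the
// apiserver going away during a reboot, timeouts, overload), and fail fast
// on persistent misconfiguration such as missing ClusterRole rules.
func isRetryableLockError(err error) bool {
	switch {
	case apierrors.IsForbidden(err), apierrors.IsUnauthorized(err):
		return false // RBAC / auth problems won't fix themselves
	case apierrors.IsServerTimeout(err), apierrors.IsTooManyRequests(err), apierrors.IsServiceUnavailable(err):
		return true
	default:
		// Connection-level failures (GOAWAY, refused connections while the
		// master reboots) don't map to API status errors; treat them as
		// retryable too.
		return true
	}
}
```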

SpComb commented 6 years ago

Thinking about it more, it would make sense for the top-level Kube.AcquireLock() to retry the locking regardless of the particular kube error... that implementation would tie in with #1 as high-level functionality, so with --schedule-window=2h it should retry for up to two hours.
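A sketch of what that window-bounded retry could look like, reusing the hypothetical isRetryableLockError predicate from above together with client-go's wait helpers; the function name and the 30s poll interval are illustrative:

```go
import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Sketch: bound the whole acquire cycle (get + test + watch) by the
// --schedule-window value, so a single-master reboot of a few minutes
// doesn't cost a full day's upgrade run. acquire stands in for the
// existing single-shot lock acquisition.
func acquireLockWithinWindow(acquire func() error, window time.Duration) error {
	return wait.PollImmediate(30*time.Second, window, func() (bool, error) {
		err := acquire()
		switch {
		case err == nil:
			return true, nil // lock acquired
		case isRetryableLockError(err):
			log.Printf("AcquireLock failed, retrying: %v", err)
			return false, nil // keep polling until the window closes
		default:
			return false, err // persistent error, give up immediately
		}
	})
}
```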