kontena / pharos-host-upgrades

Kube DaemonSet for host OS upgrades
Apache License 2.0

Retry kube lock acquire until --schedule-window expires #22

Closed: SpComb closed this 6 years ago

SpComb commented 6 years ago

Run each scheduled task with a context.Context, using the --schedule-window=1h option to set a deadline for the task execution.

Fixes #1: kube/Lock.Acquire takes a context.Context and times out if the lock is not freed before the context expires.

This required re-implementing the k8s.io/client-go/util/retry/RetryOnConflict because of a bug in how it handles watch timeout errors: when the Acquire => retry => wait path timed out, the upgrades would run without the lock held. See https://github.com/kubernetes/client-go/issues/427

Fixes #20: the top-level Kube.AcquireLock() retries kube/Lock.Acquire until it either succeeds or the context deadline expires. The retries use a crude exponential backoff, which also covers the case of the master being down.

SpComb commented 6 years ago

Testing that the new kube/Lock.modify update conflict retry works:

2018/05/31 11:40:00 Acquiring kube lock...
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=centos-7
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: update
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: retry modify conflict: Operation cannot be fulfilled on daemonsets.apps "host-upgrades": the object has been modified; please apply your changes to the latest version and try again
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=ubuntu-xenial: locked
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: watch v1.ListOptions{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, LabelSelector:"", FieldSelector:"metadata.name=host-upgrades", IncludeUninitialized:false, Watch:false, ResourceVersion:"55656", TimeoutSeconds:(*int64)(nil), Limit:0, Continue:""}
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: wait ok
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=centos-7
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: update
SpComb commented 6 years ago

Testing that the kube/Lock.Acquire timeout works:

2018/05/31 11:39:00 Schedule run started, deadline at 2018-05-31 11:39:10.000963736 +0000 UTC m=+89.629437537
2018/05/31 11:39:00 Acquiring kube lock...
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=ubuntu-xenial
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: update
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: retry modify conflict: Operation cannot be fulfilled on daemonsets.apps "host-upgrades": the object has been modified; please apply your changes to the latest version and try again
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=centos-7: locked
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: watch v1.ListOptions{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, LabelSelector:"", FieldSelector:"metadata.name=host-upgrades", IncludeUninitialized:false, Watch:false, ResourceVersion:"55538", TimeoutSeconds:(*int64)(nil), Limit:0, Continue:""}
2018/05/31 11:39:10 kube/lock kube-system/daemonsets/host-upgrades: wait err: timed out waiting for the condition
2018/05/31 11:39:10 Acquiring kube lock failed, retrying: timed out waiting for the condition
2018/05/31 11:39:11 Failed to acquire kube lock: context deadline exceeded
SpComb commented 6 years ago

Testing that the top-level Kube.AcquireLock retry works in the case of the kube API being down:

2018/05/31 11:41:00 Schedule run started, deadline at 2018-05-31 11:41:10.000908342 +0000 UTC m=+86.407412249
2018/05/31 11:41:00 Acquiring kube lock...
2018/05/31 11:41:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:00 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:01 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:01 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:03 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:03 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:07 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:07 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:15 Failed to acquire kube lock: context deadline exceeded