kontena / pharos-host-upgrades

Kube DaemonSet for host OS upgrades
Apache License 2.0

Basic --reboot support #14

Closed SpComb closed 6 years ago

SpComb commented 6 years ago

Support --reboot to reboot the host immediately after upgrading if required. The reboot happens with the lock held, waiting for the pod to come back up before releasing the lock and allowing other nodes to continue.

This PR does not yet cover draining the kube node.

The reboot process works by:

  1. Commanding a reboot via the systemd-logind dbus API (no error return...?)
  2. Leaving the kube lock acquired, and expecting the pod to get terminated before the next schedule run
  3. Releasing the kube lock when starting up, if it was acquired by the same node
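The lock handling in steps 2 and 3 can be sketched as a small in-memory model. This is only an illustration, not the PR's implementation: the annotation key is the one that appears in the tool's log output, while `acquire`, `releaseIfHeldBy`, the node names, and the plain map standing in for the DaemonSet annotations are hypothetical.

```go
package main

import "fmt"

// lockAnnotation is the annotation key that appears in the tool's log output.
const lockAnnotation = "pharos-host-upgrades.kontena.io/lock"

// acquire sets the lock annotation to node if the lock is free or already
// held by node. It reports whether node holds the lock afterwards.
// (Hypothetical helper; a real implementation would update the DaemonSet
// object through the Kubernetes API.)
func acquire(annotations map[string]string, node string) bool {
	holder, held := annotations[lockAnnotation]
	if held && holder != node {
		return false
	}
	annotations[lockAnnotation] = node
	return true
}

// releaseIfHeldBy clears the lock only when it is held by node, which is
// what step 3 does at startup. It reports whether a release happened.
func releaseIfHeldBy(annotations map[string]string, node string) bool {
	holder, held := annotations[lockAnnotation]
	if !held || holder != node {
		return false
	}
	delete(annotations, lockAnnotation)
	return true
}

func main() {
	annotations := map[string]string{}

	// node-a upgrades and reboots with the lock held (step 2).
	fmt.Println("node-a acquires:", acquire(annotations, "node-a"))
	// Other nodes cannot proceed while node-a is rebooting.
	fmt.Println("node-b acquires:", acquire(annotations, "node-b"))
	// A startup on another node must not clear node-a's lock.
	fmt.Println("node-b releases:", releaseIfHeldBy(annotations, "node-b"))
	// node-a comes back up and releases its own lock (step 3).
	fmt.Println("node-a releases:", releaseIfHeldBy(annotations, "node-a"))
	fmt.Println("node-b acquires:", acquire(annotations, "node-b"))
}
```

Keying the release on the node name is what lets the lock survive the reboot: only the pod that comes back up on the same node clears it.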

TODO

SpComb commented 6 years ago

The shutdown process has some weaknesses. There are no ordering dependencies between the docker.service and kubelet.service units, so the shutdown ordering is not deterministic. Testing with KillMode=none on kubelet.service, which widens the race window by leaving the kubelet running, it seems like the kubelet will attempt to restart the docker containers that get terminated during the shutdown. This is bad for step 3, because the restarted pod would end up releasing the kube lock before the shutdown is complete and the host has rebooted:

May 25 12:10:04 ubuntu-xenial systemd-logind[1084]: System is rebooting.
...
May 25 12:10:04 ubuntu-xenial systemd[1]: Stopping Docker Application Container Engine...
...
May 25 12:10:04 ubuntu-xenial systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
...
May 25 12:10:05 ubuntu-xenial kubelet[1068]: E0525 12:10:05.196431    1068 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=733, ErrCode=NO_ERROR, debug=""
May 25 12:10:05 ubuntu-xenial kubelet[1068]: E0525 12:10:05.196749    1068 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=733, ErrCode=NO_ERROR, debug=""
...

May 25 12:10:15 ubuntu-xenial dockerd[1086]: time="2018-05-25T12:10:15.438400059Z" level=error msg="Handler for POST /v1.31/containers/fa3570ef03f3666cd944223ff27ebce10b6b28d8d212daeb2e0a3150e1aa1797/start returned error: failed to update store for object type *libnetwork.endpoint: open : no such file or directory"
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.440223    1068 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-dns-86f4d74b45-n5s96": Error response from daemon: failed to update store for object type *libnetwork.endpoint: open : no such file or directory
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.440377    1068 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "kube-dns-86f4d74b45-n5s96_kube-system(22dc5a77-5ea0-11e8-9792-02a3be24b14f)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-dns-86f4d74b45-n5s96": Error response from daemon: failed to update store for object type *libnetwork.endpoint: open : no such file
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.440437    1068 kuberuntime_manager.go:646] createPodSandbox for pod "kube-dns-86f4d74b45-n5s96_kube-system(22dc5a77-5ea0-11e8-9792-02a3be24b14f)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-dns-86f4d74b45-n5s96": Error response from daemon: failed to update store for object type *libnetwork.endpoint: open : no such fil
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.440700    1068 pod_workers.go:186] Error syncing pod 22dc5a77-5ea0-11e8-9792-02a3be24b14f ("kube-dns-86f4d74b45-n5s96_kube-system(22dc5a77-5ea0-11e8-9792-02a3be24b14f)"), skipping: failed to "CreatePodSandbox" for "kube-dns-86f4d74b45-n5s96_kube-system(22dc5a77-5ea0-11e8-9792-02a3be24b14f)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-86f4d74b
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.740342    1068 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Get https://10.0.2.15:6443/api/v1/nodes?fieldSelector=metadata.name%3Dubuntu-xenial&limit=500&resourceVersion=0: dial tcp 10.0.2.15:6443: getsockopt: connection refused
May 25 12:10:15 ubuntu-xenial kubelet[1068]: E0525 12:10:15.941682    1068 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Get https://10.0.2.15:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.0.2.15:6443: getsockopt: connection refused
May 25 12:10:16 ubuntu-xenial kubelet[1068]: E0525 12:10:16.142792    1068 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.2.15:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dubuntu-xenial&limit=500&resourceVersion=0: dial tcp 10.0.2.15:6443: getsockopt: connection refused
May 25 12:10:16 ubuntu-xenial systemd[1]: Stopped Docker Application Container Engine.
...
May 25 12:10:16 ubuntu-xenial kubelet[1068]: E0525 12:10:16.397977    1068 docker_sandbox.go:236] Failed to stop sandbox "0661cd1f5395774a20b2213b85157e90c98a47fc173eadd69177be9faee33146": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
...
SpComb commented 6 years ago

Interesting... it looks like I actually got a pod restart during shutdown, with kubelet.service configured using KillMode=none to maximize the non-determinism. I wouldn't expect this to ever happen in practice, though:

vagrant@ubuntu-xenial:~$ kubectl -n kube-system logs -f host-upgrades-sn2vv -p -p
2018/05/25 13:15:07 Load config from --config-path=/etc/host-upgrades
2018/05/25 13:15:07 Copying configs to --host-mount=/run/host-upgrades
2018/05/25 13:15:07 hosts/ubuntu probe failed: hostname1.GetProperties: Refusing activation, D-Bus is shutting down.
2018/05/25 13:15:07 hosts/centos probe failed: hostname1.New: read unix @->/var/run/dbus/system_bus_socket: read: connection reset by peer
2018/05/25 13:15:07 Failed to probe host: No hosts matched
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.135933    1077 kuberuntime_manager.go:757] checking backoff for container "host-upgrades" in pod "host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"
May 25 13:15:07 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/b92f070d1407b12f6d409e5f2799a7ea4b5638c0c332bc32491d11546c681219/shim.sock" debug=false module="containerd/tasks" pid=13664
May 25 13:15:07 ubuntu-xenial dbus[1100]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
May 25 13:15:07 ubuntu-xenial dbus[1100]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.hostname1.service': Refusing activation, D-Bus is shutting down.
May 25 13:15:07 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim reaped" id=b92f070d1407b12f6d409e5f2799a7ea4b5638c0c332bc32491d11546c681219 module="containerd/tasks"
...
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.718444    1077 kuberuntime_manager.go:513] Container {Name:host-upgrades Image:kontena/pharos-host-upgrades:dev Command:[pharos-host-upgrades] Args:[--schedule=0 * * * * --reboot] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:KUBE_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.718810    1077 kuberuntime_manager.go:757] checking backoff for container "host-upgrades" in pod "host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.719478    1077 kuberuntime_manager.go:767] Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)
May 25 13:15:07 ubuntu-xenial kubelet[1077]: E0525 13:15:07.719670    1077 pod_workers.go:186] Error syncing pod 99f951f7-601d-11e8-9e53-02a3be24b14f ("host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"), skipping: failed to "StartContainer" for "host-upgrades" with CrashLoopBackOff: "Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3
May 25 13:15:07 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/cf0ff30db38511c1e61b6b760b37a60067e5f485603cbd09bab24d9e9da1a053/shim.sock" debug=false module="containerd/tasks" pid=13753
May 25 13:15:07 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/a1d832d48cf0a3b7ceb3d34153683db3f5456ee2bbe25baaa5b03006d9c19d4a/shim.sock" debug=false module="containerd/tasks" pid=13769
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.921424    1077 kuberuntime_manager.go:757] checking backoff for container "kube-apiserver" in pod "kube-apiserver-ubuntu-xenial_kube-system(154e3317de22122bd09a6f90e721fb03)"
May 25 13:15:07 ubuntu-xenial kubelet[1077]: I0525 13:15:07.949487    1077 kuberuntime_manager.go:757] checking backoff for container "kube-scheduler" in pod "kube-scheduler-ubuntu-xenial_kube-system(ea66a171667ec4aaf1b274428a42a7cf)"
May 25 13:15:07 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/a833226f07a78a8563e0f551c9caad6c0955b1a5585d0b285d6a20ca18d74355/shim.sock" debug=false module="containerd/tasks" pid=13849
May 25 13:15:08 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:07Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/e276db5728c6d4a2449bfc48ff64bb1dddd315fce9231f99c4ebdbdbf7226512/shim.sock" debug=false module="containerd/tasks" pid=13862
May 25 13:15:08 ubuntu-xenial kubelet[1077]: W0525 13:15:08.624631    1077 docker_sandbox.go:353] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-86f4d74b45-n5s96_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "16447534c8875c62f428eaeaae0c79ba71218481b29a3170487b674eaea20778"
May 25 13:15:08 ubuntu-xenial kubelet[1077]: I0525 13:15:08.985969    1077 kuberuntime_manager.go:513] Container {Name:host-upgrades Image:kontena/pharos-host-upgrades:dev Command:[pharos-host-upgrades] Args:[--schedule=0 * * * * --reboot] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:KUBE_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,
May 25 13:15:08 ubuntu-xenial kubelet[1077]: I0525 13:15:08.986080    1077 kuberuntime_manager.go:757] checking backoff for container "host-upgrades" in pod "host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"
May 25 13:15:08 ubuntu-xenial kubelet[1077]: I0525 13:15:08.986185    1077 kuberuntime_manager.go:767] Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)
May 25 13:15:08 ubuntu-xenial kubelet[1077]: E0525 13:15:08.986221    1077 pod_workers.go:186] Error syncing pod 99f951f7-601d-11e8-9e53-02a3be24b14f ("host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"), skipping: failed to "StartContainer" for "host-upgrades" with CrashLoopBackOff: "Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3
May 25 13:15:09 ubuntu-xenial kubelet[1077]: E0525 13:15:09.938282    1077 remote_runtime.go:278] ContainerStatus "d8b6fe13177421ef5a842d9c6fc16d4caaad4791fff38b49e6893ee362c1d3f9" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: d8b6fe13177421ef5a842d9c6fc16d4caaad4791fff38b49e6893ee362c1d3f9
May 25 13:15:09 ubuntu-xenial kubelet[1077]: I0525 13:15:09.938865    1077 logs.go:49] http: multiple response.WriteHeader calls
May 25 13:15:10 ubuntu-xenial kubelet[1077]: I0525 13:15:10.031243    1077 kuberuntime_manager.go:513] Container {Name:host-upgrades Image:kontena/pharos-host-upgrades:dev Command:[pharos-host-upgrades] Args:[--schedule=0 * * * * --reboot] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:KUBE_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,
May 25 13:15:10 ubuntu-xenial kubelet[1077]: I0525 13:15:10.032228    1077 kuberuntime_manager.go:757] checking backoff for container "host-upgrades" in pod "host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"
May 25 13:15:10 ubuntu-xenial kubelet[1077]: I0525 13:15:10.032503    1077 kuberuntime_manager.go:767] Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)
May 25 13:15:10 ubuntu-xenial kubelet[1077]: E0525 13:15:10.032725    1077 pod_workers.go:186] Error syncing pod 99f951f7-601d-11e8-9e53-02a3be24b14f ("host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3be24b14f)"), skipping: failed to "StartContainer" for "host-upgrades" with CrashLoopBackOff: "Back-off 10s restarting failed container=host-upgrades pod=host-upgrades-sn2vv_kube-system(99f951f7-601d-11e8-9e53-02a3
May 25 13:15:15 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:15.544184426Z" level=info msg="Container f6106ca7d6e5c7667576a9c3484bc58a98da514c76f57b9961e956f3475a6aba failed to exit within 10 seconds of signal 18 - using the force"
May 25 13:15:15 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:15Z" level=info msg="shim reaped" id=f6106ca7d6e5c7667576a9c3484bc58a98da514c76f57b9961e956f3475a6aba module="containerd/tasks"
May 25 13:15:15 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:15.667162654Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 25 13:15:15 ubuntu-xenial systemd[1]: Unmounted /var/lib/docker/overlay2/734a760213c0829d918d73169c452a646b37ac781990e229f3e7be606a4f8281/merged.
May 25 13:15:15 ubuntu-xenial kubelet[1077]: W0525 13:15:15.846821    1077 docker_sandbox.go:353] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-86f4d74b45-n5s96_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "16447534c8875c62f428eaeaae0c79ba71218481b29a3170487b674eaea20778"
May 25 13:15:15 ubuntu-xenial dockerd[1113]: time="2018-05-25T13:15:15.881653192Z" level=info msg="Container 3e1939a5eccdd263ac7873407f4721188746fff5b34793c43c02e318c22a1360 failed to exit within 10 seconds of signal 15 - using the force"
SpComb commented 6 years ago

Faking a reboot with touch /run/reboot-required results in:

...
2018/05/25 14:17:05 Rebooting host...
2018/05/25 14:17:05 hosts/ubuntu reboot...
2018/05/25 14:17:05 Host is shutting down...
2018/05/25 14:17:05 Leaving kube lock held for reboot, waiting for termination...
2018/05/25 14:17:45 Load config from --config-path=/etc/host-upgrades
2018/05/25 14:17:45 Copying configs to --host-mount=/run/host-upgrades
2018/05/25 14:17:45 hosts/ubuntu boot time: 2018-05-25 14:17:21 +0000 UTC
2018/05/25 14:17:45 hosts/ubuntu probe success: hosts.Info{OperatingSystem:"Ubuntu", OperatingSystemRelease:"16.04.4", Kernel:"Linux", KernelRelease:"4.4.0-127-generic", BootTime:time.Time{wall:0x0, ext:63662854641, loc:(*time.Location)(0x180a020)}}
2018/05/25 14:17:45 Probed host: Ubuntu 16.04.4
2018/05/25 14:17:45 hosts/ubuntu: using host path /run/host-upgrades for output files
2018/05/25 14:17:45 hosts/ubuntu: using copied unattended-upgrades.conf at /run/host-upgrades/unattended-upgrades.conf
2018/05/25 14:17:45 hosts/ubuntu: using generated host-upgrades.sh at /run/host-upgrades/host-upgrades.sh
2018/05/25 14:17:45 Using --kube-namespace=kube-system --kube-daemonset=host-upgrades --kube-node=ubuntu-xenial
2018/05/25 14:17:45 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/25 14:17:46 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=ubuntu-xenial: acquired
2018/05/25 14:17:46 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/25 14:17:46 kube/lock kube-system/daemonsets/host-upgrades: release
2018/05/25 14:17:46 kube/lock kube-system/daemonsets/host-upgrades: clear pharos-host-upgrades.kontena.io/lock=ubuntu-xenial
2018/05/25 14:17:46 kube/lock kube-system/daemonsets/host-upgrades: update
2018/05/25 14:17:46 Released kube lock kube-system/daemonsets/host-upgrades (value=ubuntu-xenial)
...