spinkube / runtime-class-manager

A Kubernetes operator to manage Runtime Classes
Apache License 2.0

node-installer job does not terminate properly #140

Open voigt opened 3 months ago

voigt commented 3 months ago

As part of #68 I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status Unknown.

Overview:

kubectl get job
NAME                            COMPLETIONS   DURATION   AGE
kwasm-worker-spin-v2-install    1/1           28s        21m
kubectl get po
NAME                                  READY   STATUS      RESTARTS   AGE
kwasm-worker-spin-v2-install-n82d9    0/1     Unknown     0          7m25s
kwasm-worker-spin-v2-install-rq78d    0/1     Completed   0          7m3s

Logs of Pod with status Unknown

kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c provisioner
2024/05/20 20:49:46 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=true
2024/05/20 20:49:46 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:49:46 INFO restarting containerd

Logs of Pod with status Completed

kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do

The Completed pod only gets scheduled because the first one did not terminate successfully, even though the actual work (installing the shim binary and rewriting the containerd config) was already done. As a result, the second run of the job has nothing left to do.

Description of Pod with Status Unknown

    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:48 +0200
kubectl describe po kwasm-worker-spin-v2-install-n82d9

```bash
Name:             kwasm-worker-spin-v2-install-n82d9
Namespace:        default
Priority:         0
Service Account:  default
Node:             kwasm-worker/192.168.228.5
Start Time:       Mon, 20 May 2024 22:49:35 +0200
Labels:           batch.kubernetes.io/controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  batch.kubernetes.io/job-name=kwasm-worker-spin-v2-install
                  controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  job-name=kwasm-worker-spin-v2-install
Annotations:
Status:           Failed
IP:               10.244.2.2
IPs:
  IP:           10.244.2.2
Controlled By:  Job/kwasm-worker-spin-v2-install
Init Containers:
  downloader:
    Container ID:   containerd://7f63983e513efa392e3cc684bf53d2553aeb898b4bfe08fb22229fbae83406cb
    Image:          ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
    Image ID:       ghcr.io/spinkube/shim-downloader@sha256:719f54c518fc0fc65abbe8ac27978ea188d13faee23530544faf9d622aa2be92
    Port:
    Host Port:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 20 May 2024 22:49:40 +0200
      Finished:     Mon, 20 May 2024 22:49:42 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      SHIM_NAME:      spin-v2
      SHIM_LOCATION:  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
    Mounts:
      /assets from shim-download (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Containers:
  provisioner:
    Container ID:  containerd://92dd4c994b2fc95d269b5de630c00f55fff233d04d1d649a6b69ce512936278b
    Image:         ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
    Image ID:      ghcr.io/spinkube/node-installer@sha256:fcbfa4d8197d3de3b9953219af6a8784f23abf7d798150b2c2a606daaeebe6df
    Port:
    Host Port:
    Args:
      install
      -H
      /mnt/node-root
      -r
      spin-v2
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:47 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      HOST_ROOT:            /mnt/node-root
      SHIM_FETCH_STRATEGY:  /mnt/node-root
    Mounts:
      /assets from shim-download (rw)
      /mnt/node-root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  shim-download:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  kube-api-access-wnr2x:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader" in 4.108s (4.108s including waiting)
  Normal  Created  25m   kubelet  Created container downloader
  Normal  Started  25m   kubelet  Started container downloader
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader" in 3.105s (3.105s including waiting)
  Normal  Created  25m   kubelet  Created container provisioner
  Normal  Started  25m   kubelet  Started container provisioner
```
Entire resource of Job (e.g. for recreation of the bug)

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    kwasm.sh/nodeName: kwasm-worker
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  labels:
    kwasm-worker-spin-v2-install: "true"
    kwasm.sh/job: "true"
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  name: kwasm-worker-spin-v2-install
  namespace: default
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        batch.kubernetes.io/job-name: kwasm-worker-spin-v2-install
        controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        job-name: kwasm-worker-spin-v2-install
    spec:
      containers:
      - args:
        - install
        - -H
        - /mnt/node-root
        - -r
        - spin-v2
        env:
        - name: HOST_ROOT
          value: /mnt/node-root
        - name: SHIM_FETCH_STRATEGY
          value: /mnt/node-root
        image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: provisioner
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/node-root
          name: root-mount
        - mountPath: /assets
          name: shim-download
      dnsPolicy: ClusterFirst
      hostPID: true
      initContainers:
      - env:
        - name: SHIM_NAME
          value: spin-v2
        - name: SHIM_LOCATION
          value: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
        image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: downloader
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /assets
          name: shim-download
      nodeName: kwasm-worker
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: shim-download
      - hostPath:
          path: /
          type: ""
        name: root-mount
status:
  completionTime: "2024-05-20T20:50:03Z"
  conditions:
  - lastProbeTime: "2024-05-20T20:50:03Z"
    lastTransitionTime: "2024-05-20T20:50:03Z"
    status: "True"
    type: Complete
  failed: 1
  ready: 0
  startTime: "2024-05-20T20:49:35Z"
  succeeded: 1
  terminating: 0
  uncountedTerminatedPods: {}
```

While the goal of installing/uninstalling the shim is achieved, this is not the desired behavior and calls for a solution.

voigt commented 3 months ago

The install pods of kwasm do not terminate with status Unknown, but with Completed. The main difference is that kwasm's install script uses the restart functionality of the host's service manager.

https://github.com/KWasm/kwasm-node-installer/blob/0ee6ec416f56d35449fbe2f6af072a8643e61686/script/installer.sh#L65

In the case of systemd, this means that containerd receives a SIGTERM and, only if it is still running 90 seconds later, a SIGKILL (source).

In contrast, node-installer sends a SIGHUP directly to the containerd process, which seems to me to be the issue.

https://github.com/spinkube/runtime-class-manager/blob/9dee1c02630217342d4eb25d7f0ebb00c52507b3/internal/containerd/restart_unix.go#L45