node-installer job does not terminate properly #140

Open voigt opened 3 months ago

voigt commented 3 months ago

As part of #68, I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status Unknown:


kubectl get job
NAME                            COMPLETIONS   DURATION   AGE
kwasm-worker-spin-v2-install    1/1           28s        21m
kubectl get po
NAME                                  READY   STATUS      RESTARTS   AGE
kwasm-worker-spin-v2-install-n82d9    0/1     Unknown     0          7m25s
kwasm-worker-spin-v2-install-rq78d    0/1     Completed   0          7m3s

Logs of Pod with status Unknown

kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-n82d9 -c provisioner
2024/05/20 20:49:46 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=true
2024/05/20 20:49:46 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:49:46 INFO restarting containerd

Logs of Pod with status Completed

kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57     INFO    start downloading shim from  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59     INFO    download successful:
total 40M
drwxrwxrwx    1 root     root          46 May 20 20:49 .
drwxr-xr-x    1 root     root          48 May 20 20:49 ..
-rwxr-xr-x    1 1001     127        39.6M May  8 17:13 containerd-shim-spin-v2
kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do

The Completed pod only gets scheduled in the first place because the first one did not terminate successfully, even though the actual job (rewriting the containerd config and removing the binary) was done. As a result, the second run of the job has nothing left to do.
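One way to confirm that the second pod is a retry is to read the Job's pod counters. The following is a minimal sketch; the helper name `job_pod_counters` is hypothetical, and it only wraps a plain `kubectl get job ... -o jsonpath` query.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of node-installer): print the Job's
# "<failed> <succeeded>" pod counters as reported in .status.
job_pod_counters() {
  local job="$1"
  # A counter that is not set in the Job status prints as an empty string.
  kubectl get job "$job" -o jsonpath='{.status.failed} {.status.succeeded}'
}

# Usage:
#   job_pod_counters kwasm-worker-spin-v2-install
```

For the Job in this issue, the status reports `failed: 1` and `succeeded: 1`, which matches one Unknown pod followed by one Completed pod.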

Description of Pod with Status Unknown

    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:48 +0200
kubectl describe po kwasm-worker-spin-v2-install-n82d9
```bash
Name:             kwasm-worker-spin-v2-install-n82d9
Namespace:        default
Priority:         0
Service Account:  default
Node:             kwasm-worker/
Start Time:       Mon, 20 May 2024 22:49:35 +0200
Labels:           batch.kubernetes.io/controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  batch.kubernetes.io/job-name=kwasm-worker-spin-v2-install
                  controller-uid=7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
                  job-name=kwasm-worker-spin-v2-install
Annotations:
Status:           Failed
IP:
IPs:
  IP:
Controlled By:    Job/kwasm-worker-spin-v2-install
Init Containers:
  downloader:
    Container ID:   containerd://7f63983e513efa392e3cc684bf53d2553aeb898b4bfe08fb22229fbae83406cb
    Image:          ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
    Image ID:       ghcr.io/spinkube/shim-downloader@sha256:719f54c518fc0fc65abbe8ac27978ea188d13faee23530544faf9d622aa2be92
    Port:
    Host Port:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 20 May 2024 22:49:40 +0200
      Finished:     Mon, 20 May 2024 22:49:42 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      SHIM_NAME:      spin-v2
      SHIM_LOCATION:  https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
    Mounts:
      /assets from shim-download (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Containers:
  provisioner:
    Container ID:   containerd://92dd4c994b2fc95d269b5de630c00f55fff233d04d1d649a6b69ce512936278b
    Image:          ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
    Image ID:       ghcr.io/spinkube/node-installer@sha256:fcbfa4d8197d3de3b9953219af6a8784f23abf7d798150b2c2a606daaeebe6df
    Port:
    Host Port:
    Args:
      install
      -H
      /mnt/node-root
      -r
      spin-v2
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 20 May 2024 22:49:46 +0200
      Finished:     Mon, 20 May 2024 22:49:47 +0200
    Ready:          False
    Restart Count:  0
    Environment:
      HOST_ROOT:            /mnt/node-root
      SHIM_FETCH_STRATEGY:  /mnt/node-root
    Mounts:
      /assets from shim-download (rw)
      /mnt/node-root from root-mount (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr2x (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  shim-download:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  kube-api-access-wnr2x:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader" in 4.108s (4.108s including waiting)
  Normal  Created  25m   kubelet  Created container downloader
  Normal  Started  25m   kubelet  Started container downloader
  Normal  Pulling  25m   kubelet  Pulling image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader"
  Normal  Pulled   25m   kubelet  Successfully pulled image "ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader" in 3.105s (3.105s including waiting)
  Normal  Created  25m   kubelet  Created container provisioner
  Normal  Started  25m   kubelet  Started container provisioner
```
Entire resource of Job (e.g. for recreation of the bug)
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    kwasm.sh/nodeName: kwasm-worker
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  labels:
    kwasm-worker-spin-v2-install: "true"
    kwasm.sh/job: "true"
    kwasm.sh/operation: install
    kwasm.sh/shimName: spin-v2
  name: kwasm-worker-spin-v2-install
  namespace: default
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        batch.kubernetes.io/job-name: kwasm-worker-spin-v2-install
        controller-uid: 7878f58f-1b99-4e81-99f1-7bd5b7bf54ac
        job-name: kwasm-worker-spin-v2-install
    spec:
      containers:
      - args:
        - install
        - -H
        - /mnt/node-root
        - -r
        - spin-v2
        env:
        - name: HOST_ROOT
          value: /mnt/node-root
        - name: SHIM_FETCH_STRATEGY
          value: /mnt/node-root
        image: ghcr.io/spinkube/node-installer:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: provisioner
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/node-root
          name: root-mount
        - mountPath: /assets
          name: shim-download
      dnsPolicy: ClusterFirst
      hostPID: true
      initContainers:
      - env:
        - name: SHIM_NAME
          value: spin-v2
        - name: SHIM_LOCATION
          value: https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz
        image: ghcr.io/spinkube/shim-downloader:latest-feat-add_shim_downloader
        imagePullPolicy: IfNotPresent
        name: downloader
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /assets
          name: shim-download
      nodeName: kwasm-worker
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: shim-download
      - hostPath:
          path: /
          type: ""
        name: root-mount
status:
  completionTime: "2024-05-20T20:50:03Z"
  conditions:
  - lastProbeTime: "2024-05-20T20:50:03Z"
    lastTransitionTime: "2024-05-20T20:50:03Z"
    status: "True"
    type: Complete
  failed: 1
  ready: 0
  startTime: "2024-05-20T20:49:35Z"
  succeeded: 1
  terminating: 0
  uncountedTerminatedPods: {}
```

While the goal of installing/uninstalling the shim is achieved, this is not the desired behavior and calls for a solution.

voigt commented 3 months ago

The install pods of kwasm do not terminate with status Unknown, but with Completed. The main difference is that kwasm's install script uses the restart functionality of the node's init system.


In the case of systemd, this means that containerd receives a SIGTERM and, only after 90 seconds, a SIGKILL (source).
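The stop sequence described above (SIGTERM first, SIGKILL only after a grace period) can be sketched in a few lines of shell. This is an illustration of the pattern, not systemd's implementation; the function name `stop_gracefully` and the one-second polling interval are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "TERM, wait, then KILL" stop sequence.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited within the grace period, 1 if it had to be killed.
stop_gracefully() {
  local pid="$1"
  local timeout="${2:-90}"   # default mirrors the 90 seconds mentioned above

  # Ask the process to shut down cleanly; if it is already gone, we are done.
  kill -TERM "$pid" 2>/dev/null || return 0

  # Poll once per second until the process exits or the grace period elapses.
  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      # Grace period is over: force termination.
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

The grace period is what gives a daemon time to shut down cleanly before it is forced to exit.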

node-installer directly sends a SIGHUP to the containerd process, which seems to me to be the issue.
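To try the other approach, the restart could go through the host's service manager instead of signalling the process directly. The sketch below is an assumption-laden illustration, not the kwasm script or the node-installer code: it assumes the host runs systemd, that the unit is named `containerd` (overridable through the hypothetical `CONTAINERD_UNIT` variable), and that the pod runs privileged with `hostPID: true` (as in the Job above), so that PID 1 is the host's init process. `DRY_RUN` is likewise a hypothetical switch for printing the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: restart containerd via the host's systemd
# from inside a privileged pod that shares the host PID namespace.
restart_containerd() {
  # Enter the mount namespace of PID 1 (the host's init when hostPID is true)
  # so that the host's own systemctl binary and systemd socket are used.
  local cmd=(nsenter --target 1 --mount -- systemctl restart "${CONTAINERD_UNIT:-containerd}")

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show what would be executed.
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage:
#   DRY_RUN=1 restart_containerd
```

Whether this avoids the Unknown pod status would still need to be verified on a test node; distributions that bundle containerd under a different unit name would need a different `CONTAINERD_UNIT`.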
