rancher / system-upgrade-controller

In your Kubernetes, upgrading your nodes
Apache License 2.0

Automated upgrade not starting #262

Closed: anon-software closed this issue 9 months ago

anon-software commented 9 months ago

Version

$ kubectl exec -n system-upgrade system-upgrade-controller-5876667756-qw65d -- /bin/system-upgrade-controller --version
system-upgrade-controller version v0.13.1 (04a0b9e)

Platform/Architecture

$ echo "$(go env GOOS)-$(go env GOARCH)"
linux-arm64

Describe the bug

I followed the instructions at https://docs.k3s.io/upgrades/automated to set up automated upgrades. For the initial configuration I specified only the control-plane node upgrade plan. However, no action appears to have been taken.

To Reproduce

$ kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
namespace/system-upgrade created
serviceaccount/system-upgrade created
clusterrolebinding.rbac.authorization.k8s.io/system-upgrade created
configmap/default-controller-env created
deployment.apps/system-upgrade-controller created

$ kubectl apply -f control.yml 
plan.upgrade.cattle.io/server-plan created

The content of control.yml can be seen below. The controller and the plan have been loaded at this point, but nothing happened, as you can see from the node versions.

$ kubectl -n system-upgrade get jobs
No resources found in system-upgrade namespace.

$ kubectl -n system-upgrade get plans
NAME          IMAGE                 CHANNEL                                            VERSION
server-plan   rancher/k3s-upgrade   https://update.k3s.io/v1-release/channels/stable

$ kubectl get nodes
NAME            STATUS   ROLES                       AGE    VERSION
turing-node-1   Ready    control-plane,etcd,master   140d   v1.26.4+k3s1
turing-node-2   Ready    <none>                      140d   v1.26.4+k3s1
turing-node-3   Ready    <none>                      140d   v1.26.4+k3s1

Expected behavior

Based on the plan and the current version, I expected to see a job that would upgrade the control-plane node to version 1.27. Or, if there is something wrong with my environment (for example, it has occurred to me that I might need more than one control-plane node to stagger an upgrade), I would expect to see an appropriate message in the log. But there is nothing of interest in it:

$  kubectl get pod -n system-upgrade
NAME                                         READY   STATUS    RESTARTS   AGE
system-upgrade-controller-5876667756-qw65d   1/1     Running   0          21m
$  kubectl logs -n system-upgrade system-upgrade-controller-5876667756-qw65d
W0925 20:58:13.017716       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-25T20:58:13Z" level=info msg="Applying CRD plans.upgrade.cattle.io"
time="2023-09-25T20:58:13Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-09-25T20:58:13Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-09-25T20:58:13Z" level=info msg="Starting batch/v1, Kind=Job controller"
time="2023-09-25T20:58:13Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"

Actual behavior

Nothing happened.

Additional context

I found a similar earlier bug report, https://github.com/rancher/system-upgrade-controller/issues/90, but in that case the node selector in the plan was wrong. Here is what the plan looks like in this case:

$ kubectl -n system-upgrade get plan server-plan -o yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"upgrade.cattle.io/v1","kind":"Plan","metadata":{"annotations":{},"name":"server-plan","namespace":"system-upgrade"},"spec":{"channel":"https://update.k3s.io/v1-release/channels/stable","concurrency":1,"cordon":true,"nodeSelector":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"In","values":["true"]}]},"serviceAccountName":"system-upgrade","upgrade":{"image":"rancher/k3s-upgrade"}}}
  creationTimestamp: "2023-09-25T20:58:58Z"
  generation: 1
  name: server-plan
  namespace: system-upgrade
  resourceVersion: "62801558"
  uid: 592d8afd-2429-4fa6-9848-a615aa6a3043
spec:
  channel: https://update.k3s.io/v1-release/channels/stable
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
status:
  conditions:
  - lastUpdateTime: "2023-09-25T21:06:50Z"
    reason: PlanIsValid
    status: "True"
    type: Validated
  - lastUpdateTime: "2023-09-25T20:58:58Z"
    reason: Channel
    status: "True"
    type: LatestResolved
  latestHash: 0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
  latestVersion: v1.27.6-k3s1

And the selector verification follows:

$ kubectl get node -n system-upgrade -l "node-role.kubernetes.io/control-plane in (true)"
NAME            STATUS   ROLES                       AGE    VERSION
turing-node-1   Ready    control-plane,etcd,master   140d   v1.26.4+k3s1

brandond commented 9 months ago

I can't reproduce this.

brandond@dev01:~$ kubectl get node -o wide
NAME           STATUS  ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION    CONTAINER-RUNTIME
k3s-server-1   Ready   control-plane,master   6m36s   v1.26.4+k3s1   172.17.0.4    <none>        K3s dev    5.19.0-1019-aws   containerd://1.6.19-k3s1

brandond@dev01:~$ kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
namespace/system-upgrade created
serviceaccount/system-upgrade created
clusterrolebinding.rbac.authorization.k8s.io/system-upgrade created
configmap/default-controller-env created
deployment.apps/system-upgrade-controller created

brandond@dev01:~$ kubectl get pod -A
NAMESPACE        NAME                                         READY   STATUS      RESTARTS   AGE
kube-system      local-path-provisioner-76d776f6f9-fhdmj      1/1     Running     0          26s
kube-system      coredns-59b4f5bbd5-p4vwz                     1/1     Running     0          26s
kube-system      svclb-traefik-6c3a9382-qfzfc                 2/2     Running     0          19s
kube-system      helm-install-traefik-crd-4q2kw               0/1     Completed   0          27s
kube-system      helm-install-traefik-rsx42                   0/1     Completed   1          27s
kube-system      traefik-56b8c5fb5c-7sdkv                     1/1     Running     0          19s
kube-system      metrics-server-7b67f64457-5685l              1/1     Running     0          26s
system-upgrade   system-upgrade-controller-5876667756-2ppqw   1/1     Running     0          8s

brandond@dev01:~$ kubectl apply -f -
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  channel: https://update.k3s.io/v1-release/channels/stable
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade

plan.upgrade.cattle.io/server-plan created

brandond@dev01:~/go/src/github.com/k3s-io/k3s$ kubectl get job -n system-upgrade
NAME                                                              COMPLETIONS   DURATION   AGE
apply-server-plan-on-k3s-server-1-with-0e4e3f4e3f8b1e811d-f6e12   0/1           48s        48s

Do you perhaps have something else deployed to your cluster that's blocking creation of the upgrade job? Have you tried increasing the verbosity of the system-upgrade-controller?
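
(For reference, one way to turn up verbosity here, as a sketch only, assuming the stock manifest's default-controller-env ConfigMap exposes a SYSTEM_UPGRADE_CONTROLLER_DEBUG key:)

$ kubectl -n system-upgrade patch configmap default-controller-env \
    --type merge -p '{"data":{"SYSTEM_UPGRADE_CONTROLLER_DEBUG":"true"}}'
# restart the controller so it picks up the changed environment
$ kubectl -n system-upgrade rollout restart deployment system-upgrade-controller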

anon-software commented 9 months ago

I have a bunch of stuff running, but I do not know whether any of it would interfere with the upgrade. I have turned on debug-level logging, but I still do not see anything relevant there:

W0926 03:06:33.543448       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-26T03:06:33Z" level=info msg="Applying CRD plans.upgrade.cattle.io" func="github.com/rancher/wrangler/pkg/crd.(*Factory).createCRD" file="/go/pkg/mod/github.com/rancher/wrangler@v1.1.1-0.20230425173236-39a4707f0689/pkg/crd/init.go:543"
time="2023-09-26T03:06:33Z" level=debug msg="DesiredSet - Patch apiextensions.k8s.io/v1, Kind=CustomResourceDefinition /plans.upgrade.cattle.io for  plans.upgrade.cattle.io -- [ placeholder-for-stuff-dropped ]" func=github.com/rancher/wrangler/pkg/apply.applyPatch file="/go/pkg/mod/github.com/rancher/wrangler@v1.1.1-0.20230425173236-39a4707f0689/pkg/apply/desiredset_compare.go:210"
time="2023-09-26T03:06:33Z" level=debug msg="DesiredSet - Updated apiextensions.k8s.io/v1, Kind=CustomResourceDefinition /plans.upgrade.cattle.io for  plans.upgrade.cattle.io -- application/merge-patch+json {\"metadata\":{},\"spec\":{\"preserveUnknownFields\":false}}" func=github.com/rancher/wrangler/pkg/apply.applyPatch file="/go/pkg/mod/github.com/rancher/wrangler@v1.1.1-0.20230425173236-39a4707f0689/pkg/apply/desiredset_compare.go:232"
time="2023-09-26T03:06:34Z" level=info msg="Starting /v1, Kind=Node controller" func="github.com/rancher/lasso/pkg/controller.(*controller).run" file="/go/pkg/mod/github.com/rancher/lasso@v0.0.0-20221227210133-6ea88ca2fbcc/pkg/controller/controller.go:144"
time="2023-09-26T03:06:34Z" level=info msg="Starting /v1, Kind=Secret controller" func="github.com/rancher/lasso/pkg/controller.(*controller).run" file="/go/pkg/mod/github.com/rancher/lasso@v0.0.0-20221227210133-6ea88ca2fbcc/pkg/controller/controller.go:144"
time="2023-09-26T03:06:34Z" level=info msg="Starting batch/v1, Kind=Job controller" func="github.com/rancher/lasso/pkg/controller.(*controller).run" file="/go/pkg/mod/github.com/rancher/lasso@v0.0.0-20221227210133-6ea88ca2fbcc/pkg/controller/controller.go:144"
time="2023-09-26T03:06:34Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller" func="github.com/rancher/lasso/pkg/controller.(*controller).run" file="/go/pkg/mod/github.com/rancher/lasso@v0.0.0-20221227210133-6ea88ca2fbcc/pkg/controller/controller.go:144"
time="2023-09-26T03:06:34Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-plan@62920622, status={Conditions:[{Type:Validated Status:True LastUpdateTime:2023-09-26T03:04:13Z LastTransitionTime: Reason:PlanIsValid Message:} {Type:LatestResolved Status:True LastUpdateTime:2023-09-26T02:59:00Z LastTransitionTime: Reason:Channel Message:}] LatestVersion:v1.27.6-k3s1 LatestHash:0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func1" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:30"
time="2023-09-26T03:06:34Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-plan@62921428, status={Conditions:[{Type:Validated Status:True LastUpdateTime:2023-09-26T03:06:34Z LastTransitionTime: Reason:PlanIsValid Message:} {Type:LatestResolved Status:True LastUpdateTime:2023-09-26T02:59:00Z LastTransitionTime: Reason:Channel Message:}] LatestVersion:v1.27.6-k3s1 LatestHash:0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func2" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:78"
time="2023-09-26T03:06:34Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-plan@62921428, status={Conditions:[{Type:Validated Status:True LastUpdateTime:2023-09-26T03:06:34Z LastTransitionTime: Reason:PlanIsValid Message:} {Type:LatestResolved Status:True LastUpdateTime:2023-09-26T02:59:00Z LastTransitionTime: Reason:Channel Message:}] LatestVersion:v1.27.6-k3s1 LatestHash:0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func1" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:30"
time="2023-09-26T03:06:34Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-plan@62921428, status={Conditions:[{Type:Validated Status:True LastUpdateTime:2023-09-26T03:06:34Z LastTransitionTime: Reason:PlanIsValid Message:} {Type:LatestResolved Status:True LastUpdateTime:2023-09-26T02:59:00Z LastTransitionTime: Reason:Channel Message:}] LatestVersion:v1.27.6-k3s1 LatestHash:0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c Applying:[]}" func="github.com/rancher/system-upgrade-controller/pkg/upgrade.(*Controller).handlePlans.func2" file="/go/src/github.com/rancher/system-upgrade-controller/pkg/upgrade/handle_upgrade.go:78"

brandond commented 9 months ago

Can you show the output of kubectl get node turing-node-1 -o yaml ?

It looks like for some reason the node selector isn't finding any nodes to create jobs for... https://github.com/rancher/system-upgrade-controller/blob/04a0b9ef5858657f20949cd022e58ad19de029df/pkg/upgrade/plan/plan.go#L168-L172

anon-software commented 9 months ago

I tested the selector by using it in a kubectl command to filter the nodes; you can see the command in my original post. In any case, here is the requested output:

$  kubectl get node turing-node-1 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    etcd.k3s.cattle.io/node-address: 192.168.2.253
    etcd.k3s.cattle.io/node-name: turing-node-1-15e70d0d
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"f6:51:16:0b:a0:13"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: "true"
    flannel.alpha.coreos.com/public-ip: 192.168.2.253
    k3s.io/hostname: turing-node-1
    k3s.io/internal-ip: 192.168.2.253,2600:1700:38c2:8a10::41
    k3s.io/node-args: '["server","--cluster-init"]'
    k3s.io/node-config-hash: XFJS3VE5KBQGMQZO4QOV2XN233KRIPYZWEZD3YOPYBRFV6NHLY3A====
    k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/4b147cafa965066cd68e04b4e3acce221078156a3b9ba635a653517ce459aa4d"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2023-05-08T04:17:41Z"
  finalizers:
  - wrangler.cattle.io/managed-etcd-controller
  - wrangler.cattle.io/node
  labels:
    beta.kubernetes.io/arch: arm64
    beta.kubernetes.io/instance-type: k3s
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: arm64
    kubernetes.io/hostname: turing-node-1
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: "true"
    node-role.kubernetes.io/etcd: "true"
    node-role.kubernetes.io/master: "true"
    node.kubernetes.io/instance-type: k3s
    plan.upgrade.cattle.io/server-plan: 0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
  name: turing-node-1
  resourceVersion: "63160361"
  uid: df0191c4-ce5d-4755-a5d9-f5a989dfebea
spec:
  podCIDR: 10.42.0.0/24
  podCIDRs:
  - 10.42.0.0/24
  providerID: k3s://turing-node-1
status:
  addresses:
  - address: 192.168.2.253
    type: InternalIP
  - address: 2600:1700:38c2:8a10::41
    type: InternalIP
  - address: turing-node-1
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: "29559886006"
    memory: 7999972Ki
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 30386396Ki
    memory: 7999972Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-09-26T15:04:05Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-09-26T15:04:05Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-09-26T15:04:05Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-09-26T15:04:05Z"
    lastTransitionTime: "2023-09-22T19:10:56Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:0c4475289186eeadf1b987a6a3df7bbc6d3b33bed6bcb1dbc8d6aabfdaf798ed
    - ghcr.io/home-assistant/home-assistant:2023.3.5
    sizeBytes: 453949277
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:0a0ae67f5a3121d50890baf1f07baa687468fe448e635e2c34d2b95faf5086b0
    - ghcr.io/home-assistant/home-assistant:2023.3.1
    sizeBytes: 453770753
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:2c631c99d7078072126e50050b57042ec5548b721f089a87e76dfb24c1071a83
    - ghcr.io/home-assistant/home-assistant:2023.5.4
    sizeBytes: 451027806
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:d38bc4d21453d6e3e4b0af2b62cf86211b28479946e4e895d4434b3f82c4e58a
    - ghcr.io/home-assistant/home-assistant:2022.12.8
    sizeBytes: 446792583
  - names:
    - docker.io/library/nextcloud@sha256:0ab4b64883b3adf121a3076cd9b8a160a224aa1fa81f75cdb7c4bc4fdeaaa803
    - docker.io/library/nextcloud:26.0.1
    sizeBytes: 347153037
  - names:
    - docker.io/library/mariadb@sha256:37e9f7e3cea0096f7fba9d2a77cf0ac926c830e8931d1679da3bcd8fb8989d47
    - docker.io/library/mariadb:10.6
    sizeBytes: 118893377
  - names:
    - docker.io/pihole/pihole@sha256:dcd0885a3fe050da005cb544904444cc098017636d6d495ac8770a9aa523a0ef
    - docker.io/pihole/pihole:2022.05
    sizeBytes: 111326841
  - names:
    - docker.io/rancher/k3s-upgrade@sha256:6c4543ecde336df20a21f88e5e84399f923bdb3f9bbdc7e815cfdbca643ec50a
    - docker.io/rancher/k3s-upgrade:v1.27.6-k3s1
    sizeBytes: 53055042
  - names:
    - docker.io/anonsoftware28/kubernetes-secret-generator@sha256:1d5bfe7b227caf060d0e61488aecdc40e475f8c8640420fbf7ab500333dcfd60
    - docker.io/anonsoftware28/kubernetes-secret-generator:latest
    sizeBytes: 51735974
  - names:
    - docker.io/rancher/mirrored-library-traefik@sha256:0842af6afcdf4305d17e862bad4eaf379d0817c987eedabeaff334e2273459c1
    - docker.io/rancher/mirrored-library-traefik:2.9.4
    sizeBytes: 35650744
  - names:
    - docker.io/rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
    - docker.io/rancher/mirrored-metrics-server:v0.6.2
    sizeBytes: 26205509
  - names:
    - docker.io/dopingus/cert-manager-webhook-dynu@sha256:7958523006f78123305597115cb1ba7f7b448e658549ddb6a089582c4bec8628
    - docker.io/dopingus/cert-manager-webhook-dynu:latest
    sizeBytes: 17882163
  - names:
    - docker.io/dopingus/cert-manager-webhook-dynu@sha256:7618e6678a9f3210ef0ea530a0f58f5932e80aa673729a7ab223a9b24b804cd2
    sizeBytes: 17882150
  - names:
    - registry.k8s.io/sig-storage/nfs-subdir-external-provisioner@sha256:63d5e04551ec8b5aae83b6f35938ca5ddc50a88d85492d9731810c31591fa4c9
    - registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
    sizeBytes: 16673053
  - names:
    - quay.io/jetstack/cert-manager-controller@sha256:cd9bf3d48b6b8402a2a8b11953f9dc0275ba4beec14da47e31823a0515cde7e2
    - quay.io/jetstack/cert-manager-controller:v1.9.1
    sizeBytes: 15265466
  - names:
    - docker.io/rancher/mirrored-coredns-coredns@sha256:a11fafae1f8037cbbd66c5afa40ba2423936b72b4fd50a7034a7e8b955163594
    - docker.io/rancher/mirrored-coredns-coredns:1.10.1
    sizeBytes: 14556850
  - names:
    - docker.io/rancher/local-path-provisioner@sha256:5bb33992a4ec3034c28b5e0b3c4c2ac35d3613b25b79455eb4b1a95adc82cdc0
    - docker.io/rancher/local-path-provisioner:v0.0.24
    sizeBytes: 13884168
  - names:
    - docker.io/rancher/kubectl@sha256:9be095ca0bbc74e8947a1d4a0258875304b590057d858eb9738de000f88a473e
    - docker.io/rancher/kubectl:v1.25.4
    sizeBytes: 13045642
  - names:
    - quay.io/jetstack/cert-manager-webhook@sha256:4ab2982a220e1c719473d52d8463508422ab26e92664732bfc4d96b538af6b8a
    - quay.io/jetstack/cert-manager-webhook:v1.9.1
    sizeBytes: 12244995
  - names:
    - quay.io/jetstack/cert-manager-cainjector@sha256:df7f0b5186ddb84eccb383ed4b10ec8b8e2a52e0e599ec51f98086af5f4b4938
    - quay.io/jetstack/cert-manager-cainjector:v1.9.1
    sizeBytes: 10909067
  - names:
    - docker.io/rancher/system-upgrade-controller@sha256:c730c4ec8dc914b94be13df77d9b58444277330a2bdf39fe667beb5af2b38c0b
    - docker.io/rancher/system-upgrade-controller:v0.13.1
    sizeBytes: 9617607
  - names:
    - docker.io/rancher/klipper-lb@sha256:2b963c02974155f7e9a51c54b91f09099e48b4550689aadb595e62118e045c10
    - docker.io/rancher/klipper-lb:v0.4.3
    sizeBytes: 4163722
  - names:
    - docker.io/rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
    - docker.io/rancher/mirrored-pause:3.6
    sizeBytes: 253243
  nodeInfo:
    architecture: arm64
    bootID: 5bcbbf33-f7d7-4058-a5f4-94a5d26e129c
    containerRuntimeVersion: containerd://1.6.19-k3s1
    kernelVersion: 5.15.32-v8+
    kubeProxyVersion: v1.26.4+k3s1
    kubeletVersion: v1.26.4+k3s1
    machineID: 75a2a6365a604bc389ab0ab7c51c66c6
    operatingSystem: linux
    osImage: Debian GNU/Linux 11 (bullseye)
    systemUUID: 75a2a6365a604bc389ab0ab7c51c66c6

brandond commented 9 months ago

I tested the selector by using it in kubectl command to filter the node, you can see the command in my original post.

Yes, but your provided node selector is merged with the plan hash label selector; I linked the code where that occurs up above.
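
(A rough way to approximate the merged selector with kubectl, as a sketch only; the hash is the latestHash from the plan status above, and the real merge in the linked code also handles a "disabled" value:)

$ kubectl get node -l 'node-role.kubernetes.io/control-plane in (true),plan.upgrade.cattle.io/server-plan notin (0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c)'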

In this case your node has a label on it that indicates this plan has already run successfully on this node: plan.upgrade.cattle.io/server-plan: 0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c

Delete the label and it should run again. You may want to keep a closer eye on it this time; it sounds like the upgrade image ran successfully and the jobs were cleaned up, despite the version on the node not actually having been upgraded.
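
(For reference, removing the plan label with kubectl uses the trailing-dash form, e.g.:)

$ kubectl label node turing-node-1 plan.upgrade.cattle.io/server-plan-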

anon-software commented 9 months ago

Thanks, we are getting somewhere. After I removed the label, the job executed immediately. The bad news is that the node version is still the same, but I could not find any error. Here is where I looked.

$ kubectl get job -n system-upgrade
NAME                                                              COMPLETIONS   DURATION   AGE
apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4   1/1           9s         77s
$ kubectl describe job -n system-upgrade apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4
Name:                     apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4
Namespace:                system-upgrade
Selector:                 controller-uid=6a477e0d-3281-4af3-9470-bcda9218cd78
Labels:                   objectset.rio.cattle.io/hash=d661ea5d7278683dce770ce40b105bf148fce4d9
                          plan.upgrade.cattle.io/server-plan=0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
                          upgrade.cattle.io/controller=system-upgrade-controller
                          upgrade.cattle.io/node=turing-node-1
                          upgrade.cattle.io/plan=server-plan
                          upgrade.cattle.io/version=v1.27.6-k3s1
Annotations:              batch.kubernetes.io/job-tracking: 
                          objectset.rio.cattle.io/applied:
                            H4sIAAAAAAAA/+xXUW/iOBD+Kyc/JzSBlBKke+AKe4t2C6h097RaVZVjT8CHY+dsB4oQ//1kJ9CE0m537+UeqqptHNvjzzPfNzPZoQwMpthg1N8hLIQ02DAptB3K5G8gRoNpKSZbBB...
                          objectset.rio.cattle.io/id: system-upgrade-controller
                          objectset.rio.cattle.io/owner-gvk: upgrade.cattle.io/v1, Kind=Plan
                          objectset.rio.cattle.io/owner-name: server-plan
                          objectset.rio.cattle.io/owner-namespace: system-upgrade
                          upgrade.cattle.io/ttl-seconds-after-finished: 900
Controlled By:            Plan/server-plan
Parallelism:              1
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Tue, 26 Sep 2023 10:36:06 -0700
Completed At:             Tue, 26 Sep 2023 10:36:15 -0700
Duration:                 9s
Active Deadline Seconds:  900s
Pods Statuses:            0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
  Labels:           controller-uid=6a477e0d-3281-4af3-9470-bcda9218cd78
                    job-name=apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4
                    plan.upgrade.cattle.io/server-plan=0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
                    upgrade.cattle.io/controller=system-upgrade-controller
                    upgrade.cattle.io/node=turing-node-1
                    upgrade.cattle.io/plan=server-plan
                    upgrade.cattle.io/version=v1.27.6-k3s1
  Service Account:  system-upgrade
  Init Containers:
   cordon:
    Image:      rancher/kubectl:v1.25.4
    Port:       <none>
    Host Port:  <none>
    Args:
      cordon
      turing-node-1
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:              (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            server-plan
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  v1.27.6-k3s1
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
  Containers:
   upgrade:
    Image:      rancher/k3s-upgrade:v1.27.6-k3s1
    Port:       <none>
    Host Port:  <none>
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:              (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            server-plan
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  v1.27.6-k3s1
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
  Volumes:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
   pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
Events:
  Type    Reason            Age    From            Message
  ----    ------            ----   ----            -------
  Normal  SuccessfulCreate  2m17s  job-controller  Created pod: apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz
  Normal  Completed         2m8s   job-controller  Job completed
$ kubectl describe pod apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz -n system-upgrade
Name:             apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz
Namespace:        system-upgrade
Priority:         0
Service Account:  system-upgrade
Node:             turing-node-1/192.168.2.253
Start Time:       Tue, 26 Sep 2023 10:36:06 -0700
Labels:           controller-uid=6a477e0d-3281-4af3-9470-bcda9218cd78
                  job-name=apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4
                  plan.upgrade.cattle.io/server-plan=0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
                  upgrade.cattle.io/controller=system-upgrade-controller
                  upgrade.cattle.io/node=turing-node-1
                  upgrade.cattle.io/plan=server-plan
                  upgrade.cattle.io/version=v1.27.6-k3s1
Annotations:      <none>
Status:           Succeeded
IP:               192.168.2.253
IPs:
  IP:           192.168.2.253
  IP:           2600:1700:38c2:8a10::41
Controlled By:  Job/apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-0c1d4
Init Containers:
  cordon:
    Container ID:  containerd://086924d34d110d054a173ba7cd23c1a4b59f31bef24fec746f7ded4e4b525c4b
    Image:         rancher/kubectl:v1.25.4
    Image ID:      docker.io/rancher/kubectl@sha256:9be095ca0bbc74e8947a1d4a0258875304b590057d858eb9738de000f88a473e
    Port:          <none>
    Host Port:     <none>
    Args:
      cordon
      turing-node-1
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 26 Sep 2023 10:36:08 -0700
      Finished:     Tue, 26 Sep 2023 10:36:08 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:             apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            server-plan
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  v1.27.6-k3s1
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cr9rc (ro)
Containers:
  upgrade:
    Container ID:   containerd://7ed552599987187c5b0eee160cabcfd11f1faac15e4a32a5cfbca8711f5ccb7f
    Image:          rancher/k3s-upgrade:v1.27.6-k3s1
    Image ID:       docker.io/rancher/k3s-upgrade@sha256:6c4543ecde336df20a21f88e5e84399f923bdb3f9bbdc7e815cfdbca643ec50a
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 26 Sep 2023 10:36:10 -0700
      Finished:     Tue, 26 Sep 2023 10:36:12 -0700
    Ready:          False
    Restart Count:  0
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:             apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            server-plan
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  v1.27.6-k3s1
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cr9rc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
  pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  kube-api-access-cr9rc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m27s  default-scheduler  Successfully assigned system-upgrade/apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz to turing-node-1
  Normal  Pulling    5m27s  kubelet            Pulling image "rancher/kubectl:v1.25.4"
  Normal  Pulled     5m27s  kubelet            Successfully pulled image "rancher/kubectl:v1.25.4" in 749.280135ms (749.302747ms including waiting)
  Normal  Created    5m27s  kubelet            Created container cordon
  Normal  Started    5m26s  kubelet            Started container cordon
  Normal  Pulling    5m25s  kubelet            Pulling image "rancher/k3s-upgrade:v1.27.6-k3s1"
  Normal  Pulled     5m24s  kubelet            Successfully pulled image "rancher/k3s-upgrade:v1.27.6-k3s1" in 657.589334ms (657.610557ms including waiting)
  Normal  Created    5m24s  kubelet            Created container upgrade
  Normal  Started    5m24s  kubelet            Started container upgrade
$ kubectl logs apply-server-plan-on-turing-node-1-with-0e4e3f4e3f8b1e811-48ttz -n system-upgrade
Defaulted container "upgrade" out of: upgrade, cordon (init)
+ upgrade
+ get_k3s_process_info
+ ps -ef
+ grep -E -v '(init|grep|channelserver|supervise-daemon)'
+ grep -E '( |/)k3s .*(server|agent)'
+ awk '{print $2}'
+ K3S_PID=18680
+ echo 18680
+ wc -w
+ '[' 1 '!=' 1 ]
+ '[' -z 18680 ]
+ echo 18680
+ wc -w
+ '[' 1 '!=' 1 ]
+ ps -p 18680 -o 'ppid='
+ awk '{print $1}'
+ K3S_PPID=1
+ info 'K3S binary is running with pid 18680, parent pid 1'
+ echo '[INFO] ' 'K3S binary is running with pid 18680, parent pid 1'
+ '[' 1 '!=' 1 ]
+ '[' 18680 '=' 1 ]
[INFO]  K3S binary is running with pid 18680, parent pid 1
+ awk 'NR==1 {print $1}' /host/proc/18680/cmdline
+ K3S_BIN_PATH=/usr/local/bin/k3s
+ '[' -z /usr/local/bin/k3s ]
+ '[' '!' -e /host/usr/local/bin/k3s ]
+ return
+ replace_binary
+ NEW_BINARY=/opt/k3s
+ FULL_BIN_PATH=/host/usr/local/bin/k3s
+ '[' '!' -f /opt/k3s ]
[INFO]  Comparing old and new binaries
+ info 'Comparing old and new binaries'
+ echo '[INFO] ' 'Comparing old and new binaries'
+ sha256sum /opt/k3s /host/usr/local/bin/k3s
+ BIN_CHECKSUMS='04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /opt/k3s
04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /host/usr/local/bin/k3s'
+ '[' 0 '!=' 0 ]
+ echo '04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /opt/k3s
04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /host/usr/local/bin/k3s'
+ awk '{print $1}'
+ uniq
+ wc -l
+ BIN_COUNT=1
+ '[' 1 '=' 1 ]
+ info 'Binary already been replaced'
+ echo '[INFO] ' 'Binary already been replaced'
+ exit 0
[INFO]  Binary already been replaced
$ kubectl get node turing-node-1 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    etcd.k3s.cattle.io/node-address: 192.168.2.253
    etcd.k3s.cattle.io/node-name: turing-node-1-15e70d0d
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"f6:51:16:0b:a0:13"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: "true"
    flannel.alpha.coreos.com/public-ip: 192.168.2.253
    k3s.io/hostname: turing-node-1
    k3s.io/internal-ip: 192.168.2.253,2600:1700:38c2:8a10::41
    k3s.io/node-args: '["server","--cluster-init"]'
    k3s.io/node-config-hash: XFJS3VE5KBQGMQZO4QOV2XN233KRIPYZWEZD3YOPYBRFV6NHLY3A====
    k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/4b147cafa965066cd68e04b4e3acce221078156a3b9ba635a653517ce459aa4d"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2023-05-08T04:17:41Z"
  finalizers:
  - wrangler.cattle.io/managed-etcd-controller
  - wrangler.cattle.io/node
  labels:
    beta.kubernetes.io/arch: arm64
    beta.kubernetes.io/instance-type: k3s
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: arm64
    kubernetes.io/hostname: turing-node-1
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: "true"
    node-role.kubernetes.io/etcd: "true"
    node-role.kubernetes.io/master: "true"
    node.kubernetes.io/instance-type: k3s
    plan.upgrade.cattle.io/server-plan: 0e4e3f4e3f8b1e811d841099cb49e4712b93833bee0604115b9a141c
  name: turing-node-1
  resourceVersion: "63216486"
  uid: df0191c4-ce5d-4755-a5d9-f5a989dfebea
spec:
  podCIDR: 10.42.0.0/24
  podCIDRs:
  - 10.42.0.0/24
  providerID: k3s://turing-node-1
status:
  addresses:
  - address: 192.168.2.253
    type: InternalIP
  - address: 2600:1700:38c2:8a10::41
    type: InternalIP
  - address: turing-node-1
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: "29559886006"
    memory: 7999972Ki
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 30386396Ki
    memory: 7999972Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-09-26T17:52:31Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-09-26T17:52:31Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-09-26T17:52:31Z"
    lastTransitionTime: "2023-05-14T17:32:37Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-09-26T17:52:31Z"
    lastTransitionTime: "2023-09-22T19:10:56Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:0c4475289186eeadf1b987a6a3df7bbc6d3b33bed6bcb1dbc8d6aabfdaf798ed
    - ghcr.io/home-assistant/home-assistant:2023.3.5
    sizeBytes: 453949277
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:0a0ae67f5a3121d50890baf1f07baa687468fe448e635e2c34d2b95faf5086b0
    - ghcr.io/home-assistant/home-assistant:2023.3.1
    sizeBytes: 453770753
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:2c631c99d7078072126e50050b57042ec5548b721f089a87e76dfb24c1071a83
    - ghcr.io/home-assistant/home-assistant:2023.5.4
    sizeBytes: 451027806
  - names:
    - ghcr.io/home-assistant/home-assistant@sha256:d38bc4d21453d6e3e4b0af2b62cf86211b28479946e4e895d4434b3f82c4e58a
    - ghcr.io/home-assistant/home-assistant:2022.12.8
    sizeBytes: 446792583
  - names:
    - docker.io/library/nextcloud@sha256:0ab4b64883b3adf121a3076cd9b8a160a224aa1fa81f75cdb7c4bc4fdeaaa803
    - docker.io/library/nextcloud:26.0.1
    sizeBytes: 347153037
  - names:
    - docker.io/library/mariadb@sha256:37e9f7e3cea0096f7fba9d2a77cf0ac926c830e8931d1679da3bcd8fb8989d47
    - docker.io/library/mariadb:10.6
    sizeBytes: 118893377
  - names:
    - docker.io/pihole/pihole@sha256:dcd0885a3fe050da005cb544904444cc098017636d6d495ac8770a9aa523a0ef
    - docker.io/pihole/pihole:2022.05
    sizeBytes: 111326841
  - names:
    - docker.io/rancher/k3s-upgrade@sha256:6c4543ecde336df20a21f88e5e84399f923bdb3f9bbdc7e815cfdbca643ec50a
    - docker.io/rancher/k3s-upgrade:v1.27.6-k3s1
    sizeBytes: 53055042
  - names:
    - docker.io/anonsoftware28/kubernetes-secret-generator@sha256:1d5bfe7b227caf060d0e61488aecdc40e475f8c8640420fbf7ab500333dcfd60
    - docker.io/anonsoftware28/kubernetes-secret-generator:latest
    sizeBytes: 51735974
  - names:
    - docker.io/rancher/mirrored-library-traefik@sha256:0842af6afcdf4305d17e862bad4eaf379d0817c987eedabeaff334e2273459c1
    - docker.io/rancher/mirrored-library-traefik:2.9.4
    sizeBytes: 35650744
  - names:
    - docker.io/rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
    - docker.io/rancher/mirrored-metrics-server:v0.6.2
    sizeBytes: 26205509
  - names:
    - docker.io/dopingus/cert-manager-webhook-dynu@sha256:7958523006f78123305597115cb1ba7f7b448e658549ddb6a089582c4bec8628
    - docker.io/dopingus/cert-manager-webhook-dynu:latest
    sizeBytes: 17882163
  - names:
    - docker.io/dopingus/cert-manager-webhook-dynu@sha256:7618e6678a9f3210ef0ea530a0f58f5932e80aa673729a7ab223a9b24b804cd2
    sizeBytes: 17882150
  - names:
    - registry.k8s.io/sig-storage/nfs-subdir-external-provisioner@sha256:63d5e04551ec8b5aae83b6f35938ca5ddc50a88d85492d9731810c31591fa4c9
    - registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
    sizeBytes: 16673053
  - names:
    - quay.io/jetstack/cert-manager-controller@sha256:cd9bf3d48b6b8402a2a8b11953f9dc0275ba4beec14da47e31823a0515cde7e2
    - quay.io/jetstack/cert-manager-controller:v1.9.1
    sizeBytes: 15265466
  - names:
    - docker.io/rancher/mirrored-coredns-coredns@sha256:a11fafae1f8037cbbd66c5afa40ba2423936b72b4fd50a7034a7e8b955163594
    - docker.io/rancher/mirrored-coredns-coredns:1.10.1
    sizeBytes: 14556850
  - names:
    - docker.io/rancher/local-path-provisioner@sha256:5bb33992a4ec3034c28b5e0b3c4c2ac35d3613b25b79455eb4b1a95adc82cdc0
    - docker.io/rancher/local-path-provisioner:v0.0.24
    sizeBytes: 13884168
  - names:
    - docker.io/rancher/kubectl@sha256:9be095ca0bbc74e8947a1d4a0258875304b590057d858eb9738de000f88a473e
    - docker.io/rancher/kubectl:v1.25.4
    sizeBytes: 13045642
  - names:
    - quay.io/jetstack/cert-manager-webhook@sha256:4ab2982a220e1c719473d52d8463508422ab26e92664732bfc4d96b538af6b8a
    - quay.io/jetstack/cert-manager-webhook:v1.9.1
    sizeBytes: 12244995
  - names:
    - quay.io/jetstack/cert-manager-cainjector@sha256:df7f0b5186ddb84eccb383ed4b10ec8b8e2a52e0e599ec51f98086af5f4b4938
    - quay.io/jetstack/cert-manager-cainjector:v1.9.1
    sizeBytes: 10909067
  - names:
    - docker.io/rancher/system-upgrade-controller@sha256:c730c4ec8dc914b94be13df77d9b58444277330a2bdf39fe667beb5af2b38c0b
    - docker.io/rancher/system-upgrade-controller:v0.13.1
    sizeBytes: 9617607
  - names:
    - docker.io/rancher/klipper-lb@sha256:2b963c02974155f7e9a51c54b91f09099e48b4550689aadb595e62118e045c10
    - docker.io/rancher/klipper-lb:v0.4.3
    sizeBytes: 4163722
  - names:
    - docker.io/rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
    - docker.io/rancher/mirrored-pause:3.6
    sizeBytes: 253243
  nodeInfo:
    architecture: arm64
    bootID: 5bcbbf33-f7d7-4058-a5f4-94a5d26e129c
    containerRuntimeVersion: containerd://1.6.19-k3s1
    kernelVersion: 5.15.32-v8+
    kubeProxyVersion: v1.26.4+k3s1
    kubeletVersion: v1.26.4+k3s1
    machineID: 75a2a6365a604bc389ab0ab7c51c66c6
    operatingSystem: linux
    osImage: Debian GNU/Linux 11 (bullseye)
    systemUUID: 75a2a6365a604bc389ab0ab7c51c66c6

brandond commented 9 months ago

+ sha256sum /opt/k3s /host/usr/local/bin/k3s
+ BIN_CHECKSUMS='04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /opt/k3s
04be543be1c9fbdda30722c5d169099a6972459ea1b1e5df701c42ef54a11f44  /host/usr/local/bin/k3s'

The binary has already been replaced - the checksums match. The upgrade image just checks to see that the binaries have been replaced; it doesn't actually look at what version is currently running.
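
(A quick way to compare what is on disk with what is actually running, as a sketch using paths and fields already shown in this thread; the first command is run on the node itself:)

$ /usr/local/bin/k3s --version
$ kubectl get node turing-node-1 -o jsonpath='{.status.nodeInfo.kubeletVersion}'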

I suspect it ran into some sort of problem killing the k3s process to trigger a restart of the service into the new version. Without logs from the original successful upgrade, it's impossible to say why. You might target a different version or channel with your plan, and check the upgrade pod logs afterwards.
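
(If retrying, the upgrade pod logs can be fetched by the labels visible on the job above, as a sketch; -c upgrade selects the main container rather than the cordon init container:)

$ kubectl logs -n system-upgrade -l upgrade.cattle.io/plan=server-plan -c upgrade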

The upgrade pods hang around for quite a while after they run, even when successful. How long did you wait after applying the plan, before you went looking to see if it'd actually upgraded or not?

anon-software commented 9 months ago

It must have been at least an hour, but I do not remember exactly. I also had a plan for the worker nodes, just as the linked instructions advise, and noticed that its job was stuck. I assumed it was waiting for the control-plane node upgrade to finish, which led me to clean up everything, start over with only the control-plane node upgrade plan, and then post this question.

Anyway, after your explanation I rebooted the cluster, and the master node now shows the correct version. I shall now retry the agent upgrade plan.
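
(For completeness, the agent plan in the linked k3s docs looks roughly like the sketch below; the prepare step is what makes agent nodes wait for server-plan to finish first. Field values follow the docs and may differ from the actual file used here:)

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  channel: https://update.k3s.io/v1-release/channels/stable
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade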