rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Helm Chart upgrade fails after upgrade from rke2 v1.28.8+rke2r1 to v1.29.6+rke2r1 #6568

Closed · shindebshekhar closed this issue 2 months ago

shindebshekhar commented 2 months ago

Environmental Info: RKE2 Version: v1.28.8+rke2r1

:~ # rke2 -v
rke2 version v1.28.8+rke2r1 (42cab2f61939504cb17073e47deaea0b29fe2c1b)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux hostname 5.3.18-150300.59.161-default #1 SMP Thu May 9 06:59:05 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 master and 3 worker nodes

Describe the bug:

We are trying to upgrade rke2 from v1.28.8+rke2r1 (fresh install) to v1.29.6+rke2r1.
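(For reference, a sketch of the per-server manual upgrade step this corresponds to, assuming the install-script method from the RKE2 docs; the actual upgrade method used here is not stated in the issue:)

curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.29.6+rke2r1" INSTALL_RKE2_TYPE="server" sh -
systemctl restart rke2-server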

After upgrading the first master, the rke2 service comes up, but all of the Helm jobs for the system components (Calico, CoreDNS, ingress-nginx, metrics-server, vSphere, snapshot controller) fail. The Helm jobs are retriggered in a continuous loop (presumably trying to upgrade the components above):

kube-system       helm-install-rancher-vsphere-cpi                0/1           2s         2s
kube-system       helm-install-rancher-vsphere-csi                0/1           1s         1s
kube-system       helm-install-rke2-calico                        0/1           1s         1s
kube-system       helm-install-rke2-calico-crd                    0/1           1s         1s
kube-system       helm-install-rke2-coredns                       0/1           1s         1s
kube-system       helm-install-rke2-ingress-nginx                 0/1           1s         1s
kube-system       helm-install-rke2-metrics-server                0/1           0s         0s
kube-system       helm-install-rke2-snapshot-controller           0/1           10s        10s
kube-system       helm-install-rke2-snapshot-controller-crd       0/1           0s         0s
kube-system       helm-install-rke2-snapshot-validation-webhook   0/1           10s        10s

On further investigation, we found that the Tigera operator logs the following error:

{"level":"error","ts":"2024-08-14T15:30:00Z","msg":"
Could not wait for Cache to sync
","controller":"tigera-installation-controller","error":"
failed to wait for tigera-installation-controller caches to sync: failed to get informer from cache: no matches for kind \"ImageSet\
" in version \"operator.tigera.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"error","ts":"2024-08-14T15:30:00Z","msg":"error received after stop sequence was engaged","error":"failed to wait for tigera-installation-controller caches to sync: failed to get informer from cache: no matches for kind \"ImageSet\" in version \"operator.tigera.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/internal.go:555"}

Also, all of the Calico-related CRDs have been deleted; kubectl get crds | grep -i calico returns no results.
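For reference, a sketch of the commands that surface the two symptoms above (the tigera-operator deployment and namespace names are assumed from the default RKE2 Calico install):

kubectl -n tigera-operator logs deploy/tigera-operator
kubectl get crds | grep -i -e calico -e tigera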

brandond commented 2 months ago

Why are the helm install jobs failing? What do the helm job pod logs say?
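(One way to pull those logs, as a sketch; job-name is the standard label the Job controller sets on its pods:)

kubectl -n kube-system logs -l job-name=helm-install-rke2-calico-crd --tail=-1
kubectl -n kube-system logs -l job-name=helm-install-rke2-calico --tail=-1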

mugambika commented 2 months ago

Logs from helm-install-rke2-calico-crd pod -

helm_v3 install --set-string global.clusterCIDR=x.x.xx/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use

brandond commented 2 months ago

Do they all say the same thing? I can't tell which of your job pods have succeeded and which have failed; you trimmed a bunch of stuff off of the kubectl output.

Also, please provide the complete pod log, not just bits of it.

mugambika commented 2 months ago

This is the complete list of Jobs in the cluster; calico-crd and calico have errors:

NAMESPACE         NAME                                            COMPLETIONS   DURATION   AGE
kube-system       helm-install-rancher-vsphere-cpi                0/1           3s         3s
kube-system       helm-install-rancher-vsphere-csi                0/1           3s         3s
kube-system       helm-install-rke2-calico                        0/1           2s         2s
kube-system       helm-install-rke2-calico-crd                    0/1           2s         2s
kube-system       helm-install-rke2-coredns                       0/1           2s         2s
kube-system       helm-install-rke2-ingress-nginx                 0/1           2s         2s
kube-system       helm-install-rke2-metrics-server                0/1           2s         2s
kube-system       helm-install-rke2-snapshot-controller           0/1           1s         1s
kube-system       helm-install-rke2-snapshot-controller-crd       0/1           2s         2s
kube-system       helm-install-rke2-snapshot-validation-webhook   0/1           1s         1s
kube-system       rke2-ingress-nginx-admission-create             0/1           56m        56m
tigera-operator   tigera-operator-uninstall                       0/1           56m        56m
Calico-crd pod logs:

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico-crd.tgz.base64
+ CHART_PATH=/tmp/rke2-calico-crd.tgz
+ [[ ! -f /chart/rke2-calico-crd.tgz.base64 ]]
+ base64 -d /chart/rke2-calico-crd.tgz.base64
+ CHART=/tmp/rke2-calico-crd.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico-crd.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-calico-crd-v3.27.002,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-crd-v3.27.002 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use

Calico pod logs:

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico.tgz.base64
+ CHART_PATH=/tmp/rke2-calico.tgz
+ [[ ! -f /chart/rke2-calico.tgz.base64 ]]
+ base64 -d /chart/rke2-calico.tgz.base64
+ CHART=/tmp/rke2-calico.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
++ helm_v3 ls --all -f '^rke2-calico$' --namespace kube-system --output json
+ LINE=rke2-calico-v3.27.200,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-v3.27.200 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x/x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico /tmp/rke2-calico.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use

Helm pod status. There are a lot of Terminating pods due to the continuous restarts of the Helm jobs, so a few pods are stuck in Pending state:

NAMESPACE                  NAME                                                     READY   STATUS              RESTARTS         AGE
kube-system                helm-install-rancher-vsphere-cpi-fgl7g                   0/1     ContainerCreating   0                0s
kube-system                helm-install-rancher-vsphere-csi-tkw6p                   0/1     Pending             0                0s
kube-system                helm-install-rke2-calico-4l5g4                           0/1     Error               1 (4s ago)       7s
kube-system                helm-install-rke2-calico-crd-mbkbh                       0/1     Error               1 (4s ago)       7s
kube-system                helm-install-rke2-coredns-s52hr                          1/1     Running             1 (3s ago)       7s
kube-system                helm-install-rke2-ingress-nginx-v6mdx                    0/1     Pending             0                6s
kube-system                helm-install-rke2-metrics-server-6s2vr                   0/1     Pending             0                6s
kube-system                helm-install-rke2-snapshot-controller-crd-k99hm          0/1     Pending             0                7s
kube-system                helm-install-rke2-snapshot-controller-sxtp6              0/1     Pending             0                7s
kube-system                helm-install-rke2-snapshot-validation-webhook-srzbg      0/1     Pending             0                7s
brandond commented 2 months ago

The jobs and pods are all just seconds old, which suggests that something is thrashing the helmchart resources and causing them to be redeployed. Can you confirm that all of the servers in the cluster have been upgraded to the latest release, and are identically configured? Check the rke2-server logs on all the server nodes to see if they are perhaps stuck in a loop deploying conflicting versions of the HelmChart AddOns. You can also do kubectl get event -A and see if there are an excessive number of events from the addon deploy controller.
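A sketch of those checks (assuming rke2-server runs under systemd, which is the default):

# confirm every server node reports the upgraded version
kubectl get nodes -o wide
# follow the rke2-server logs on each server node
journalctl -u rke2-server -f
# look for churn from the addon deploy controller
kubectl get event -A | grep -i -e helm -e addon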

mugambika commented 2 months ago

Yes, all the servers are upgraded to the latest release. We no longer see the Helm jobs getting redeployed as before, but the issue with the calico-crd and calico pods remains.

brandond commented 2 months ago

Can you identify what was causing the thrashing? Had you left the cluster only partially upgraded?

It looks like the chart is stuck uninstalling due to the job pod getting killed partway through the process. I would probably try deleting the rke2-calico and rke2-calico-crd helm secrets from the kube-system namespace so that the chart can be successfully reinstalled.
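A sketch of that cleanup, assuming the standard Helm v3 release-secret labels (owner=helm, name=<release>):

# inspect the stuck release secrets
kubectl -n kube-system get secrets -l owner=helm,name=rke2-calico
kubectl -n kube-system get secrets -l owner=helm,name=rke2-calico-crd
# remove them so the next helm-install job can install cleanly
kubectl -n kube-system delete secrets -l owner=helm,name=rke2-calico
kubectl -n kube-system delete secrets -l owner=helm,name=rke2-calico-crd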

mugambika commented 2 months ago

Yes, after the first control-plane upgrade we observed that the Helm charts were not updated, so we were troubleshooting that.

After deleting the Helm secrets for rke2-calico and rke2-calico-crd, the Helm jobs did run and some of them completed. But after that, the calico-system namespace is stuck in the Terminating state.
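One way to see what is holding the namespace (a sketch; the conditions named are the standard namespace-lifecycle conditions):

kubectl get namespace calico-system -o yaml
# status.conditions (NamespaceContentRemaining / NamespaceFinalizersRemaining)
# lists which resources or finalizers are still blocking deletion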

mugambika commented 2 months ago

Current status of the Jobs and the calico-system namespace:
NAMESPACE                  NAME                                            COMPLETIONS   DURATION   AGE
cattle-gatekeeper-system   rancher-gatekeeper-crd-delete                   0/1           85m        85m
kube-system                helm-install-rancher-vsphere-cpi                1/1           10s        107m
kube-system                helm-install-rancher-vsphere-csi                1/1           13s        107m
kube-system                helm-install-rke2-calico                        1/1           72m        107m
kube-system                helm-install-rke2-calico-crd                    1/1           72m        107m
kube-system                helm-install-rke2-coredns                       1/1           12s        107m
kube-system                helm-install-rke2-ingress-nginx                 0/1           107m       107m
kube-system                helm-install-rke2-metrics-server                0/1           107m       107m
kube-system                helm-install-rke2-snapshot-controller           0/1           107m       107m
kube-system                helm-install-rke2-snapshot-controller-crd       0/1           107m       107m
kube-system                helm-install-rke2-snapshot-validation-webhook   0/1           107m       107m
kube-system                rke2-ingress-nginx-admission-create             1/1           5h38m      6h12m
tigera-operator            tigera-operator-uninstall                       0/1           6h12m      6h12m

NAME                          STATUS        AGE
calico-system                 Terminating   8h
brandond commented 2 months ago

Yes, after first control-plane upgrade we observed that the helm charts were not updated and hence were troubleshooting on it.

Don't stop the upgrade partway through to poke at things. Some things may not complete the upgrade until all servers are on the new release. If you end up restarting any servers that are still down-level, they will redeploy older versions of the charts, and end up in a situation like this where they thrash the HelmChart resources back and forth between old and new versions.

brandond commented 2 months ago

You can try removing the finalizers from the namespace, so that it can finish deleting. Once that is done the helmchart should be able to recreate the namespace and any missing resources that were in it.
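A sketch of one common way to do that, via the namespace finalize subresource (the jq edit is an assumption about available tooling):

kubectl get namespace calico-system -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw /api/v1/namespaces/calico-system/finalize -f -

If the namespace still hangs after that, individual resources inside it may carry their own metadata.finalizers that need to be cleared as well.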

shindebshekhar commented 2 months ago

@brandond - This issue looks legitimate, as we have tested it 4-5 times.

Our observations:

  1. After we build a fresh cluster on rke2 v1.28.8+rke2r1, the upgrade to v1.29.6+rke2r1 fails. The problem is with the Calico Helm chart upgrade: the Helm pod goes into CrashLoopBackOff with the error "Error: INSTALLATION FAILED: cannot re-use a name that is still in use". The release status line is "LINE=rke2-calico-crd-v3.27.002,uninstalling", which means something is trying to uninstall the release; the install script expects a "deployed" status but instead gets "uninstalling", which it has no branch to handle (see the sketch after this list). After deleting the Helm secrets, the Helm jobs run successfully, but the calico-system namespace then goes into Terminating. What we cannot understand in the first place is why the Calico Helm pods see a status of "uninstalling" instead of "deployed" (+ LINE=rke2-calico-v3.27.200,uninstalling).

  2. On existing clusters which were built on v1.27.10+rke2r1, we are able to successfully upgrade to v1.28.8+rke2r1 and then to v1.29.6+rke2r1 with no Calico issues.
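For context, a minimal reconstruction of the status handling visible in the job trace above. The branch structure is inferred from the + lines; it is not the actual klipper-helm script, and helm_v3 stands in for the helm binary wrapper used by the job image:

LINE=$(helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json \
         | jq -r '"\(.[0].chart),\(.[0].status)"' | tr '[:upper:]' '[:lower:]')
IFS=, read -r INSTALLED_VERSION STATUS _ <<< "${LINE}"

if [[ ${INSTALLED_VERSION} =~ ^(|null)$ ]]; then
  echo "no release found: plain install"
elif [[ ${STATUS} =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]; then
  echo "recover the interrupted operation"
elif [[ ${STATUS} == deployed ]]; then
  echo "normal path: helm upgrade"
elif [[ ${STATUS} =~ ^(deleted|failed|null|unknown)$ ]]; then
  echo "clean up the old release, then install"
else
  # STATUS=uninstalling matches none of the guards above, so the job falls
  # through to a plain 'helm_v3 install', which fails with
  # "cannot re-use a name that is still in use" because the half-removed
  # release secret still exists.
  echo "Installing helm_v3 chart"
fi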