rancher/rke2

https://docs.rke2.io/
Apache License 2.0

Calico Helm Chart upgrade fails after upgrade from rke2 v1.28.8+rke2r1 to v1.28.12+rke2r1 / v1.29.6+rke2r1 #6633

Open shindebshekhar opened 2 weeks ago

shindebshekhar commented 2 weeks ago

Environmental Info: RKE2 Version: v1.28.8+rke2r1

:~ # rke2 -v
rke2 version v1.28.8+rke2r1 (42cab2f61939504cb17073e47deaea0b29fe2c1b)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux hostname 5.3.18-150300.59.161-default #1 SMP Thu May 9 06:59:05 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 master nodes, 3 worker nodes

Describe the bug:

We are trying to upgrade RKE2 from v1.28.8+rke2r1 (fresh install) to v1.28.12+rke2r1 / v1.29.6+rke2r1.

After the upgrade the rke2 service comes up, but all of the Helm jobs for the Calico system components fail. The Helm jobs are retriggered in a continuous loop (presumably trying to upgrade those components).

For some reason, instead of upgrading the Calico chart, it tries to uninstall the Tigera operator CRDs and Calico CRDs. In the process it hangs because resources are still present. See the log output below for the Calico CRD job.

kubectl get crds | grep -i calico --> No result
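The retrigger loop is also visible in the job list itself; a quick way to watch it, using the job names shown in the log below (the job-name label is set on pods by the Job controller):

kubectl get jobs -n kube-system --watch | grep helm-install-rke2-calico

# pods created by the CRD job, newest last
kubectl get pods -n kube-system -l job-name=helm-install-rke2-calico-crd \
        --sort-by=.metadata.creationTimestamp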

kubectl logs job/helm-install-rke2-calico-crd -n kube-system -f

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico-crd.tgz.base64
+ CHART_PATH=/tmp/rke2-calico-crd.tgz
+ [[ ! -f /chart/rke2-calico-crd.tgz.base64 ]]
+ base64 -d /chart/rke2-calico-crd.tgz.base64
+ CHART=/tmp/rke2-calico-crd.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico-crd.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-calico-crd-v3.27.002,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-crd-v3.27.002 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18 rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
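
The error appears to come from the Helm release record itself: the helm_v3 ls output in the trace (LINE=rke2-calico-crd-v3.27.002,uninstalling) shows the release stuck in the "uninstalling" status, so Helm refuses to reinstall under the same name. A rough way to confirm this from a node, assuming the release name and namespace from the trace above and that a helm binary is available:

# Helm v3 keeps release state in secrets labelled owner=helm and name=<release>
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd \
        -o custom-columns=NAME:.metadata.name,STATUS:.metadata.labels.status

# same check through helm itself (mirrors the helm_v3 ls call in the trace)
helm ls --all -n kube-system -f '^rke2-calico-crd$'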
brandond commented 2 weeks ago

It looks like, for some reason, the Helm job was interrupted while upgrading the chart. The helm controller responded by trying to uninstall and reinstall the chart, but the uninstall job was also interrupted, so the release is now stuck in the "uninstalling" status.

You might try deleting the Helm secrets for the rke2-calico-crd release, and for rke2-calico as well if necessary. This should allow it to reinstall the chart successfully.
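
A minimal sketch of that cleanup, with the release names and namespace taken from the logs above (Helm v3 stores each release revision as a sh.helm.release.v1.<release>.v<N> secret labelled owner=helm and name=<release>; check what the selector matches before deleting anything):

kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico-crd

# only if the rke2-calico release is stuck in the same way:
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico

Once the release secrets are gone, the next run of the helm-install job should see no existing release and install the chart cleanly.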

What process did you use to upgrade your cluster? We do not generally see issues with the Helm jobs being interrupted while upgrading, unless the upgrade is interrupted partway through, leaving nodes deploying conflicting component versions.

rjchicago commented 4 days ago

Was there any recovery from this? We ran into this issue yesterday and had to restore the controller VM and etcd from snapshots.

The symptoms and logs match exactly what was posted above. We initially attempted to install the CRDs and recreate the required resources, but the calico controller continued to crashloop.

Ultimately, the restore from snapshots worked, but we actually had to do it twice: after adding additional controllers, the Helm upgrade was re-triggered and we had to restart the process. We're currently running with just the one controller, which is not an ideal state.
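
For anyone needing to do the same, an etcd snapshot restore on RKE2 roughly follows this flow (snapshot path is an example; run on one server node, with rke2-server stopped everywhere first):

systemctl stop rke2-server

# reset the cluster from a local snapshot (default snapshot directory shown)
rke2 server \
        --cluster-reset \
        --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>

# once the reset completes, start the service again and rejoin the remaining servers
systemctl start rke2-server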