Why are the helm install jobs failing? What do the helm job pod logs say?
Logs from helm-install-rke2-calico-crd pod -
helm_v3 install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
Do they all say the same thing? I can't tell which of your job pods have succeeded and which have failed; you trimmed a bunch of stuff off the kubectl output.
Also, please provide the complete pod log, not just bits of it.
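For example (using the resource names already shown here), something like the following would capture the complete job status and the full logs:
kubectl get jobs -A
kubectl logs -n kube-system job/helm-install-rke2-calico-crd
kubectl logs -n kube-system job/helm-install-rke2-calico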
This is the complete list of Jobs in the cluster; calico-crd and calico have errors -
NAMESPACE NAME COMPLETIONS DURATION AGE
kube-system helm-install-rancher-vsphere-cpi 0/1 3s 3s
kube-system helm-install-rancher-vsphere-csi 0/1 3s 3s
kube-system helm-install-rke2-calico 0/1 2s 2s
kube-system helm-install-rke2-calico-crd 0/1 2s 2s
kube-system helm-install-rke2-coredns 0/1 2s 2s
kube-system helm-install-rke2-ingress-nginx 0/1 2s 2s
kube-system helm-install-rke2-metrics-server 0/1 2s 2s
kube-system helm-install-rke2-snapshot-controller 0/1 1s 1s
kube-system helm-install-rke2-snapshot-controller-crd 0/1 2s 2s
kube-system helm-install-rke2-snapshot-validation-webhook 0/1 1s 1s
kube-system rke2-ingress-nginx-admission-create 0/1 56m 56m
tigera-operator tigera-operator-uninstall 0/1 56m 56m
Calico-crd pod logs -
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
echo "KUBERNETES_SERVICE_HOST is using IPv6"
CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico-crd.tgz.base64
+ CHART_PATH=/tmp/rke2-calico-crd.tgz
+ [[ ! -f /chart/rke2-calico-crd.tgz.base64 ]]
+ base64 -d /chart/rke2-calico-crd.tgz.base64
+ CHART=/tmp/rke2-calico-crd.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico-crd.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-calico-crd-v3.27.002,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-crd-v3.27.002 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
Calico pod logs -
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
echo "KUBERNETES_SERVICE_HOST is using IPv6"
CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi
set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico.tgz.base64
+ CHART_PATH=/tmp/rke2-calico.tgz
+ [[ ! -f /chart/rke2-calico.tgz.base64 ]]
+ base64 -d /chart/rke2-calico.tgz.base64
+ CHART=/tmp/rke2-calico.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
++ helm_v3 ls --all -f '^rke2-calico$' --namespace kube-system --output json
+ LINE=rke2-calico-v3.27.200,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-v3.27.200 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=x.x.x.x/x --set-string global.clusterCIDRv4=x.x.x.x/x --set-string global.clusterDNS=x.x.x.x --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=x.x.x.x/x rke2-calico /tmp/rke2-calico.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
Helm pod status. There are a lot of Terminating pods due to the continuous restart of the helm jobs, and hence a few pods are in Pending state -
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system helm-install-rancher-vsphere-cpi-fgl7g 0/1 ContainerCreating 0 0s
kube-system helm-install-rancher-vsphere-csi-tkw6p 0/1 Pending 0 0s
kube-system helm-install-rke2-calico-4l5g4 0/1 Error 1 (4s ago) 7s
kube-system helm-install-rke2-calico-crd-mbkbh 0/1 Error 1 (4s ago) 7s
kube-system helm-install-rke2-coredns-s52hr 1/1 Running 1 (3s ago) 7s
kube-system helm-install-rke2-ingress-nginx-v6mdx 0/1 Pending 0 6s
kube-system helm-install-rke2-metrics-server-6s2vr 0/1 Pending 0 6s
kube-system helm-install-rke2-snapshot-controller-crd-k99hm 0/1 Pending 0 7s
kube-system helm-install-rke2-snapshot-controller-sxtp6 0/1 Pending 0 7s
kube-system helm-install-rke2-snapshot-validation-webhook-srzbg 0/1 Pending 0 7s
The jobs and pods are all just seconds old, which suggests that something is thrashing the helmchart resources and causing them to be redeployed. Can you confirm that all of the servers in the cluster have been upgraded to the latest release, and are identically configured? Check the rke2-server logs on all the server nodes to see if they are perhaps stuck in a loop deploying conflicting versions of the HelmChart AddOns. You can also do kubectl get event -A and see if there are an excessive number of events from the addon deploy controller.
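As an illustration (generic commands, not output from this cluster), the churn is usually easiest to spot by filtering events down to the HelmChart objects and watching the rke2-server unit logs:
kubectl get events -A --field-selector involvedObject.kind=HelmChart --sort-by=.lastTimestamp
kubectl get helmchart -n kube-system
journalctl -u rke2-server -f | grep -i helm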
Yes, all the servers are upgraded to the latest release. We no longer see the helm jobs getting redeployed as before, but the issue with the calico-crd and calico pods still remains.
Can you identify what was causing the thrashing? Had you left the cluster only partially upgraded?
It looks like the chart is stuck uninstalling due to the job pod getting killed partway through the process. I would probably try deleting the rke2-calico and rke2-calico-crd helm secrets from the kube-system namespace so that the chart can be successfully reinstalled.
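As a sketch, assuming the standard Helm v3 release-secret labels (verify what the selector matches before deleting anything):
# list the release secrets for the stuck charts
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd
# once confirmed, delete them so the helm-install jobs can reinstall the charts
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico-crd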
Yes, after the first control-plane upgrade we observed that the helm charts were not updated, so we were troubleshooting that.
After deleting the helm secrets for rke2-calico and rke2-calico-crd, the helm jobs did run and some of them completed. But after that the calico-system namespace is stuck in Terminating state.
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-gatekeeper-system rancher-gatekeeper-crd-delete 0/1 85m 85m
kube-system helm-install-rancher-vsphere-cpi 1/1 10s 107m
kube-system helm-install-rancher-vsphere-csi 1/1 13s 107m
kube-system helm-install-rke2-calico 1/1 72m 107m
kube-system helm-install-rke2-calico-crd 1/1 72m 107m
kube-system helm-install-rke2-coredns 1/1 12s 107m
kube-system helm-install-rke2-ingress-nginx 0/1 107m 107m
kube-system helm-install-rke2-metrics-server 0/1 107m 107m
kube-system helm-install-rke2-snapshot-controller 0/1 107m 107m
kube-system helm-install-rke2-snapshot-controller-crd 0/1 107m 107m
kube-system helm-install-rke2-snapshot-validation-webhook 0/1 107m 107m
kube-system rke2-ingress-nginx-admission-create 1/1 5h38m 6h12m
tigera-operator tigera-operator-uninstall 0/1 6h12m 6h12m
NAME STATUS AGE
calico-system Terminating 8h
> Yes, after the first control-plane upgrade we observed that the helm charts were not updated, so we were troubleshooting that.
Don't stop the upgrade partway through to poke at things. Some things may not complete the upgrade until all servers are on the new release. If you end up restarting any servers that are still down-level, they will redeploy older versions of the charts, and end up in a situation like this where they thrash the HelmChart resources back and forth between old and new versions.
You can try removing the finalizers from the namespace, so that it can finish deleting. Once that is done the helmchart should be able to recreate the namespace and any missing resources that were in it.
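A rough sketch of that (jq is assumed to be available; check what is still holding the namespace before forcing it):
# clear the namespace spec finalizers via the finalize subresource
kubectl get namespace calico-system -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw /api/v1/namespaces/calico-system/finalize -f -
# list anything left in the namespace that may still carry its own finalizers
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n1 kubectl get -n calico-system --ignore-not-found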
@brandond - This issue looks legitimate, as we have tested it 4-5 times.
Our observations -
After we build a fresh cluster on rke2 v1.28.8+rke2r1, the upgrade to v1.29.6+rke2r1 fails. The issue is with the calico helm chart upgrade, where the helm pod goes into CrashLoopBackOff with the error "Error: INSTALLATION FAILED: cannot re-use a name that is still in use". The reported line is "LINE=rke2-calico-crd-v3.27.002,uninstalling", which means something is trying to uninstall the release; the helm job script never sees a "deployed" status, only "uninstalling", which it does not handle (see the sketch after these observations).
After deleting the helm secrets, the helm jobs run successfully, but the calico-system namespace goes into Terminating.
In the first place, we are not able to understand why the calico helm releases get a status of "uninstalling" instead of "deployed".
+ LINE=rke2-calico-v3.27.200,uninstalling
On existing clusters which were built on v1.27.10+rke2r1, we are able to successfully upgrade to v1.28.8+rke2r1 and then to v1.29.6+rke2r1 with no calico issues.
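For context, the relevant part of the job script appears to behave roughly as follows, reconstructed from the -x trace above (a simplified sketch, not the actual klipper-helm source):
LINE=$(helm_v3 ls --all -f "^rke2-calico-crd$" --namespace kube-system --output json \
  | jq -r '"\(.[0].chart),\(.[0].status)"' | tr '[:upper:]' '[:lower:]')
IFS=, read -r INSTALLED_VERSION STATUS _ <<< "${LINE}"
if [[ ${STATUS} =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]; then
  : # recover the stuck pending release first
elif [[ ${STATUS} == deployed ]]; then
  : # upgrade the existing release
elif [[ ${STATUS} =~ ^(deleted|failed|null|unknown)$ ]]; then
  : # reinstall over the dead release
else
  # "uninstalling" matches none of the branches above, so the script falls through
  # to a plain helm_v3 install, which fails with "cannot re-use a name that is
  # still in use" because the release record still exists in the cluster
  : # helm_v3 install ...
fi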
Environmental Info:
RKE2 Version: v1.28.8+rke2r1
:~ # rke2 -v
rke2 version v1.28.8+rke2r1 (42cab2f61939504cb17073e47deaea0b29fe2c1b)
go version go1.21.8 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Linux hostname 5.3.18-150300.59.161-default #1 SMP Thu May 9 06:59:05 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 Master 3 Worker nodes
Describe the bug:
We are trying to upgrade rke2 from v1.28.8+rke2r1 (fresh install) to v1.29.6+rke2r1.
After the first master upgrade the rke2 service comes up, but we see all the helm jobs fail for the system components: calico, coredns, ingress, metrics, vsphere, snapshot controller. The helm jobs are retriggered in a continuous loop (possibly trying to upgrade the above components).
On further investigation, we found that the tigera operator has the below error:
Also, all the CRDs related to calico are deleted: kubectl get crds | grep -i calico --> No result