rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Helm chart upgrades fail after updating minor version from v1.26.10+rke2r1 to v1.26.15+rke2r1 #7094

Closed: gercoss-bgh closed this issue 2 weeks ago

gercoss-bgh commented 3 weeks ago

Environmental Info:

RKE2 Version: v1.26.15+rke2r1

Node(s) CPU architecture, OS, and Version:

Kernel Version: 4.18.0-513.24.1.el8_9.x86_64
OS Image: Red Hat Enterprise Linux 8.9 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11-k3s2
Kubelet Version: v1.26.15+rke2r1
Kube-Proxy Version: v1.26.15+rke2r1

Cluster Configuration: 3 servers, 3 agents

Describe the bug:

After upgrading from v1.26.10+rke2r1 to v1.26.15+rke2r1, Helm charts sporadically attempt to reinstall. The helm-install pods occasionally fail with logs similar to the following:

k logs helm-install-rke2-ingress-nginx-r77dw -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
+ tiller --listen=127.0.0.1:44134 --storage=secret
[main] 2024/10/22 17:11:05 Starting Tiller v2.17.0 (tls=false)
[main] 2024/10/22 17:11:05 GRPC listening on 127.0.0.1:44134
[main] 2024/10/22 17:11:05 Probes listening on :44135
[main] 2024/10/22 17:11:05 Storage driver is Secret
[main] 2024/10/22 17:11:05 Max history per release is 0
$HELM_HOME has been configured at /home/klipper-helm/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ timeout -s KILL 30 helm_v2 ls --all '^rke2-ingress-nginx$' --output json
++ jq -r '.Releases | length'
[storage] 2024/10/22 17:11:05 listing all releases with filter
+ V2_CHART_EXISTS=
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-ingress-nginx.tgz.base64
+ CHART_PATH=/tmp/rke2-ingress-nginx.tgz
+ [[ ! -f /chart/rke2-ingress-nginx.tgz.base64 ]]
+ base64 -d /chart/rke2-ingress-nginx.tgz.base64
+ CHART=/tmp/rke2-ingress-nginx.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-ingress-nginx.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-ingress-nginx$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-ingress-nginx-4.8.200,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-ingress-nginx-4.8.200 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 rke2-ingress-nginx /tmp/rke2-ingress-nginx.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use

In particular, this happens with the rke2-ingress-nginx Helm chart, which gets marked for uninstallation and then fails to reinstall because the release name is still in use.
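
For reference, the stuck release can be confirmed from the Helm side with the same kind of query the install script runs (a minimal sketch; the release name and namespace match the log above, adjust as needed):

helm ls --all -n kube-system -f '^rke2-ingress-nginx$'
helm status rke2-ingress-nginx -n kube-system

Both should report the release in the uninstalling state while the problem is occurring.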

Steps To Reproduce:

RKE2 Installation: Airgap method (https://docs.rke2.io/install/airgap)
1. Perform a minor upgrade from v1.26.10+rke2r1 to v1.26.15+rke2r1.
2. Observe Helm charts intermittently attempting to reinstall.

Expected behavior:

Helm charts should be installed once without repeated reinstallation attempts.

Actual behavior:

RKE2 repeatedly attempts to install Helm charts even when they are already installed. This primarily affects the rke2-ingress-nginx chart, which is first marked for uninstallation and then fails during reinstallation. Each time the issue occurs, we manually delete the Helm release's associated secrets and the helm-install pod; the pod is then recreated and the chart installs successfully. However, the issue recurs, as if the Helm chart installations are stuck in a loop.
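
A rough sketch of the manual recovery described above, assuming Helm v3 keeps its release records as kube-system secrets labeled owner=helm and name=<release>, and that the job's pods carry the standard job-name label (both hold for current Helm and Kubernetes, but verify on your cluster first):

# inspect, then remove, the stuck release record(s) for the chart
kubectl get secrets -n kube-system -l owner=helm,name=rke2-ingress-nginx --show-labels
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-ingress-nginx
# delete the failed helm-install pod; the job recreates it and the chart reinstalls
kubectl delete pod -n kube-system -l job-name=helm-install-rke2-ingress-nginx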

Additional context / logs:

Logs from journalctl for the rke2-server unit: journalctl-server-node.txt

brandond commented 3 weeks ago

LINE=rke2-ingress-nginx-4.8.200,uninstalling

The current chart is stuck in the uninstalling state. This usually occurs when an upgrade is interrupted for some reason: the chart controller tries to uninstall and reinstall the chart, and that process is then interrupted as well.

The most frequent cause of interrupted helm jobs is rebooting or restarting the node that is running the upgrade job; if possible you should give nodes more time to settle during the upgrade process.

The helm job pod in the version of RKE2 that you're using does not handle the chart getting stuck in uninstalling. RKE2 1.26 is end of life, and newer releases handle this better. If you are unable to upgrade, you'd need to delete the Helm rke2-ingress-nginx release secret from the kube-system namespace, so that the chart can be successfully reinstalled - assuming that you do not again interrupt the helm job pod.

gercoss-bgh commented 3 weeks ago

Hi @brandond! Thanks for the quick response! We are planning to upgrade to v1.29.9+rke2r1, but before doing so, we'd like to better understand why this issue is occurring during our current upgrade process.

During the node upgrades, we encountered a similar issue where the Helm pods failed even though the nodes were not rebooted or restarted. Would going ahead with the upgrade be the better option?

brandond commented 3 weeks ago

You do need to step through Kubernetes minors, so directly upgrading to 1.29 is not an option - you need to go to the latest 1.27, then 1.28, then 1.29.

We don't generally see a lot of helm jobs get stuck. Some retries are to be expected while the upgrade is in progress. If you're doing this by hand, I would probably upgrade all the server nodes, then wait for the Helm jobs to complete successfully (it may take a few retries), and then upgrade the agents.
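
Sketched as a rough checklist for each minor hop (1.26 to 1.27 to 1.28 to 1.29); the exact upgrade mechanics depend on the airgap method used, so the commands below only cover the verification between steps:

# after upgrading all server nodes, confirm they report the new version
kubectl get nodes -o wide
# wait until every helm-install job in kube-system shows COMPLETIONS 1/1 (a few retries are normal)
kubectl get jobs -n kube-system | grep helm-install
# only then upgrade the agent nodes, and repeat for the next minor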