Closed gercoss-bgh closed 2 weeks ago
LINE=rke2-ingress-nginx-4.8.200,uninstalling
The current chart is stuck in the uninstalling
state. This usually occurs when the upgrade is interrupted for some reason, so the chart controller tries to uninstall and reinstall it again, and this process is (again) interrupted.
The most frequent cause of interrupted helm jobs is rebooting or restarting the node that is running the upgrade job; if possible you should give nodes more time to settle during the upgrade process.
The helm job pod in the version of RKE2 that you're using does not handle the chart getting stuck in uninstalling
. RKE2 1.26 is end of life, and newer releases handle this better. If you are unable to upgrade, you'd need to delete the Helm rke2-ingress-nginx release secret from the kube-system namespace, so that the chart can be successfully reinstalled - assuming that you do not again interrupt the helm job pod.
Hi @brandond! Thanks for the quick response! We are planning to upgrade to v1.29.9+rke2r1, but before doing so, we'd like to better understand why this issue is occurring during our current upgrade process.
During the node upgrades, we encountered a similar issue where the Helm pods failed, even though the nodes were not rebooted or restarted. Would doing the upgrade be the better option?
You do need to step through Kubernetes minors, so directly upgrading to 1.29 is not an option - you need to go to the latest 1.27, then 1.28, then 1.29.
We don't generally see a lot of helm jobs get stuck. Some amount of retries are to be expected while the upgrade is in progress. If you're doing this by hand, I would probably upgrade all the server nodes, then wait for the Helm jobs to complete successfully (it may take a few retries), and then upgrade the agents.
Environmental Info: RKE2 Version: v1.26.15+rke2r1
Node(s) CPU architecture, OS, and Version:
Kernel Version: 4.18.0-513.24.1.el8_9.x86_64 OS Image: Red Hat Enterprise Linux 8.9 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: containerd://1.7.11-k3s2 Kubelet Version: v1.26.15+rke2r1 Kube-Proxy Version: v1.26.15+rke2r1
Cluster Configuration: 3 servers, 3 agents
Describe the bug:
After upgrading from v1.26.10+rke2r1 to v1.26.15+rke2r1, Helm charts sporadically attempt to reinstall. The helm pods occasionally fail with logs similar to:
In particular, this happens with the rke2-ingress-nginx Helm chart, which gets marked for uninstallation and then fails upon reinstallation due to the chart name already being in use.
Steps To Reproduce:
RKE2 Installation: Perform a minor upgrade from v1.26.10+rke2r1 to v1.26.15+rke2r1. Observe Helm charts intermittently attempting to reinstall. Method: Airgap https://docs.rke2.io/install/airgap
Expected behavior:
Helm charts should be installed once without repeated reinstallation attempts.
Actual behavior:
RKE2 repeatedly attempts to install Helm charts even when they are already installed. This primarily affects the rke2-ingress-nginx chart, which first gets marked for uninstallation but then fails during reinstallation. Each time the issue occurs, we manually delete the Helm chart's associated secrets, delete the pod, and it is recreated and installed successfully. However, the issue recurs, as if the Helm chart installations are stuck in a loop.
Additional context / logs:
Logs from Journalctl -x rke2-server: journalctl-server-node.txt