Hitting the same issue when upgrading using the Terraform RKE provider:
│ time="2022-08-31T05:27:37Z" level=info msg="[controlplane] Successfully started Controller Plane.."
│ time="2022-08-31T05:27:37Z" level=info msg="[worker] Building up Worker Plane.."
│ time="2022-08-31T05:27:37Z" level=info msg="Finding container [service-sidekick] on host [10.170.100.247], try #1"
│ time="2022-08-31T05:27:37Z" level=info msg="[sidekick] Sidekick container already created on host [10.170.100.247]"
│ time="2022-08-31T05:27:37Z" level=info msg="Restarting container [kubelet] on host [10.170.100.247], try #1"
│ time="2022-08-31T05:27:37Z" level=info msg="[healthcheck] Start Healthcheck on service [kubelet] on host [10.170.100.247]"
│ time="2022-08-31T05:28:42Z" level=error msg="Failed to upgrade worker components on NotReady hosts, error: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [10.170.100.247]: Get \"http://localhost:10248/healthz\": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: ]"
│ time="2022-08-31T05:28:42Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #1"
│ time="2022-08-31T05:28:47Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #2"
│ time="2022-08-31T05:28:52Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #3"
│ time="2022-08-31T05:28:57Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #4"
│ time="2022-08-31T05:29:02Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #5"
│ time="2022-08-31T05:29:07Z" level=error msg="Host 10.170.100.247 failed to report Ready status with error: host 10.170.100.247 not ready"
│ time="2022-08-31T05:29:07Z" level=info msg="[controlplane] Processing controlplane hosts for upgrade 1 at a time"
│ time="2022-08-31T05:29:07Z" level=info msg="Processing controlplane host 10.170.100.247"
│ time="2022-08-31T05:29:07Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #1"
│ time="2022-08-31T05:29:12Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #2"
│ time="2022-08-31T05:29:17Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #3"
│ time="2022-08-31T05:29:22Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #4"
│ time="2022-08-31T05:29:27Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #5"
│ time="2022-08-31T05:29:32Z" level=error msg="Failed to upgrade hosts: 10.170.100.247 with error [host 10.170.100.247 not ready]"
No kubelet container is started on the host.
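For anyone debugging this, a few generic checks to run directly on the affected node (plain Docker/curl commands, not RKE-specific; the host IP is the one from the log above):

```sh
# Run on the affected node (10.170.100.247 in the log above).

# Did RKE create/start the kubelet container at all?
docker ps -a --filter name=kubelet

# If the container exists but is exiting or restarting, check its logs:
docker logs --tail 100 kubelet

# Probe the endpoint RKE's healthcheck hits (kubelet healthz on port 10248):
curl -sS http://localhost:10248/healthz; echo
```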
Same issue with 1.3.14.
Same problem here. 1.3.12 works fine; 1.3.13 and 1.3.14 don't.
OK, a few more details. I reinstalled all the nodes and deployed them with RKE 1.3.12. In the config I added:
enable_cri_dockerd: true
kubernetes_version: "v1.23.7-rancher1-1"
With these additional settings I deployed the cluster and everything worked fine. Then I updated RKE to 1.3.13 and redeployed the cluster with no further changes to the config; it still worked fine.
Then I changed the Kubernetes version to 1.23.8-rancher1-1 and redeployed the cluster. Everything worked just like before.
I then switched the Kubernetes version in the config to 1.24.2-rancher1-1 and started a redeployment of the cluster. This fails with the following error:
ERRO[0396] Failed to upgrade hosts: node1.local.de with error [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [node1.local.de]: Get "http://localhost:10248/healthz": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: time="2022-09-21T09:12:06Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"]
FATA[0396] [controlPlane] Failed to upgrade Control Plane: [[Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [node1.local.de]: Get "http://localhost:10248/healthz": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: time="2022-09-21T09:12:06Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"]]
So it isn't really RKE that is causing this, but the Kubernetes version change from 1.23 to 1.24. I also confirmed this by updating from RKE 1.3.12 to 1.3.14 and by directly installing RKE 1.3.13 and 1.3.14 on freshly installed nodes without a prior running cluster. Pinning the Kubernetes version to a 1.23 release in the RKE config makes the deployment succeed; 1.24 releases cause an error during deployment.
Anything I could check for further hints as to what is causing this?
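A sketch of what that reproduction amounts to, assuming the usual RKE CLI invocation (the version strings are the ones from this comment, written with the leading "v" as in the earlier config):

```sh
# cluster.yml (relevant keys only, as described above):
#   enable_cri_dockerd: true
#   kubernetes_version: "v1.23.8-rancher1-1"   # deploys/upgrades fine
#   kubernetes_version: "v1.24.2-rancher1-1"   # fails the kubelet healthcheck
#
# Re-run the deployment after editing the version:
./rke_linux-amd64 up --config cluster.yml
```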
Thank you @kagehisa for that information. The issue does indeed seem to be 1.24. We successfully got 1.23.10 running on our cluster with the latest RKE (1.3.15). Still no luck with 1.24.
It would be good to know why it fails. The only major difference between 1.23 and 1.24 is that everything dockershim-related was removed. Is there something that needs to be done for an RKE-deployed cluster to successfully use the external dockershim (cri-dockerd)? Or is there another problem with RKE and Kubernetes 1.24? It would be nice to receive some feedback from an RKE dev in this regard; this is a serious problem.
I ran into a similar issue today trying to update from 1.23 to 1.24 with RKE 1.4.0. The issue seems to be that Docker on 1.24 needs far more CPU resources than it did on 1.23. I had a test cluster of 3 nodes with 4 cores each that ran just fine; after the update I increased the CPU count to 8, and Docker still used all cores at 100% utilization. htop showed a load of 17.
The RKE update subsequently fails because the Docker daemon doesn't respond fast enough to requests from the update process.
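A couple of generic checks to confirm it is the Docker daemon itself (and not a workload container) consuming the CPU, assuming shell access to a node:

```sh
# Overall load and top CPU consumers on the node (look for dockerd/containerd):
top -bn1 | head -n 20

# Per-container CPU usage, to rule out a specific workload container:
docker stats --no-stream
```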
For the abnormal CPU usage, I think we should track https://github.com/rancher/rancher/issues/38816
I have the same problem with the Docker daemon being slow to respond, as @schnapsidee stated. Is there an option to increase the timeout RKE waits for the node to respond?
@sajjadG, did you manage to increase this timeout?
@Amitk3293 No! But I ran rke remove for the node and tried rke up again, and it worked!
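For anyone trying the same workaround, a rough sketch (an assumption about what was done, not an exact transcript; note that rke remove tears down everything defined in cluster.yml):

```sh
# Heavy-handed variant: remove everything RKE provisioned, then bring it back up.
./rke_linux-amd64 remove --config cluster.yml
./rke_linux-amd64 up --config cluster.yml

# Per-node variant: delete the affected node's entry from the nodes: list in
# cluster.yml, run `rke up` so it is removed from the cluster, then restore the
# entry and run `rke up` again.
```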
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
The CPU issue has been fixed: upgraded v1.23.16-rancher2-1 -> v1.24.13-rancher2-1 with enable_cri_dockerd: true, and the CPU usage is normal.
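For completeness, a minimal sketch of the upgrade path reported to work, assuming the standard rke up invocation (version strings as in the comment above):

```sh
# cluster.yml before the upgrade:
#   enable_cri_dockerd: true
#   kubernetes_version: "v1.23.16-rancher2-1"
#
# cluster.yml after the upgrade:
#   enable_cri_dockerd: true
#   kubernetes_version: "v1.24.13-rancher2-1"
#
# Apply the change:
./rke_linux-amd64 up --config cluster.yml
```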
RKE version: 1.3.13
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-metal + KVM
cluster.yml file:
Steps to Reproduce: Simply run ./rke_linux-amd64 and wait until it fails.
Results: The first node (dcnode01-master) to be upgraded was put into an "unreachable" state. Never had any issues with upgrading the same cluster for the last two years. The node/cluster was returned to normal after downgrading to 1.3.12 using the same cluster.yml, without any issues at all.