rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

Upgrade from RKE 1.3.12 to 1.3.13 failing #3019

Closed. talavis closed this issue 1 year ago.

talavis commented 2 years ago

RKE version: 1.3.13

Docker version: (docker version, docker info preferred)

$ docker --version
Docker version 20.10.17, build 100c701

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ uname -r
4.15.0-191-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-metal+KVM.

cluster.yml file:

cluster_name: dc-kube

nodes:
  # dcnode01-master
  - address: 192.168.81.138
    internal_address: 192.168.81.134
    hostname_override: dcnode01-master
    user: ubuntu
    ssh_key_path: /deployments/secrets/kube-cluster/id_rsa
    role: [controlplane,etcd]
    labels:
      host: dcnode01

  # dcnode01-worker
  - address: 192.168.81.139
    internal_address: 192.168.81.135
    hostname_override: dcnode01-worker
    user: ubuntu
    ssh_key_path: /deployments/secrets/kube-cluster/id_rsa
    role: [worker]
    labels:
      host: dcnode01
      ingress: nginx
      acceleration: none

  # dcnode02-master
  - address: 192.168.81.140
    internal_address: 192.168.81.136
    hostname_override: dcnode02-master
    user: ubuntu
    ssh_key_path: /deployments/secrets/kube-cluster/id_rsa
    role: [controlplane,etcd]
    labels:
      host: dcnode02

  # dcnode02-worker
  - address: 192.168.81.141
    internal_address: 192.168.81.137
    hostname_override: dcnode02-worker
    user: ubuntu
    ssh_key_path: /deployments/secrets/kube-cluster/id_rsa
    role: [worker]
    labels:
      host: dcnode02
      ingress: nginx
      acceleration: none

  # dcgpu01
  - address: 192.168.81.142
    internal_address: 192.168.81.143
    hostname_override: dc-kub-gpu01
    user: ubuntu
    port: 26490
    ssh_key_path: /deployments/secrets/kube-cluster/id_rsa
    role: [controlplane,etcd,worker]
    labels:
      host: dcgpu01
      ingress: nginx
      acceleration: gpu

ingress:
  provider: nginx
  node_selector:
    ingress: nginx
  options:
    proxy-body-size: 0

services:
  etcd:
    snapshot: true
    creation: 2h
    retention: 24h
  kubelet:
    extra_args:
      image-pull-progress-deadline: 5m
      max-pods: 110

network:
  options:
    flannel_iface: ens4
  plugin: flannel

# use the external dockershim
enable_cri_dockerd: true

Steps to Reproduce: Simply run ./rke_linux-amd64 up and wait until it fails.

Results:

$ ./rke_linux-amd64 --version
rke version v1.3.13
$ ./rke_linux-amd64 up
/.../
INFO[0078] [healthcheck] Start Healthcheck on service [kubelet] on host [192.168.81.138]
ERRO[0132] Failed to upgrade worker components on NotReady hosts, error: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [192.168.81.138]: Get "http://localhost:10248/healthz": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: + '[' kubelet = kubelet ']']
INFO[0132] [controlplane] Now checking status of node dcnode01-master, try #1
INFO[0137] [controlplane] Now checking status of node dcnode01-master, try #2
INFO[0142] [controlplane] Now checking status of node dcnode01-master, try #3
INFO[0147] [controlplane] Now checking status of node dcnode01-master, try #4
INFO[0152] [controlplane] Now checking status of node dcnode01-master, try #5
ERRO[0157] Host dcnode01-master failed to report Ready status with error: host dcnode01-master not ready
INFO[0157] [controlplane] Processing controlplane hosts for upgrade 1 at a time
INFO[0157] Processing controlplane host dcnode01-master
INFO[0157] [controlplane] Now checking status of node dcnode01-master, try #1
INFO[0162] [controlplane] Now checking status of node dcnode01-master, try #2
INFO[0167] [controlplane] Now checking status of node dcnode01-master, try #3
INFO[0172] [controlplane] Now checking status of node dcnode01-master, try #4
INFO[0177] [controlplane] Now checking status of node dcnode01-master, try #5
ERRO[0182] Failed to upgrade hosts: dcnode01-master with error [host dcnode01-master not ready]
FATA[0182] [controlPlane] Failed to upgrade Control Plane: [[host dcnode01-master not ready]]

The first node to be upgraded (dcnode01-master) was left in an "unreachable" state.

We never had any issues upgrading this same cluster over the last two years. The node and the cluster returned to normal after downgrading to 1.3.12 using the same cluster.yml, without any issues at all.
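
A quick way to see what the kubelet is actually doing on the failing node is to inspect the RKE-managed container and its health endpoint directly (a minimal diagnostic sketch; it assumes SSH access as the ubuntu user from cluster.yml and that the container is named kubelet, as in the log above):

$ ssh ubuntu@192.168.81.138
$ docker ps -a --filter name=kubelet      # is the container running, restarting, or exited?
$ docker logs --tail 50 kubelet           # last lines from the kubelet container
$ curl -s http://localhost:10248/healthz  # the endpoint RKE's healthcheck probes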

LinAnt commented 2 years ago

Hitting the same issue when upgrading using the Terraform RKE provider:

│ time="2022-08-31T05:27:37Z" level=info msg="[controlplane] Successfully started Controller Plane.."
│ time="2022-08-31T05:27:37Z" level=info msg="[worker] Building up Worker Plane.."
│ time="2022-08-31T05:27:37Z" level=info msg="Finding container [service-sidekick] on host [10.170.100.247], try #1"
│ time="2022-08-31T05:27:37Z" level=info msg="[sidekick] Sidekick container already created on host [10.170.100.247]"
│ time="2022-08-31T05:27:37Z" level=info msg="Restarting container [kubelet] on host [10.170.100.247], try #1"
│ time="2022-08-31T05:27:37Z" level=info msg="[healthcheck] Start Healthcheck on service [kubelet] on host [10.170.100.247]"
│ time="2022-08-31T05:28:42Z" level=error msg="Failed to upgrade worker components on NotReady hosts, error: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [10.170.100.247]: Get \"http://localhost:10248/healthz\": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: ]"
│ time="2022-08-31T05:28:42Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #1"
│ time="2022-08-31T05:28:47Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #2"
│ time="2022-08-31T05:28:52Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #3"
│ time="2022-08-31T05:28:57Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #4"
│ time="2022-08-31T05:29:02Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #5"
│ time="2022-08-31T05:29:07Z" level=error msg="Host 10.170.100.247 failed to report Ready status with error: host 10.170.100.247 not ready"
│ time="2022-08-31T05:29:07Z" level=info msg="[controlplane] Processing controlplane hosts for upgrade 1 at a time"
│ time="2022-08-31T05:29:07Z" level=info msg="Processing controlplane host 10.170.100.247"
│ time="2022-08-31T05:29:07Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #1"
│ time="2022-08-31T05:29:12Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #2"
│ time="2022-08-31T05:29:17Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #3"
│ time="2022-08-31T05:29:22Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #4"
│ time="2022-08-31T05:29:27Z" level=info msg="[controlplane] Now checking status of node 10.170.100.247, try #5"
│ time="2022-08-31T05:29:32Z" level=error msg="Failed to upgrade hosts: 10.170.100.247 with error [host 10.170.100.247 not ready]"

No kubelet container is started on the host.
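
A hedged way to confirm that from the host itself, using standard Docker and systemd tooling (assuming a systemd-managed Docker daemon):

$ docker ps -a --filter name=kubelet --format '{{.Names}}\t{{.Status}}'   # no output means the container was never created
$ journalctl -u docker --since "10 minutes ago" --no-pager                # daemon-side errors while RKE tried to (re)create it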

talavis commented 2 years ago

Same issue with 1.3.14.

kagehisa commented 2 years ago

Same problem here: 1.3.12 works fine, but 1.3.13 and 1.3.14 don't.

kagehisa commented 2 years ago

OK, a few more details. I reinstalled all the nodes and deployed them with RKE 1.3.12. In the config I added:

enable_cri_dockerd: true
kubernetes_version: "v1.23.7-rancher1-1" 

With these additional settings I deployed the cluster and everything worked fine. Then I updated RKE to 1.3.13 and redeployed the cluster with no further changes to the config; it still worked fine. Then I changed the Kubernetes version to 1.23.8-rancher1-1 and redeployed the cluster; everything worked just like before. I then switched the Kubernetes version in the config to 1.24.2-rancher1-1 and started a redeployment of the cluster. This fails with the following error:

ERRO[0396] Failed to upgrade hosts: node1.local.de with error [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [node1.local.de]: Get "http://localhost:10248/healthz": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: time="2022-09-21T09:12:06Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"]
FATA[0396] [controlPlane] Failed to upgrade Control Plane: [[Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [node1.local.de]: Get "http://localhost:10248/healthz": Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: time="2022-09-21T09:12:06Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"]]

So it isn't really RKE that is causing this, but the Kubernetes version change from 1.23 to 1.24. I also confirmed this by updating from RKE 1.3.12 to 1.3.14 and by installing RKE 1.3.13 and 1.3.14 directly on freshly installed nodes without a prior running cluster. Pinning the Kubernetes version to a 1.23 release in the RKE config makes the deployment succeed; 1.24 releases cause an error during deployment.
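
For anyone reproducing this, a sketch of how to pin the version and check what a given RKE binary supports (the --list-version flag is from the RKE docs; treat the exact version strings as examples):

$ ./rke_linux-amd64 config --list-version --all   # Kubernetes versions bundled with this RKE binary

# in cluster.yml, stay on a 1.23 release until the 1.24 failure is understood:
kubernetes_version: "v1.23.8-rancher1-1"
enable_cri_dockerd: true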

Anything I could check for further hints as to what is causing this?

talavis commented 2 years ago

Thank you @kagehisa for that information. The issue does indeed seem to be 1.24. We successfully got 1.23.10 running on our cluster with the latest rke (1.3.15). Still no luck with 1.24.

kagehisa commented 2 years ago

It would be good to know why it fails. The only major difference between 1.23 and 1.24 is that they removed everything dockershim-related. Is there something that needs to be done for an RKE-deployed cluster to successfully use the external dockershim? Or is there another problem with RKE and Kubernetes 1.24? It would be nice to receive some feedback from an RKE dev in this regard; this is a serious problem.

schnapsidee commented 2 years ago

I ran into a similar issue today trying to update from 1.23 to 1.24 with RKE 1.4.0. The issue seems to be that Docker on 1.24 needs far more CPU resources than it did on 1.23. I had a test cluster with 3 nodes with 4 cores each that ran just fine; after the update I increased the core count to 8 and Docker still kept all CPU cores at 100% utilization. htop showed a load of 17.

The RKE update subsequently runs into failures because the Docker daemon doesn't respond fast enough to requests from the update process.
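
A rough way to confirm that the Docker daemon itself is what is saturating the CPUs during the upgrade (standard tools only; process names may differ depending on how Docker is installed):

$ uptime                                  # load average, as seen with htop above
$ ps -o pid,pcpu,etime,args -C dockerd    # CPU share of the Docker daemon
$ docker stats --no-stream                # per-container CPU, e.g. kubelet and kube-apiserver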

immanuelfodor commented 1 year ago

For the abnormal CPU usage, I think we should track https://github.com/rancher/rancher/issues/38816

sajjadG commented 1 year ago

I have the same problem with the Docker daemon being slow to answer, as @schnapsidee stated. Is there an option to increase the timeout RKE waits for the node to respond?

Amitk3293 commented 1 year ago

@sajjadG , did you manage to increase this timeout?

sajjadG commented 1 year ago

@Amitk3293 No! But I removed the node with rke remove and tried rke up again, and it worked!
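
For anyone trying the same workaround, a cautious sketch of the two usual variants (take an etcd snapshot first; note that rke remove tears down the RKE-deployed components on every host in cluster.yml, whereas removing a single node is normally done by deleting it from the nodes list and re-running rke up):

$ ./rke_linux-amd64 etcd snapshot-save --name pre-upgrade-snapshot   # safety snapshot first
# variant 1: delete the problematic node from the nodes: list in cluster.yml, then
$ ./rke_linux-amd64 up
# variant 2 (destructive): tear down and re-provision everything
$ ./rke_linux-amd64 remove
$ ./rke_linux-amd64 up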

github-actions[bot] commented 1 year ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

immanuelfodor commented 1 year ago

The CPU issue has been fixed: I upgraded from v1.23.16-rancher2-1 to v1.24.13-rancher2-1 with enable_cri_dockerd: true, and CPU usage is normal.
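
For completeness, the relevant cluster.yml keys for that working combination would look roughly like this (version string taken from the comment above):

kubernetes_version: "v1.24.13-rancher2-1"
# 1.24 dropped the built-in dockershim, so keep the external one enabled
enable_cri_dockerd: true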
