Auto Replace does not work (v2.5.1 on rke)

Negashev commented 3 years ago

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

1) Install k8s with rke 1.2.1 on one node
2) Install rancher 2.5.1 with helm
3) Create a cluster with the hetzner node driver
4) Add a pool with auto replace node after 1-10 minutes
5) Go to the hetzner console and stop one node from the pool (with auto replace)

Result: Rancher sees that the kubelet has stopped, but nothing happens with the node in the next 30+ minutes

Other details that may be helpful:

I have only one node with k8s for rancher, and a test cluster with one node for etcd and control plane plus 2 nodes (node pool with auto-replace)

Environment information

Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.5.1
Installation option (single install/HA): HA with one node k8s

Cluster information

Cluster type (Hosted/Infrastructure Provider/Custom/Imported): hetzner docker-driver
Machine type (cloud/VM/metal) and specifications (CPU/memory): CX21
Kubernetes version (use kubectl version):

v1.19.3

Docker version (use docker version):

19.3.13

sowmyav27 commented 3 years ago

On 2.4.8

Deploy a DO cluster node driver on a single node rancher install.
Worker node's Auto Replace value is 5 minutes.
When the cluster comes up Active, power off the worker node in the cluster
the node will be seen as Unavailable
After 5 minutes, the node will be removed and another worker node is seen in Provisioning state.

On 2.5.1

Deploy a DO cluster node driver on a single node rancher install.
Worker node's Auto Replace value is 5 minutes.
When the cluster comes up Active, power off the worker node in the cluster
the node will be seen as Unavailable
A new node is NOT provisioned after 5 minutes.

mrajashree commented 3 years ago

This happens only on k8s 1.19 clusters, because the node doesn't get the taint node.kubernetes.io/unreachable:NoExecute. Could be related to https://github.com/kubernetes/kubernetes/issues/94183, although that bug description has k8s 1.18.6 as the version, whereas it does work on k8s 1.18.10.
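
To confirm whether a powered-off node actually carries that taint, something like the following could be used. This is only a minimal sketch, not Rancher's own logic: it assumes the node object was dumped with `kubectl get node <name> -o json` and loaded into a dict, and the helper name `has_unreachable_taint` is hypothetical.

```python
# Taint key and effect named in the comment above.
UNREACHABLE_KEY = "node.kubernetes.io/unreachable"
UNREACHABLE_EFFECT = "NoExecute"


def has_unreachable_taint(node: dict) -> bool:
    """Return True if the node JSON lists the unreachable:NoExecute taint.

    `node` is assumed to be the parsed output of
    `kubectl get node <name> -o json`; taints live under spec.taints,
    which is absent when the node has no taints at all.
    """
    taints = node.get("spec", {}).get("taints") or []
    return any(
        t.get("key") == UNREACHABLE_KEY and t.get("effect") == UNREACHABLE_EFFECT
        for t in taints
    )


# Hand-written sample objects for illustration only (not real cluster output).
tainted_node = {
    "spec": {
        "taints": [
            {"key": UNREACHABLE_KEY, "effect": "NoSchedule"},
            {"key": UNREACHABLE_KEY, "effect": "NoExecute"},
        ]
    }
}
untainted_node = {"spec": {}}
```

On an affected 1.19 cluster the check would be expected to return False for the powered-off node even after it is shown as Unavailable.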

sowmyav27 commented 3 years ago

On 2.5-head commit id: a90aa3ca and master-head commit id: b7e8c0d3

Deploy a DO cluster node driver on a single node rancher install.
Worker node's Auto Replace value is 5 minutes.
When the cluster comes up Active, power off the worker node in the cluster
the node will be seen as Unavailable
A new node is provisioned after 5 minutes.
Tested this using k8s 1.19 and 1.18

rancher / rancher

Auto Replace does not work (v2.5.1 on rke) #29754