harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Fail to scale down RKE2 Harvester node driver cluster #4358

Closed: albinsun closed this 2 months ago

albinsun commented 1 year ago

Description

After scaling an RKE2 cluster up from 1 to 2 VMs, scaling back down fails.

Environment

Note: the same test works when using Rancher v2.7.4; see comment below.

To Reproduce

  1. Import Harvester into Rancher ![image](https://github.com/harvester/harvester/assets/2773781/60a4cea4-771e-4150-aeeb-b9ccf1b8e0a8)
  2. [H] Upload image "focal-server-cloudimg-amd64.img" ![image](https://github.com/harvester/harvester/assets/2773781/4200846d-8134-40ca-a905-313d7c8cf73a)
  3. [H] Create VM Network "mgmt-vlan1" ![image](https://github.com/harvester/harvester/assets/2773781/a6ccf394-3e7e-4b35-bb68-de1e66816161)
  4. [R] Create RKE2 Harvester node driver cluster "rke2-hvst-r275" (Takes ~20m) ![image](https://github.com/harvester/harvester/assets/2773781/26ca476a-4f1e-4244-9939-0b20adef1623)
    • 1 Machine
    • Image: focal-server-cloudimg-amd64.img
    • Network: mgmt-vlan1
    • Other settings use default values
  5. [R] Check rke2-hvst-r275 is Active ![image](https://github.com/harvester/tests/assets/2773781/b943e1d6-fc28-4e62-9691-5f1018539321)
  6. [H] Check 1 VM is Running and Harvester cluster is healthy ![image](https://github.com/harvester/tests/assets/2773781/9d8c149c-7dfb-4d15-a01a-4af98178699f) ![image](https://github.com/harvester/tests/assets/2773781/dc056c08-9d8a-4989-8150-b4d5fb15692e)
  7. [R] Scale up rke2-hvst-r275 (Takes ~20m) Using the "+" button ![image](https://github.com/harvester/tests/assets/2773781/b943e1d6-fc28-4e62-9691-5f1018539321)
  8. [R] Check rke2-hvst-r275 is Active and has 2 machines in pool ![image](https://github.com/harvester/tests/assets/2773781/9745f38d-4caa-4fe2-923c-8140876fc3e4) ![image](https://github.com/harvester/tests/assets/2773781/de93d77e-4601-4821-b786-7d4525849f66)
  9. [H] Check 2 VMs are Running and the Harvester cluster is healthy ![image](https://github.com/harvester/tests/assets/2773781/f48e35a4-92cd-4bf3-bd89-429b076f3c17) ![image](https://github.com/harvester/tests/assets/2773781/f1293379-272e-4f19-b782-5facf6cbe8c4)
  10. [R] Scale down rke2-hvst-r275 ${~~~\color{red}\textsf{X, it did scale down to 1 VM on both sides (H and R), but the machine status on Rancher is stuck in Updating}}$
    • Status stuck in `Updating` ![image](https://github.com/harvester/harvester/assets/2773781/8294e9ab-ebe4-451d-84a5-0d58923ce85d) ![image](https://github.com/harvester/harvester/assets/2773781/0b7f9503-a9f5-4f63-8780-378eaafe6667)
    • Error message
      ```
      Rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
      ```
      ![image](https://github.com/harvester/harvester/assets/2773781/7b871d12-ab0f-457b-a7fd-54786edb43f4)
    • Provisioning log ![image](https://github.com/harvester/harvester/assets/2773781/026519ac-c832-4ac9-915e-edfebde5c7f1)
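When triaging a cluster stuck like this, the terminal error is easy to miss at the bottom of a long provisioning log. A minimal sketch of pulling it out with a filter (the embedded log is an abbreviated sample of the provisioning logs captured in this reproduction):

```shell
# Abbreviated sample of the Rancher provisioning log from this report
log='6:48:33 pm | [INFO ] waiting for infrastructure ready
7:11:44 pm | [INFO ] provisioning done
7:57:18 pm | [INFO ] waiting for all etcd machines to be deleted
7:59:49 pm | [INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.'

# Surface the terminal error without scrolling the whole log
printf '%s\n' "$log" | grep -i 'rkecontrolplane'
```

On a saved log file the same filter applies (`grep -i rkecontrolplane provision.log`); here it prints the 7:59:49 error line.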

Expected Behavior

  1. The RKE2 cluster can scale back down to 1 machine.
  2. The system should surface an error message (e.g. "XXX timed out") instead of getting stuck in an intermediate state.

Support Bundle

supportbundle_ScaleDownFail_2023-08-01T11-19-57Z.zip

Note

  1. Reproducibility: 2/2
  2. Note that the same test works when using Rancher v2.7.4; see comment below.

Upstream issue: https://github.com/rancher/rancher/issues/42582

albinsun commented 1 year ago

Another reproduction. This time both machines were stuck at Reconciling for over 30 minutes, although the Harvester side did scale down to 1 VM.

Support bundle supportbundle_ScaleDownFail2_2023-08-01T14-18-30Z.zip

Rancher

Harvester

albinsun commented 1 year ago

FYI, the same steps are OK when using Rancher v2.7.4 (RKE2: v1.25.11+rke2r1). supportbundle_ScaleDownOk-r274_2023-08-01T16-30-10Z.zip

Before

Rancher (screenshots)

Harvester (screenshots)

After

Rancher (screenshots)

Harvester (screenshots)

khushboo-rancher commented 1 year ago

@albinsun This could be related to https://github.com/rancher/rancher/issues/42034. Could you check with Rancher 2.7-head whenever you get a chance?

albinsun commented 1 year ago

FYI, test pass with Rancher v2.7-head (d8a2bca)

Environment

Test Case

  1. Scale Pool Up ${~~~\color{green}\textsf{V}}$
    • Rancher (screenshot)
    • Harvester (screenshot)
  2. Scale Pool Down ${~~~\color{green}\textsf{V}}$
    • Rancher (screenshot)
    • Harvester (screenshot)
khushboo-rancher commented 1 year ago

Thanks @albinsun for checking this. We can close it for now, as it is solved in the upcoming Rancher 2.7.6 release.

cc: @guangbochen @bk201

albinsun commented 1 year ago

Reopening, since scale-down fails with Harvester v1.2.0-rc5 + Rancher v2.7.6-rc4.

Environment

Test Case

Create Cluster

  1. Cluster Status ${~~~\color{green}\textsf{V}}$ ![image](https://github.com/harvester/harvester/assets/2773781/710a40d6-74dd-4f28-a1db-41107086ef4f)
  2. Harvester Side Status ${~~~\color{green}\textsf{V}}$ * Host ![image](https://github.com/harvester/harvester/assets/2773781/15320347-d98c-4216-9224-9107ad99fbcc) * VM ![image](https://github.com/harvester/harvester/assets/2773781/3e8ab5bd-f0cd-41ed-9d1a-f3f244726e42)
  3. Check Workload and Service Discovery ${~~~\color{green}\textsf{V}}$

Scale up

  1. Cluster Status ${~~~\color{green}\textsf{V}}$ ![image](https://github.com/harvester/harvester/assets/2773781/ee91a751-6641-4367-87a6-1af8190ba2bd)
  2. Harvester Side Status ${~~~\color{green}\textsf{V}}$ * Host ![image](https://github.com/harvester/harvester/assets/2773781/6bfad844-e11f-4e21-a96a-99e45c59d9fb) * VM ![image](https://github.com/harvester/harvester/assets/2773781/0b7b6a64-601e-4480-8760-d3f79be3f027)
  3. Check Workload and Service Discovery ${~~~\color{green}\textsf{V}}$ ![image](https://github.com/harvester/harvester/assets/2773781/10bdcdb7-d9a0-4b08-9fc5-704d41212327)

Scale down

  1. Cluster Status ${~~~\color{red}\textsf{X}}$
    • Error: `rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.` ![image](https://github.com/harvester/harvester/assets/2773781/6c973c23-91dd-4787-bd7f-b5f450348088)
    • Support bundle: [supportbundle_rke2-harvester-26a.zip](https://github.com/harvester/harvester/files/12419105/supportbundle_rke2-harvester-26a.zip)
    • Provision Log:

```
6:48:33 pm | [INFO ] waiting for infrastructure ready
6:48:35 pm | [INFO ] waiting for viable init node
6:52:31 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for agent to check in and apply initial plan
6:52:49 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
6:56:15 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
6:56:31 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
6:58:05 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: calico
7:01:23 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:03:35 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver
7:03:37 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager
7:03:39 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager, kube-scheduler
7:04:41 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager
7:04:51 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:05:19 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver
7:05:29 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:05:33 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-scheduler
7:05:51 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:06:17 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver
7:06:19 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:07:15 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager
7:07:17 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager, kube-scheduler
7:07:53 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-scheduler
7:07:59 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver, kube-scheduler
7:08:03 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-scheduler
7:08:13 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: etcd, kube-apiserver, kube-scheduler
7:08:59 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver, kube-scheduler, kubelet
7:09:01 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kubelet
7:09:35 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:10:09 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager, kube-scheduler
7:10:19 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager
7:10:25 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:11:26 pm | [INFO ] non-ready bootstrap machine(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k and join url to be available on bootstrap node
7:11:42 pm | [INFO ] marking control plane as initialized and ready
7:11:44 pm | [INFO ] provisioning done
7:31:16 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for plan to be applied
7:31:52 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: creating server [fleet-default/rke2-hvst-26a-pool1-ddb478b9-j9pvc] of kind (HarvesterMachine) for machine rke2-hvst-26a-pool1-7bd8cf9947-j6dr8 in infrastructure provider, waiting for agent to check in and apply initial plan
7:33:34 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for agent to check in and apply initial plan
7:33:44 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
7:35:34 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: etcd, kube-apiserver
7:35:36 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver
7:35:38 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-controller-manager, kube-scheduler
7:36:24 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
7:36:54 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
7:38:40 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-controller-manager, kube-scheduler
7:38:50 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico
7:39:26 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for cluster agent to connect
7:39:40 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico
7:40:50 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver
7:41:14 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico
7:41:46 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-apiserver
7:41:48 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico
7:41:58 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, etcd, kube-apiserver
7:42:12 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager
7:42:14 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
7:42:20 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-apiserver
7:42:32 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, etcd, kube-apiserver
7:42:44 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: etcd, kube-apiserver, kube-scheduler
7:42:58 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver, kube-scheduler
7:43:02 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-scheduler
7:43:10 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kubelet
7:43:16 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kube-apiserver, kubelet
7:43:26 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: etcd, kube-apiserver, kubelet
7:43:52 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for probes: kubelet
7:44:00 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: Node condition Ready is False., waiting for probes: kubelet
7:44:02 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico, kube-scheduler
7:44:08 pm | [INFO ] configuring etcd node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: calico
7:44:28 pm | [INFO ] rke2-hvst-26a-pool1-7bd8cf9947-j6dr8
7:44:30 pm | [INFO ] provisioning done
7:56:04 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-zml7k: waiting for plan to be applied
7:56:06 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for plan to be applied
7:56:52 pm | [INFO ] configuring bootstrap node(s) rke2-hvst-26a-pool1-7bd8cf9947-j6dr8: waiting for probes: kubelet
7:57:18 pm | [INFO ] waiting for all etcd machines to be deleted
7:59:37 pm | [INFO ] waiting for at least one control plane, etcd, and worker node to be registered
7:59:49 pm | [INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
```
  2. Harvester Side Status ${~~~\color{green}\textsf{V}}$ * Host ![image](https://github.com/harvester/harvester/assets/2773781/a4edb230-c380-43a6-9c8c-98d58fc19677) * VM ![image](https://github.com/harvester/harvester/assets/2773781/2f72b5b0-627b-4094-a514-1a7bb75b8081)
  3. Check Workload and Service Discovery ${~~~\color{gray}\textsf{?}}$
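Our reading of the error text is that Rancher concludes the etcd plane was replaced when no machine carrying the etcd role still has an associated plan. A minimal offline sketch of that condition check; the machine name is a sample, and the kubectl commands in the comments are assumptions about the provisioning namespace layout, not verified against Rancher internals:

```shell
# Offline sketch: the inputs below would normally come from a live setup, e.g.
#   kubectl -n fleet-default get machines.cluster.x-k8s.io          (hypothetical source)
#   kubectl -n fleet-default get secrets | grep machine-plan        (hypothetical source)
etcd_machines="rke2-hvst-26a-pool1-7bd8cf9947-j6dr8"  # sample: etcd-role machines present
machines_with_plans=""                                 # sample: no plan found for any of them

found=no
for m in $etcd_machines; do
  # check whether machine $m appears in the list of machines that have plans
  case " $machines_with_plans " in
    *" $m "*) found=yes ;;
  esac
done

if [ "$found" = no ]; then
  echo "no etcd machine has a plan -> etcd plane considered replaced"
fi
```

With the sample inputs above this prints the "etcd plane considered replaced" verdict, matching the state the error message describes.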

guangbochen commented 1 year ago

Following the testing plan in https://github.com/rancher/rancher/issues/42121#issuecomment-1656310669, scaling the master machine pool down from 2 to 1 seems to work fine in my test (Harvester v1.2.0-rc5 + Rancher v2.7.6-rc4).

albinsun commented 1 year ago

Test k3s-rancher (v2.7.6-rc4), scale up and down twice, looks fine.

Environment

Steps

  1. Create Cluster myk3s ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/4e2a88c3-11ad-47ed-bde8-7c79cc36d051) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/ffae9813-bea1-4f8e-8d8a-6a2f5e4e458b)
  2. Scale Up (1st Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/c932d657-b9cf-49e3-abd2-856b23468b0a) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/12106795-0c78-4558-9d17-8888364b6061)
  3. Scale Down (1st Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/2bdeb351-f751-4cf1-b599-f6500838a92a) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/c8271874-629b-4d87-80a7-035378a7509e)
  4. Scale Up (2nd Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/8e32a115-1a16-4834-ad1a-13c43b2d5fbe) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/602546f9-101d-4b53-871f-bb58be57c8c6)
  5. Scale Down (2nd Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/bc129ff9-c9b6-4e97-bf6e-a730ea455a44) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/9a333862-1244-4661-a2c7-838d8857b8be)
bk201 commented 1 year ago

I tested with vcluster-rancher (Rancher v2.7.6-rc4). Scale up and down twice and it looks good.

albinsun commented 1 year ago

Tested rke2-rancher (v2.7.6-rc5): scaled up and down twice; scale-down failed in the 2nd round.

Note

In the failed case, we noticed that the remaining machine's name matches neither of the 2 machines that existed before the scale-down.

  1. Initial State (2967x) (screenshot)
  2. 1st scale up (2967x :heavy_check_mark:) (screenshot)
  3. 1st scale down (bt9bv :heavy_check_mark:) (screenshot)
  4. 2nd scale up (bt9bv :heavy_check_mark:) (screenshot)
  5. 2nd scale down (vs52g :question:) (screenshot)
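The observation above can be stated as a tiny membership check over the machine-name suffixes seen in this run (taken directly from the list above):

```shell
# Machine name suffixes observed in this run
earlier_names="2967x bt9bv"   # names seen before the 2nd scale-down
survivor="vs52g"              # the machine left after the 2nd scale-down

# Is the survivor one of the earlier machines?
case " $earlier_names " in
  *" $survivor "*) verdict="survivor was one of the earlier machines" ;;
  *)               verdict="survivor is a brand-new machine; the earlier pair was replaced" ;;
esac
echo "$verdict"
```

This prints the "brand-new machine" verdict, which is exactly the suspicious behavior: the scale-down appears to have replaced the control plane machines rather than removing just one.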

Environment

Steps

  1. Create Cluster myrke2 ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/2d89426e-9d3e-44b5-836b-33584ac5bb6c) ![image](https://github.com/harvester/harvester/assets/2773781/57aa3a38-fd70-4c33-b6ec-55ddc488b02e) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/194befe9-8f9a-4a33-8c4f-37829f96e222)
  2. Scale Up (1st Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/dabb04ba-961a-473d-9430-2a38f6c66652) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/5c8cb3da-2a48-4978-a624-aed6754f38ee)
  3. Scale Down (1st Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/1092f379-71a1-4e22-a4ad-071a83ee8b76) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/7b1888c6-bdd9-4118-b342-e3867121bb1a)
  4. Scale Up (2nd Round) ${~~~\color{green}\textsf{V}}$ * Rancher ![image](https://github.com/harvester/harvester/assets/2773781/2cad1117-196b-484e-b007-343ab8adfe5d) * Harvester ![image](https://github.com/harvester/harvester/assets/2773781/a923677d-7868-4dc1-a217-5a3f747e69a5)
  5. Scale Down (2nd Round) ${~~~\color{red}\textsf{X}}$
    • Rancher: stuck in `Updating` with error message `Rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.` ![image](https://github.com/harvester/harvester/assets/2773781/f1f40782-d8d7-477f-9f16-ce1e5c4a8e7e)

      Machine status `Waiting for Node Ref`:

```
...
status:
  bootstrapReady: true
  conditions:
    - lastTransitionTime: '2023-08-28T05:07:56Z'
      status: 'True'
      type: Ready
    - lastTransitionTime: '2023-08-28T05:07:54Z'
      status: 'True'
      type: BootstrapReady
    - lastTransitionTime: '2023-08-28T05:09:50Z'
      status: 'True'
      type: InfrastructureReady
    - lastTransitionTime: '2023-08-28T05:07:54Z'
      reason: WaitingForNodeRef
      severity: Info
      status: 'False'
      type: NodeHealthy
...
```

      Provision Log:

```
11:24:12 am | [INFO ] waiting for infrastructure ready
11:24:14 am | [INFO ] waiting for viable init node
11:27:30 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for agent to check in and apply initial plan
11:27:38 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
11:29:34 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
11:29:50 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
11:30:46 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-controller-manager, kube-scheduler
11:30:52 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico
11:31:55 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, etcd, kube-apiserver
11:31:57 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-apiserver
11:31:59 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-controller-manager, kube-scheduler
11:32:07 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico
11:34:31 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-apiserver
11:34:33 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-controller-manager, kube-scheduler
11:34:49 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico, kube-scheduler
11:34:57 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico
11:35:17 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for cluster agent to connect
11:35:47 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver
11:35:49 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:36:13 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager
11:36:19 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for cluster agent to connect
11:37:31 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager
11:37:33 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:37:51 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
11:37:53 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:38:33 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler
11:38:35 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
11:38:37 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:39:11 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for cluster agent to connect
11:39:13 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: calico
11:39:17 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for cluster agent to connect
11:43:35 am | [INFO ] non-ready bootstrap machine(s) myrke2-pool1-84b4bf7d7c-2967x and join url to be available on bootstrap node
11:43:53 am | [INFO ] provisioning done
11:54:37 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for plan to be applied
11:54:49 am | [INFO ] waiting for machine fleet-default/myrke2-pool1-84b4bf7d7c-bt9bv driver config to be saved
11:56:51 am | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: creating server [fleet-default/myrke2-pool1-5b5b00d2-gjltc] of kind (HarvesterMachine) for machine myrke2-pool1-84b4bf7d7c-bt9bv in infrastructure provider, waiting for agent to check in and apply initial plan
11:56:55 am | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for agent to check in and apply initial plan
11:57:03 am | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
11:57:51 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver
11:57:55 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver
11:58:03 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:58:33 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
11:58:35 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:58:45 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
11:58:47 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:59:01 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
11:59:07 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
11:59:13 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-controller-manager, kube-scheduler
11:59:15 am | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager
11:59:23 am | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico, etcd, kube-apiserver, kube-controller-manager, kube-scheduler
11:59:53 am | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico, kube-apiserver, kube-controller-manager, kube-scheduler
12:00:29 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver
12:00:31 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
12:00:51 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager
12:00:57 pm | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico, kube-controller-manager, kube-scheduler
12:00:59 pm | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico
12:01:45 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver
12:01:47 pm | [INFO ] configuring etcd node(s) myrke2-pool1-84b4bf7d7c-bt9bv: waiting for probes: calico
12:01:51 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for cluster agent to connect
12:02:33 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver
12:02:45 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver, kube-controller-manager
12:02:49 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler
12:03:17 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
12:03:19 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
12:03:31 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
12:03:33 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
12:03:49 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
12:04:33 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
12:04:39 pm | [INFO ] configuring bootstrap node(s) myrke2-pool1-84b4bf7d7c-2967x: waiting for probes: kube-controller-manager, kube-scheduler
1:07:55 pm | [INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
```

    • Harvester ![image](https://github.com/harvester/harvester/assets/2773781/b3769485-43f1-40b8-b3a8-a381e100ae9d)
bk201 commented 1 year ago

Upstream issue: https://github.com/rancher/rancher/issues/42582

albinsun commented 10 months ago

Also hit in a K3s cluster: the system deleted both original nodes and created a new one, then the status got stuck in Updating (rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.). (screenshot)

TachunLin commented 9 months ago

We encountered a similar issue on Rancher v2.7.9 with Harvester v1.1.3-rc1. While scaling down the RKE2 cluster from 3 nodes to 2, the first node was stuck in the Reconciling state.


Attached the Harvester support bundle and Rancher instance logs for more information.

innobead commented 3 months ago

https://github.com/rancher/rancher/issues/42582 seems fixed in Rancher 2.9.0.

FrankYang0529 commented 2 months ago

The upstream PR https://github.com/kubernetes-sigs/cluster-api/pull/10431 was included in CAPI v1.6.6. In Rancher v2.9.0-alpha5, the default CAPI version is v1.6.6. I have confirmed the issue can't be reproduced with that version.
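For anyone verifying their own setup carries the fix, a minimal sketch of a version gate against CAPI v1.6.6. The sample tag stands in for a live lookup; the kubectl command in the comment is an assumption about where Rancher's embedded CAPI controller lives and may differ between Rancher versions:

```shell
# Hypothetical live lookup (namespace/deployment names are assumptions):
#   kubectl -n cattle-provisioning-capi-system get deploy capi-controller-manager \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
# Here we check a sample tag offline.
image_tag="v1.6.6"   # sample value standing in for the tag read from the live cluster
required="v1.6.6"    # first CAPI release containing the upstream fix, per this thread

# sort -V orders version strings numerically; the first line is the lower version
lowest=$(printf '%s\n%s\n' "$image_tag" "$required" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "CAPI $image_tag includes the fix (>= $required)"
else
  echo "CAPI $image_tag predates the fix (< $required)"
fi
```

With the sample tag this reports that the fix is included; a tag like `v1.6.5` would take the other branch.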

harvesterhci-io-github-bot commented 2 months ago

Pre Ready-For-Testing Checklist

harvesterhci-io-github-bot commented 2 months ago

Automation e2e test issue: harvester/tests#1348

bk201 commented 2 months ago

Also note that the fix is not going to be backported to Rancher 2.8.x: https://github.com/rancher/rancher/issues/42582#issuecomment-2197670273

albinsun commented 2 months ago

Tested OK on Harvester v1.3.1 + Rancher v2.9.0-alpha7 with RKE2 v1.28.10+rke2r1; closing as fixed.

Environment

Test Cases

  1. Join Harvester to Rancher ![image](https://github.com/harvester/harvester/assets/2773781/aa727175-dc66-4e1b-ab56-cf7cee8f2069)

Case 1 -> 2 -> 1

  1. Create a single-node RKE2 cluster
     * Nodes
       1. _myrke2-pool1-cm7b6-p5ssx_
       ![image](https://github.com/harvester/harvester/assets/2773781/b42b731f-99b9-4a9a-b1c9-8ffafa2167bf)
     * Harvester Apps
       ![image](https://github.com/harvester/harvester/assets/2773781/2ae8c39e-3cfa-45c2-85a2-138c543f9a67)
  2. Deploy Nginx (w/o PVC) and an associated load balancer
     * Deployment
       ![image](https://github.com/harvester/harvester/assets/2773781/c0e36f5b-69bd-4b63-8e87-989a857dbcca)
     * Load Balancer
       ![image](https://github.com/harvester/harvester/assets/2773781/af86e3c2-6f05-411b-91da-5905e6ece5e5)
  3. ROUND 1: Scale up the RKE2 cluster, check Nginx still works.
     * Nodes
       1. _myrke2-pool1-cm7b6-p5ssx_ (Existed)
       2. _myrke2-pool1-cm7b6-22x25_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/9875c58d-eed4-4753-a45e-353e328392d6)
     * Harvester Apps
       ![image](https://github.com/harvester/harvester/assets/2773781/cd8a9651-aad9-44cb-aa65-bc65e1298fce)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/8e02c077-593c-42f1-884d-8fc50c020e1d)
  4. ROUND 1: Scale down the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-pool1-cm7b6-p5ssx_ (Existed)
       2. _myrke2-pool1-cm7b6-22x25_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/56e33582-e393-48ef-98f4-755b66434380)
     * Harvester Apps
       ![image](https://github.com/harvester/harvester/assets/2773781/aad3f03c-a6b2-4f3e-9a10-891a3e9f9a14)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/2cd7e6b9-4771-4e5d-909b-f3e9576ad6f2)
  5. ROUND 2: Scale up the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-pool1-cm7b6-22x25_ (Existed)
       2. _myrke2-pool1-cm7b6-n7cbz_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/1cc91594-84d7-4839-a760-e85747c22f35)
       ![image](https://github.com/harvester/harvester/assets/2773781/fa8b1a68-d5db-42ec-8b37-fd1ac56f30ce)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/e4ad7b1c-6995-4b0d-9ae1-b773ab1e43ac)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/76a71e5c-a14d-4aad-b594-1b2af66fc8bb)
  6. ROUND 2: Scale down the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-pool1-cm7b6-22x25_ (Existed)
       2. _myrke2-pool1-cm7b6-n7cbz_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/922e1cae-3b82-4db9-9e41-614c04370e66)
       ![image](https://github.com/harvester/harvester/assets/2773781/4ca31f47-c67f-4b19-8e98-84a67fb3e487)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/312c3b41-50e1-435e-94ea-5e3ba83f6351)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/658efc27-558e-4696-8ec4-ecb3ba9580d2)
  7. Delete the RKE2 cluster
     * Rancher
       ![image](https://github.com/harvester/harvester/assets/2773781/3cefba98-df32-4979-873c-b059ff7dc283)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/8042f28e-1283-47be-bdf7-a7724dd9b47a)
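The test workload above (Nginx without a PVC, fronted by a `LoadBalancer` Service that Harvester's cloud provider serves in the guest cluster) can be sketched as a minimal manifest. This is an illustrative reconstruction, not the exact manifest used in the test; the names, labels, and image tag are assumptions:

```yaml
# Sketch of the test workload: an Nginx Deployment (no PVC) plus a
# LoadBalancer Service. Names and labels are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:stable
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
spec:
  # In a Harvester node driver cluster, type LoadBalancer is fulfilled
  # by the Harvester cloud provider.
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```

Checking that Nginx "still works" after each scale operation then amounts to hitting the load balancer's external IP on port 80.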

Case 2 -> 3 -> 2

  1. Create a 2-node RKE2 cluster
     * Rancher
       1. _myrke2-2-pool1-rqdk5-6fsvl_
       2. _myrke2-2-pool1-rqdk5-ctc9z_
       ![image](https://github.com/harvester/harvester/assets/2773781/be52d489-0d29-4866-a4dd-ba35de7c77f8)
       ![image](https://github.com/harvester/harvester/assets/2773781/481dc859-9ba2-4d79-acd8-ef2849ae09bb)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/937e89e7-4500-485f-ba4f-6c6064494123)
  2. Deploy Nginx (w/o PVC) and an associated load balancer
     * Deployment
       ![image](https://github.com/harvester/harvester/assets/2773781/c86bf449-9434-423e-9c98-d23c1a4eee46)
     * Load Balancer
       ![image](https://github.com/harvester/harvester/assets/2773781/43416c5c-431c-4214-ac3b-eafa1196e346)
  3. ROUND 1: Scale up the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-2-pool1-rqdk5-5lfpd_ (New)
       2. _myrke2-2-pool1-rqdk5-6fsvl_ (Existed)
       3. _myrke2-2-pool1-rqdk5-ctc9z_ (Existed)
       ![image](https://github.com/harvester/harvester/assets/2773781/950256e4-bf82-4138-8673-fa7683c82ff3)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/c2690f16-397a-4a60-9845-fa024192cca6)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/0b8f1c51-167c-4e4d-9168-ffbc6f28cb44)
       ![image](https://github.com/harvester/harvester/assets/2773781/79cd3818-5f14-45b6-bf1d-14d7a309df97)
  4. ROUND 1: Scale down the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-2-pool1-rqdk5-5lfpd_ (New)
       2. _myrke2-2-pool1-rqdk5-6fsvl_ (Existed)
       3. _myrke2-2-pool1-rqdk5-ctc9z_ (Existed)
       ![image](https://github.com/harvester/harvester/assets/2773781/222652ad-48e8-4119-9add-8e69c2ad7179)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/697e15d4-e241-4725-94ec-0020ed844fcd)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/a189ea8f-08b8-4df6-b0d9-3bba0be0f1b5)
  5. ROUND 2: Scale up the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-2-pool1-rqdk5-5lfpd_ (Existed)
       2. _myrke2-2-pool1-rqdk5-ctc9z_ (Existed)
       3. _myrke2-2-pool1-rqdk5-lm7l2_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/241e9792-f3d0-4cb8-83ed-507137eb948a)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/b67876fa-00e4-4540-b6b4-3c22898dba98)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/68f8be5c-12a1-4a4a-892e-9d56f0aa4119)
  6. ROUND 2: Scale down the RKE2 cluster, check Nginx still works.
     * Rancher
       1. _myrke2-2-pool1-rqdk5-5lfpd_ (Existed)
       2. _myrke2-2-pool1-rqdk5-ctc9z_ (Existed)
       3. _myrke2-2-pool1-rqdk5-lm7l2_ (New)
       ![image](https://github.com/harvester/harvester/assets/2773781/d45c9dca-587e-4557-b7c6-f8a0581229f3)
     * Harvester
       ![image](https://github.com/harvester/harvester/assets/2773781/20c29599-608c-41e4-a74b-f596bd8b58a2)
     * Nginx still works
       ![image](https://github.com/harvester/harvester/assets/2773781/5ff4fa4d-154e-423e-ab3d-25de1c45c65c)
  7. Delete the RKE2 cluster
     ![image](https://github.com/harvester/harvester/assets/2773781/ecca501e-0dce-4f6e-aca0-7982af108ec2)
     ![image](https://github.com/harvester/harvester/assets/2773781/6b1a4470-09c7-4834-b791-d9ca0ad63127)