rancher / dashboard

The Rancher UI
https://rancher.com
Apache License 2.0

Epic: Machine Pool UI not showing the right vSphere Template #9559

Open gaktive opened 11 months ago

gaktive commented 11 months ago

Internal reference: SURE-6778 Reported in 2.7.5

A user deploys RKE2 clusters (1.24.x) in Rancher 2.7.5 as vSphere clusters using Terraform. Creation works fine: the cluster appears in Rancher and is created normally.

They noticed that once the cluster is created, clicking 'Edit Config' in the Rancher UI shows a Template field that is populated not with the value defined in their TF file, but with another VM template from vCenter.

It seems that Rancher takes the first template in the drop-down list. Most of the time this is not a problem, because every action they take on an existing cluster is done through Terraform, so the correct template is always used. But if, for any reason, they modify any parameter of the cluster through the Rancher UI without correcting the template value on each pool, Rancher will save the wrong template and proceed to recreate all the VMs of the cluster.

Even worse, they don't need to click the Save button to save the wrong template! In the 'Edit Config' interface, clicking 'Edit as YAML' also saves the change and starts recreating the VMs.

Business impact: High. The nodes/machines will be deleted and recreated.

vSphere config & Terraform setup available.

Proposed solution

rak-phillip commented 10 months ago

There are two issues associated with this bug, only one of which I'm able to reproduce at this point:

  1. Invalid Templates are selected when editing a VMware vSphere RKE2 Cluster (unable to reproduce, might be associated with Terraform or it's not an issue anymore)

  2. Clicking "Edit as YAML" will commit changes without the need to Save

rak-phillip commented 10 months ago

Converting this to an Epic because there are multiple issues described by this ticket

adventurousyeti commented 9 months ago

I encountered this issue as well while modifying the cron job for etcd. I used the UI to edit the config and it automatically selected a bad template for the control plane, the first one in the list. Since the previous template was not in the correct format in the datastore, Rancher tried to rebuild the VM. This in turn caused a provisioning storm in vSphere, as the template it had selected did not work. Since Rancher was unable to install on the selected template, the cluster lost stability. This led to a 3-day outage of the downstream cluster.

a-blender commented 9 months ago

This is based on discussion with @rak-phillip about the original issue: when a user clicks 'Edit Config' in the Rancher UI, the node Template is not populated with the template defined in their TF config.

Using the UI to modify a cluster that was originally created with Terraform is not necessarily a supported workflow. If a user creates a cluster via TF, it is expected that they will update the infrastructure by running TF apply. However, it is also expected that the Edit Config page in the UI shows the correct template data that the user defined via Terraform, so that part is a bug.

IMO this appears to be a UI issue where the UI is pulling the first vSphere node template from the backend server without considering TF template input. Either that, or TF is somehow not setting the template resource fields on the cluster management object correctly. This needs to be tested and verified.
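To make the suspected behavior concrete, here is a minimal sketch of how a form could resolve the template to display without silently falling back to the first entry in the list. It is not taken from the Rancher dashboard source: `resolveTemplate`, `TemplateSelection`, and the template paths are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch; these names do not come from the Rancher dashboard code.
interface TemplateSelection {
  value: string | null; // what the Template field should display
  error: string | null; // set when the saved value cannot be used as-is
}

function resolveTemplate(
  saved: string | undefined, // template stored on the machine pool, if any
  available: string[],       // templates currently listed by vCenter
  isCreate: boolean          // true only when a brand-new pool is being created
): TemplateSelection {
  // The saved value still exists in vCenter: show it unchanged.
  if (saved && available.includes(saved)) {
    return { value: saved, error: null };
  }

  // Defaulting to the first entry is only reasonable for a new pool.
  if (!saved && isCreate) {
    return { value: available.length > 0 ? available[0] : null, error: null };
  }

  // Existing pool whose template is missing or unset: keep what was saved
  // and report the problem instead of picking another template.
  const error = saved
    ? `Template "${saved}" was not found in vCenter`
    : 'No template is set for this machine pool';

  return { value: saved ?? null, error };
}
```

With this approach, a template that was renamed or removed in vCenter shows up as a validation error on the pool rather than being replaced by whichever template happens to be listed first.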

@rak-phillip Were you unable to reproduce on 2.7.5?

adventurousyeti commented 9 months ago

I'm using 2.7.6 right now with TF provider 3.2.0. The template is showing up in the UI right now. I can reproduce this issue in 2 ways.

  1. Having an incorrect template in my TF script. (say wrong name)
  2. Modification or removal of the template in vCenter. (Say packer is rebuilding the template)

I can also reproduce in my home lab by deploying a new cluster > renaming the vSphere template > editing the cluster config to view, say, etcd data.

a-blender commented 9 months ago

I would argue an incorrect template name in TF config is not a supported use case. But if a correct vSphere template is being modified or rebuilt by Packer in vCenter, do you then see the wrong template on the Edit Config page? Screenshots would be very helpful here :)

rak-phillip commented 9 months ago

@a-blender yes, we were able to repro and identify the issue reliably. We've made some minor enhancements to the form to prevent users from entering incorrect values and to warn them about the potential impact of changes, specifically:

We intend to follow up in a later release with more enhancements, but the changes in place will at least help increase awareness when there are potential errors in supplied data.
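As an illustration of the kind of impact warning discussed in this thread, the sketch below compares the template of each machine pool before and after an edit and lists the pools whose template changed, so a form could warn that those nodes would be recreated on save. This is an assumption about one possible approach, not a description of the changes that were actually made; `PoolTemplate`, `TemplateChange`, and `findTemplateChanges` are hypothetical names.

```typescript
// Hypothetical sketch; types and field names are illustrative only.
interface PoolTemplate {
  pool: string;     // machine pool name
  template: string; // vSphere template the pool clones from
}

interface TemplateChange {
  pool: string;
  from: string;
  to: string;
}

function findTemplateChanges(
  original: PoolTemplate[], // pools as loaded from the cluster
  edited: PoolTemplate[]    // pools as they currently stand in the form
): TemplateChange[] {
  // Index the original templates by pool name.
  const before = new Map<string, string>();
  for (const p of original) {
    before.set(p.pool, p.template);
  }

  // Collect every existing pool whose template differs from the original.
  const changes: TemplateChange[] = [];
  for (const p of edited) {
    const from = before.get(p.pool);
    if (from !== undefined && from !== p.template) {
      changes.push({ pool: p.pool, from, to: p.template });
    }
  }

  return changes;
}
```

A non-empty result could be shown to the user before saving, for example as a banner naming each affected pool along with its old and new template.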

adventurousyeti commented 9 months ago

In vCenter I have the following templates.

[Screenshot 2023-10-20 at 2 45 34 PM]

Using the highlighted template as the example, Rancher shows the correct template being selected.

[Screenshot 2023-10-20 at 2 48 21 PM]

Now I modify the name in vCenter.

[Screenshot 2023-10-20 at 2 45 47 PM]

Then when I go back to Rancher, the UI shows the following.

[Screenshot 2023-10-20 at 2 49 18 PM]

This is the first template from those available in vCenter. I scrolled past this when working on setting a cron schedule through the UI for etcd snapshots. Once I clicked save, it triggered my control plane to rebuild.

adventurousyeti commented 9 months ago

The above happened without using Edit as YAML. This example cluster was provisioned outside of TF, strictly through the Rancher UI.

a-blender commented 9 months ago

@rak-phillip Great, thank you!

@adventurousyeti Did you also update the name of the VM template in your TF config to Rancher_BM_PoC before trying to modify the cluster again?

a-blender commented 9 months ago

@adventurousyeti Got it, that is purely a UI issue and will be fixed by the UI team.

gaktive commented 8 months ago

@rak-phillip can help here as we transfer this ticket and related ones over.

gaktive commented 8 months ago

Possible backend ticket that's related (or blocks us): https://github.com/rancher/rancher/issues/41307