[rke1] user is able to scale dedicated `etcd` machine pool down to `0` using the `...` option

slickwarren commented 2 years ago

Setup

Rancher version: v2.6.5-rc5
Browser type & version: chrome

Describe the bug UI button is greyed out, however the ... menu for the node allows a user to scale down any node to 0

To Reproduce

deploy an rke1 cluster, with dedicated nodes-per-role (i.e. 1 pool per role)
once the cluster is active, select the cluster from the cluster management page and click the ... option next to a node.
click the option to scale down on the etcd node

Result etcd node scales down, and cluster is down

Expected Result user should not be able to scale etcd, controlplane to 0

Screenshots

Additional context this is not an option on rke2 clusters.

richard-cox commented 2 years ago

The general work to prevent bad scale down states (see https://github.com/rancher/dashboard/issues/5454#issuecomment-1106688959) has been a relatively recent implementation in RKE2 land. I'm a little hesitant to try to implement the same process to RKE1 in the 2.6.5 timeframe. We need to understand if the same principles apply and update some interesting code.

richard-cox commented 2 years ago

There's a lot required to port this RKE2 feature to RKE1 (for RKE2 spec see https://github.com/rancher/dashboard/issues/5454#issuecomment-1106688959)

understand if the same rules as described in the comment apply to RKE1
understand how the existing cluster management restrictions are done and how they align with spec from above (when creating an RKE1 cluster in cluster manager it blocks creation until certain node type criteria is met)
handle the three places (scale down multiple instances, scale down single instances, scale down node pools)

I've updated the size to reflect this and to cover dev testing.

@nwmac @gaktive I've unassigned myself given commitments to epinio, plugins and PR reviews in the 2.6.6 time frame.

catherineluse commented 1 year ago

@gaktive @richard-cox I suggest closing this issue because according to this comment https://github.com/rancher/rancher/issues/36984#issuecomment-1256501708, scaling a pool down to 0 should be a supported use case, even if it does mean that the user could obliterate the cluster.

snasovich commented 1 year ago

@catherineluse @gaktive @richard-cox The general approach is to prevent only the most obviously wrong actions/changes on the back-end side. If there is any doubt that there is absolutely 0 chance of some customer requiring to perform such an action as part of their flow, we do not want to implement hard validation on the back-end side as it would prevent that action from getting completed by any means (including UI, Terraform, Rancher API call and even direct manipulation of k8s objects using tools like kubectl).

And since it's pretty much impossible to ensure adding such validation won't break some customers, we err on the side of safe for introducing such absolute blocks on back-end side and shift it to UI as its purpose is to guide the users and help prevent user errors.

So, given the above, I think we still want to pursue this change, along with https://github.com/rancher/dashboard/issues/6763

momesgin commented 9 months ago

@nwmac @kwwii while working on this issue, several questions came into my mind,

In the image above, you can see that we also have this "Scale Down" button that enables scaling down to zero that needs to be covered too.

Another interesting scenario is when users select multiple nodes:

This allows them to scale down all the selected nodes to zero. What UX/UI solution would be most effective in this case?

From my understanding, scaling down to zero is essentially equivalent to deleting the last node. Is this correct? If so, should we also consider a solution for deleting the last node?

richard-cox commented 9 months ago

@momesgin We have the restrictions and process already in place for rke2, which ensures that scaling down a machine pool, individual instance or multiple instances adheres to those restrictions. For instance if the user uses the bulk scale down button it ensures that there's at least one control plane, one etcd, etc and warns the user what will happen in a modal.

For this issue you'll need to go through the steps referenced in https://github.com/rancher/dashboard/issues/5774#issuecomment-1139392585.

momesgin commented 9 months ago

thanks @richard-cox, I found the slack conversion re RKE2, I'll ask the same question on slack but for RKE1

momesgin commented 9 months ago

Hello @snasovich, I've asked this question in our Slack channels, but unfortunately, I couldn't find an answer. We want to ensure if the same rules for RKE2 apply to RKE1 before implementing them. The following are currently valid for RKE2:

Scaling down to zero workers
Scaling down to one control plane
Scaling down to one etcd
Scaling down to zero machines in a deployment (as long as the above rules aren't invalidated)

Could you please provide some guidance on whether these rules apply to RKE1 as well? I would appreciate your help 🙏

snasovich commented 9 months ago

@momesgin , I can't come up with all the rules off top of my head without testing. But at a glance the rules you mentioned make sense. What do you mean by "Scaling down to zero machines in a deployment (as long as the above rules aren't invalidated)", I'm not quite following it. Most specifically the "deployment" word and what it relates to in this context.

momesgin commented 9 months ago

Thanks @snasovich for your reply. Regarding the last confusing rule, when I was talking to @richard-cox about this issue he mentioned that its equivalent rule for the RKE1 should be "Scaling down to zero nodes in a node pool (as long as the above rules aren't invalidated)".

slickwarren commented 9 months ago

If this helps, it sounds like the last bullet is just a catch-all. So as long as there will be at least 1 etcd and 1 cp role anywhere in the cluster, any pool would be able to scale to 0. i.e. a cluster with 1 node etcd role 1 node cp role 1 node all roles 1 node etcd+controlplane node

you could do one of the following options:

scale all down to zero except 'all roles' node
scale all down to zero except 'etcd+controlplane' node
scale all roles and etcd+controlplane node to zero, leaving the dedicated etcd and control plane nodes at 1

I think it would be more clear to say a single rule:

at least 1 control plane and 1 etcd role must be present in a cluster at any given time

This way it is more clear that clusters with multiple pools can still scale to 0 as long as there is still 1 etcd role and 1 controlplane role in the cluster.

richard-cox commented 9 months ago

The last point regarding deployments was added to confirm that deployments with zero machines were valid.

In RKE1 world we just need to confirm the same, that pools with no nodes are valid (as long as the other rules aren't invalidated). Unless @snasovich objects we'll assume that's fine

mantis-toboggan-md commented 1 week ago

I have validated this usse by testing rke1 clusters with:

1 dedicated etcd node pool 1 all-role pool Multiple pools with etcd role

rancher / dashboard

[rke1] user is able to scale dedicated `etcd` machine pool down to `0` using the `...` option #5774