kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Add a way to refresh one machine only (with scale out) #9992

Open koba1t opened 10 months ago

koba1t commented 10 months ago

What would you like to be added (User Story)?

I need a feature to refresh one machine without restarting all nodes. Currently, the MachineDeployment controller only provides a cluster-wide rolling update operation. We can recreate one machine by deleting its Machine resource, but that operation temporarily reduces the total computing capacity of the cluster.

Sometimes a node becomes unstable, and cluster admins restart or recreate that node to resolve the problem. We don’t want to restart/recreate all nodes at once because that takes longer to complete and makes application performance unstable.

Detailed Description

Add a way to bring up a new machine before actually terminating the old one: we need a means to remove a machine only after an equivalent replacement machine is running. Our idea is to define a new annotation, such as cluster.x-k8s.io/refresh, that refreshes a single machine when the annotation is added to its Machine resource. https://cluster-api.sigs.k8s.io/reference/labels_and_annotations
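A rough sketch of how the proposed annotation might look on a Machine manifest. Note that cluster.x-k8s.io/refresh does not exist in Cluster API today; the annotation name, resource names, and behavior described in the comments are only an illustration of the proposal:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-md-7d9f8c-abcde          # hypothetical Machine owned by a MachineSet
  namespace: default
  annotations:
    # Proposed (not yet existing) annotation: the owning controller would first
    # create a replacement Machine and delete this one only after the new
    # Machine is ready, so cluster capacity never dips.
    cluster.x-k8s.io/refresh: ""
spec:
  clusterName: my-cluster           # illustrative cluster name
  # bootstrap and infrastructureRef omitted for brevity
```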

Anything else you would like to add?

We could also achieve this goal by implementing such logic on our side, without introducing additional logic into Cluster API itself.

It may be related to this request: https://github.com/kubernetes-sigs/cluster-api/issues/1808. I’ll write an enhancement proposal if you think one is needed.

Label(s) to be applied

/kind feature /area machine

k8s-ci-robot commented 10 months ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
koba1t commented 10 months ago

/cc @musaprg

fabriziopandini commented 10 months ago

> Sometimes a node becomes unstable

Can you specify in more detail what this means / how this condition shows up? I'm asking because if MHC can automatically detect this condition, then users can benefit from everything the remediation flow already supports, e.g. the maxUnhealthy budget.
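For reference, a minimal MachineHealthCheck with a maxUnhealthy budget might look like the following; the names, selector, and thresholds are illustrative only:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster-worker-unhealthy-5m   # illustrative name
  namespace: default
spec:
  clusterName: my-cluster                # illustrative cluster name
  # Remediation is skipped if more than 40% of matching machines are unhealthy.
  maxUnhealthy: 40%
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: my-md   # illustrative MachineDeployment
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```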

> We can recreate one machine by deleting its Machine resource, but that operation temporarily reduces the total computing capacity of the cluster. ... Add a way to bring up a new machine before actually terminating the old one.

This is interesting; it would be a nice remediation strategy to have for MachineDeployments (and probably for MachinePools as well). However, it might be somewhat tricky due to how MachineDeployment and MachineSet work, but I did not check the code.

Let's also see if someone else is interested in this idea.

koba1t commented 9 months ago

> Can you specify in more detail what this means / how this condition shows up? I'm asking because if MHC can automatically detect this condition, then users can benefit from everything the remediation flow already supports, e.g. the maxUnhealthy budget.

In our scenario, the node appears healthy, but underlying issues affect the application, such as network latency and performance problems. We operate clusters on OpenStack using on-prem hypervisors. Occasionally, the problem may be attributed to a virtual machine or hypervisor issue. In such cases, the cluster node metrics indicate the node is healthy, so the cluster operator manually restarts the node to resolve the underlying problem. For instance, the physical hypervisor may appear healthy despite underlying issues, or there could be problems with a daemon on the Linux node, particularly on GPU nodes.

nabokihms commented 9 months ago

We have the same feature request. Two use cases:

  1. Sometimes the state of a machine becomes dirty because of installed packages, network settings, etc.
  2. In exceptional cases, engineers want to restart all machines one by one by hand and control what is happening with their applications on the nodes.

I opened #10027 today, but it was closed as a duplicate. The proposed solution was slightly different from the one in the opening message.

The proposal is to make Machine deletion behave like Pod deletion. When rolling update settings are configured on a Pod controller (e.g. a Deployment or DaemonSet) and a Pod is deleted manually, kube-controller-manager first creates an additional Pod and then deletes the old one.

My idea is to make capi-controller act the same way.

  1. Propagate the rolling update settings from the MachineSet down to the Machine controller (see the sketch below for the kind of settings involved).
  2. Change the reconciliation logic for Machine deletion events so that a replacement Machine is created before the old one is removed.
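For context, these are the existing MachineDeployment rolling update settings that step 1 refers to. This is only a sketch, assuming the same settings could be honored when a single Machine is deleted; all names and values are illustrative:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md                 # illustrative name
  namespace: default
spec:
  clusterName: my-cluster     # illustrative cluster name
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # With maxSurge: 1 and maxUnavailable: 0, a rollout creates one extra
      # Machine first and removes an old one only after the new one is ready.
      # The proposal would reuse these settings when a single Machine is
      # deleted, so capacity never drops below the desired replica count.
      maxSurge: 1
      maxUnavailable: 0
  # selector and template omitted for brevity
```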
fabriziopandini commented 7 months ago

/priority backlog

k8s-ci-robot commented 6 months ago

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

koba1t commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

koba1t commented 4 weeks ago

/remove-lifecycle stale