kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Self healing control plane: Enable MHC to watch control plane machines #2836

Closed enxebre closed 4 years ago

enxebre commented 4 years ago

⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary, this is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Goals

  1. Enable MHC to watch control plane machines so faulty nodes can be auto repaired.

Non-Goals

  1. Introduce any control plane awareness logic in the MHC controller.
  2. Define unhealthy criteria / set node conditions.
  3. Recovering for scenarios where quorum would be violated.

User Story

As a user/operator I'd like my control plane machines/nodes to be self healing, i.e. gracefully deleted and recreated in safe scenarios where etcd quorum wouldn't be violated.

Detailed Description

The Control Plane Provider should be able to reconcile by removing the etcd member for voluntary machine disruptions, i.e. deletion. https://github.com/kubernetes-sigs/cluster-api/issues/2818

Currently MHC only remediates machines owned by a MachineSet. With the above in place we can relax this constraint to remediate any machine owned by a controller, i.e. the controlPlaneController https://github.com/kubernetes-sigs/cluster-api/pull/2828.

spec.maxUnhealthy should be leveraged to ensure remediation is not triggered in a quorum-disruptive manner, e.g. only 1 for a 3-machine control plane or 2 for a 5-machine control plane.

The Control Plane is a critical and sensitive component, where we want to favour control over flexibility. Therefore I propose introducing a new field EnableAutorepair in the KubeadmControlPlaneSpec which creates and owns an MHC resource and enforces the maxUnhealthy value appropriately based on the Control Plane size.
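A minimal sketch of what such a field could look like; the EnableAutorepair name comes from this proposal only, and the surrounding type is illustrative rather than the released API:

```go
package v1alpha3

// Sketch only: EnableAutorepair is the field proposed above and is not part of
// the released KubeadmControlPlaneSpec; the type here is illustrative.
type KubeadmControlPlaneSpecSketch struct {
	// Replicas is the desired number of control plane machines.
	Replicas *int32 `json:"replicas,omitempty"`

	// EnableAutorepair, when true, has the KCP controller create and own a
	// MachineHealthCheck for its Machines, with maxUnhealthy derived from
	// Replicas (1 for 3 replicas, 2 for 5) so remediation cannot break quorum.
	EnableAutorepair bool `json:"enableAutorepair,omitempty"`
}
```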

If we have general agreement on the approach the issue describes, we can then discuss putting any extra safety in place if we wish to prevent users from shooting themselves in the foot by creating out-of-band MHCs. E.g. we could let MHC remediate control plane machines only if the MHC is owned by the Control Plane.

Contract changes [optional]

[Describe contract changes between Cluster API controllers, if applicable.]

Data model changes [optional]

[Describe data model changes, if applicable.]

/kind proposal

enxebre commented 4 years ago

/area control-plane

vincepri commented 4 years ago

/milestone v0.3.x
/priority important-longterm

vincepri commented 4 years ago

I'd like to propose that for control plane nodes managed by a controller reference (e.g. KubeadmConfig) we annotate the Machine without deleting it.

To describe the flow in a little more detail:

The above introduces a new contract for control plane providers: they'll need to respect the annotation on each reconciliation loop and remediate on their own. On the other hand, it gives more power to the control plane controllers to control when and how to remediate their machines.
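A minimal sketch of this "mark, don't delete" idea, assuming a hypothetical annotation key and helper (illustrative only, not an agreed contract):

```go
package remediation

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical annotation key; the real contract would be settled in a proposal.
const unhealthyAnnotation = "cluster.x-k8s.io/mhc-unhealthy"

// markUnhealthy sketches what MHC could do instead of deleting: record that the
// Machine failed its checks and leave remediation to the owning controller
// (e.g. KCP), which reacts to the annotation in its own reconciliation loop.
func markUnhealthy(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[unhealthyAnnotation] = "true"
	return c.Patch(ctx, m, patch)
}
```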

Thoughts, questions, concerns?

vincepri commented 4 years ago

/milestone v0.3.4

vincepri commented 4 years ago

cc @ncdc @sedefsavas

detiber commented 4 years ago

@vincepri your comment got me thinking: what if, instead of modifying the MHC behavior based on the type of the controller reference, we modified the Machine controller to handle deletion differently in that case?

That would allow for blocking deletion behavior not only on MHC-initiated deletions, but any external deletions until the owning controller "allows" the deletion (possibly via annotation).

It might be good to extend that behavior to machine infrastructure providers as well, since they own the actual resource whose deletion is probably more important to block.

vincepri commented 4 years ago

@detiber Thinking through how that would work:

This would be interesting to pursue, depending how we implement it we can allow any generic owner-controller reference to act this way — although this would require changes to MachineSet controller as well. We could use a special finalizer here that KCP adds on creates, instead of a specific annotation, which would make the behavior opt-in instead.

One thing to consider is that now KCP will have to patch before issuing a delete on Machines.

detiber commented 4 years ago

This would be interesting to pursue, depending how we implement it we can allow any generic owner-controller reference to act this way — although this would require changes to MachineSet controller as well. We could use a special finalizer here that KCP adds on creates, instead of a specific annotation, which would make the behavior opt-in instead.

I do like the idea of making the feature opt-in rather than opt-out. I do worry a bit about trying to use a finalizer for this, though, since finalizers block the deletion of the underlying k8s resource from storage but do not block the reconciliation of the deletion by other controllers. If we follow that path, the deletion is likely to be too far along for us to attempt to maintain quorum in a recoverable way before it's too late.

We could block reconciliation of delete in the Machine (and InfraMachine) controller(s), but that would diverge from the more common semantic use of finalizers in Kubernetes.

We could keep the same behavior by applying an annotation to block deletion reconciliation during resource creation in KCP (and anywhere else that wants to implement the behavior) and have KCP remove the annotation to "allow" the deletion to proceed.

This wouldn't preclude the addition/removal of a finalizer, which might not be a bad thing to also add; I just worry about overloading the generally understood semantics of a finalizer for the complete implementation of the behavior we are potentially describing.
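A rough sketch of the opt-in flow being discussed, assuming a hypothetical annotation key (nothing here is existing API): KCP applies the annotation when it creates a Machine, and the Machine/InfraMachine controllers hold off reconciling a deletion until KCP removes it.

```go
package machine

import clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"

// Hypothetical annotation key for the opt-in hold behavior described above.
const preDeleteHoldAnnotation = "cluster.x-k8s.io/pre-delete-hold"

// deletionAllowed sketches the check the Machine (and InfraMachine) controllers
// could perform in their deletion flow: if the owning controller set the hold
// annotation at creation time, wait until the owner removes it to proceed.
func deletionAllowed(m *clusterv1.Machine) bool {
	_, held := m.Annotations[preDeleteHoldAnnotation]
	return !held
}
```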

vincepri commented 4 years ago

Opt-in annotation works for me.

@ncdc @enxebre thoughts?

JoelSpeed commented 4 years ago

I am +1 for the idea of this being opt-in rather than opt-out.

To me, using a finalizer seems like the natural route since they're intended to be used for special deletion logic. My idea was that the Machine controller could check that the finalizers list consists only of its own finalizer before proceeding with the deletion logic. That said, I have a vague feeling this could be bad and make for a fragile blocking situation in the future; there's a comment on the types which suggests to me someone else has tried something like this before (ordered finalizers), but I don't know how likely this situation would be to arise within the CAPI ecosystem.

If we use an opt-in annotation, is there a possibility that two controllers could want to set/remove that known annotation? Should it somehow be a list of annotations (prefixed name?) so that multiple controllers could block the deletion of the machine independently if required? (Again, this feels like a list of finalizers and checking for being the last finalizer, and I'm wondering if there's much of a difference.)

ncdc commented 4 years ago

In a given controller that we own, we can control if we respect the order and/or count of the finalizers in the list. But we can't assume that any other controller is doing the same. I think we could be reasonably assured that all controllers in cluster-api plus the infrastructure providers with which we're involved could do this, but there's no way to guarantee that an arbitrary bootstrap/control plane/infrastructure provider does follow the same rules.

My gut feeling is that it would be safer only to issue a delete when we truly intend to delete a thing. In other words, I think I'm in favor of Vince's idea of having MHC "mark" (annotate/label) that a Machine is unhealthy (oooh or maybe it could add to .status.conditions, once we have that?), and that the controller responsible for that Machine is the one that is expected to issue the delete.

Just my 2 cents... oh, and we also have #1514 which has some prior discussion in this area too.

vincepri commented 4 years ago

For v1alpha3, we should keep the scope as minimal as possible. Supporting this feature for control plane machines is a very specific use case which we can tackle, learn, and extend once we're ready to work on vNext.

That said, I'd like us to find a common solution if we want to tackle #1514 or #2846 in v0.3.x

enxebre commented 4 years ago

mm I'm not fully sold on why we wouldn't put a finaliser on machines to enforce that the KCP cleans up the etcd membership. That would effectively solve https://github.com/kubernetes-sigs/cluster-api/issues/2818 / https://github.com/kubernetes-sigs/cluster-api/issues/2759.

If we solved the above then this is orthogonal and transparent to MHCs. It would work consistently for any voluntary machine disruption. If I want to manually "remediate" my machine, I just kubectl delete it and I see the same consistent safety behaviour. The core goal of KCP is to let us stop treating these machines as pets, so as a consumer I can safely signal a deletion request if I choose to.

Otherwise starting with Vince's annotation workflow seems reasonable to me.

vincepri commented 4 years ago

If we go down the path of MHC annotating unhealthy Machines, I think we should change up the current implementation as well and keep it consistent. From a user point of view, it won't be a breaking change; the delete will be issued by another controller, but that's pretty much it.

I'd like to see a google doc proposal though, given that there are a lot of parts that need to be touched. Possibly, we need to include use cases from #1514 or #2846.

ncdc commented 4 years ago

mm I'm not fully sold on why we wouldn't put a finaliser on machines to enforce that the KCP cleans up the etcd membership.

Doesn't this require a coordinated order of operations across the KCP, Machine, and InfraMachine controllers though? Or can we do it without strict ordering?

enxebre commented 4 years ago

I suppose we could block the Machine and InfraMachine controllers from proceeding until this finalizer is removed. But I don't think that's even necessary, and we can probably do it without strict ordering:

The KCP will watch the deletion timestamp and trigger the "remove the etcd member" step immediately, while the node still needs to be drained.

Worst case scenario: if by any chance the infra goes away before "remove the etcd member", etcd would still be functional with just a degraded member for a minimal period of time. maxUnhealthy will prevent us from ever signalling a deletion that might cause loss of quorum.
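A sketch of that ordering, with a placeholder for the etcd call (helper names are hypothetical, not existing KCP code):

```go
package controlplane

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// reconcileDeletingMachine sketches the ordering described above: as soon as a
// control plane Machine carries a deletion timestamp, remove its etcd member,
// before the drain completes and before the infrastructure disappears.
func reconcileDeletingMachine(ctx context.Context, m *clusterv1.Machine) error {
	if m.DeletionTimestamp.IsZero() {
		return nil // nothing to do until a deletion has been requested
	}
	return removeEtcdMember(ctx, m)
}

// removeEtcdMember stands in for the call into the workload cluster's etcd API
// that removes the member backed by this machine.
func removeEtcdMember(ctx context.Context, m *clusterv1.Machine) error {
	return nil
}
```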

vincepri commented 4 years ago

KCP should create a new member before the Machine is deleted, though; we should give full control to the controller owner reference rather than issuing a delete and then blocking it.

detiber commented 4 years ago

Worst case scenario: if by any chance the infra goes away before "remove the etcd member", etcd would still be functional with just a degraded member for a minimal period of time. maxUnhealthy will prevent us from ever signalling a deletion that might cause loss of quorum.

Not if the number of deletions issued exceeds the amount that would cause quorum to be lost, or worse, includes the last member, which would cause complete data loss.

detiber commented 4 years ago

maxUnhealthy cannot necessarily be relied upon, since there is nothing to ensure that it is kept in sync with a value that would prevent the loss of quorum.

enxebre commented 4 years ago

maxUnhealthy cannot necessarily be relied upon, since there is nothing to ensure that it is kept in sync with a value that would prevent the loss of quorum.

In the issue description above I envision this MHC being owned by the KCP controller, which would enforce maxUnhealthy based on the KCP replicas field, i.e. for 3 replicas, maxUnhealthy 1, and so on. So we'd prevent users from shooting themselves in the foot with unsafe input values.
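The arithmetic behind those numbers, as a small illustrative helper (not existing code):

```go
package controlplane

// maxUnhealthyForReplicas illustrates the derivation described above: allow at
// most replicas minus quorum machines to be remediated, so that a majority of
// etcd members always remains (1 for 3 replicas, 2 for 5 replicas).
func maxUnhealthyForReplicas(replicas int) int {
	quorum := replicas/2 + 1
	return replicas - quorum
}
```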

enxebre commented 4 years ago

So to summarise I think my preference would be:

vincepri commented 4 years ago

For control plane nodes, I'd prefer to go down the path of the annotation approach as outlined in https://github.com/kubernetes-sigs/cluster-api/issues/2836#issuecomment-609932576 (and discuss doing the same for Machines in a MachineSet). This gives the controller owner the ability to apply any remediation steps it deems necessary before proceeding, and it separates the controllers' responsibilities even more.

If you're all +1 on it, we should code up the changes in the MHC controller first. After that, and after the current work on KCP is merged, we follow up with changes to KCP to handle the annotation properly.

ncdc commented 4 years ago

+1 to annotation approach.

JoelSpeed commented 4 years ago

I think using annotations will make sense for the control plane as it needs the special handling for scaling down and back up.

As for annotations in general, I have some thoughts. Provided the Machine is guaranteed to be deleted, they're ok; it's essentially just a proxy for the delete request. My concern with annotations in the general sense comes from scenarios where the Machine is not going to be deleted (baremetal?). It becomes hard to define when the machine has been remediated. Who's responsible for removing the annotation from the Machine? And when? And then what prevents MHC immediately marking it unhealthy again? You kind of have to wipe the status from the Machine/reset it if you will to allow the MHC to treat it as a new Machine, but is wiping the status desirable in a scenario like that?

vincepri commented 4 years ago

Who's responsible for removing the annotation from the Machine?

I'd expect MHC to keep checking a Machine and remove the annotation if it goes back to healthy

And then what prevents MHC immediately marking it unhealthy again?

Nothing should prevent this imo

You kind of have to wipe the status from the Machine/reset it if you will to allow the MHC to treat it as a new Machine, but is wiping the status desirable in a scenario like that?

MHC shouldn't know/understand that a Machine has been reset; I would expect an external reconciler to do things out of band, outside core functionality. For example, if an external controller changes the NodeRef on the Machine's status, that's not a supported behavior from a CAPI perspective, but MHC shouldn't care what's in the value; it should solely read it and do the necessary health checks.

I think it'd be great to discuss this in a little more detail in tomorrow's meeting.

enxebre commented 4 years ago

For control plane nodes, I'd prefer to go down the path of the annotation approach as outlined in #2836 (comment) (and discuss doing the same for Machines in a MachineSet). This gives the controller owner the ability to apply any remediation steps it deems necessary before proceeding, and it separates the controllers' responsibilities even more.

MHC is intentionally simple today and based on the principle that machines are "cattle". It's up to each controller (KCP, machineSet...) to react gracefully and protectively if they choose to when a MHC or something else requests a machine deletion.

I think consumers of the machine semantics shouldn't care about this in their designs. In the same way that they don't have to care with pods, they just request a machine deletion and we just honour PDBs.

Is there any particular flaw/concern you see with this approach https://github.com/kubernetes-sigs/cluster-api/issues/2836#issuecomment-611067632 for this scenario?

I see room for flexibility and pluggable remediation strategies in certain environments. In particular in a given baremetal environment a reboot might be preferred over deletion.

For articulating and figuring out the best mechanism to allow this pluggability (annotations, API, CRD...), I'd expect an RFE extending the original MHC one so we can flesh out the motivations, e2e details and workflow. If we want remediation behaviour to deviate for pools of machines, e.g. control plane machines, we should probably consider it just another use case in that wider discussion.

vincepri commented 4 years ago

MHC is intentionally simple today and based on the principle that machines are "cattle". It's up to each controller (KCP, machineSet...) to react gracefully and protectively if they choose to when a MHC or something else requests a machine deletion.

Annotations wouldn't change that; it would really only separate the responsibilities of a health check and the remediation process.

Is there any particular flaw/concern you see with this approach #2836 (comment) for this scenario?

"Add ability to KCP controller to watch a deletion timestamp and remove the etcd member" this isn't far from what we'll have today in #2841. The biggest drawback of this approach is that we cannot control when the underlying infrastructure is getting deleted, and we cannot scale up before scaling down.

I see room for flexibility and pluggable remediation strategies in certain environments. In particular in a given baremetal environment a reboot might be preferred over deletion.

See https://github.com/kubernetes-sigs/cluster-api/issues/2846

For articulating and figuring out the best mechanism to allow this pluggability (annotations, API, CRD...), I'd expect an RFE extending the original MHC one so we can flesh out the motivations, e2e details and workflow. If we want remediation behaviour to deviate for pools of machines, e.g. control plane machines, we should probably consider it just another use case in that wider discussion.

If we can reach consensus on annotations and separation of responsibilities, we can follow up and amend the proposal.

enxebre commented 4 years ago

The biggest drawback of this approach is that we cannot control when the underlying infrastructure is getting deleted, and we cannot scale up before scaling down.

An MHC managed by the KCP, enforcing the maxUnhealthy value based on current replicas, would protect you from shooting yourself in the foot; that's why we have the maxUnhealthy semantic.

it would really only separate the responsibilities of a health check and the remediation process.

Cool, I'm all good with that. MHC should be just an integration point between node problem detector tooling and remediation strategies (if there's a need for them). I just think we should make sure we flesh out the details in a doc before adding any complexity and make sure we don't degrade our UX.

ncdc commented 4 years ago

An MHC managed by the KCP, enforcing the maxUnhealthy value based on current replicas, would protect you from shooting yourself in the foot; that's why we have the maxUnhealthy semantic.

KCP has to scale up first, then delete. Can you help me understand how MHC allows for this?

If we separate the concerns, and have MHC only mark, and some other thing(s) remediate (sweep), I think we'll be able to make everyone happy. WDYT?

n1r1 commented 4 years ago

Who's responsible for removing the annotation from the Machine?

I'd expect MHC to keep checking a Machine and remove the annotation if it goes back to healthy

It's a bit tricky. After remediation, we end up with a healthy Machine with an unhealthy annotation, and now we have a race: MHC will want to remove the unhealthy annotation, but the remediation controller will also have to avoid re-remediating it (i.e. a remediation loop).

It could be avoided by having the remediation controller add an extra annotation to differentiate between an unhealthy Machine and an unhealthy Machine that just got remediated. It's still not ideal, since the remediation controller might want to retry remediation (or try different remediation actions) if MHC still thinks that the remediated Machine is unhealthy, but since the remediation controller has no knowledge of the unhealthy conditions (and doesn't want to duplicate MHC logic), it doesn't know whether MHC thinks that remediation has failed or not.

fabriziopandini commented 4 years ago

My two cents: I'm +1 for annotations, and possibly with annotations both when the machine starts to fail an MHC check and when the machine is finally scheduled for deletion.

The "machine is starting to fail a check" annotation is something that might not be relevant for this issue, but it could be a super valuable source of information for implementing conditions on machines.

benmoss commented 4 years ago

Seems like we need a separate state of "remediation applied" that KCP would apply if we want to handle a machine going from unhealthy to "maybe healthy" and have the MHC be the one who decides that it's fully healthy again

vincepri commented 4 years ago

This sounds more like a matter of configuration options in MHC and timeouts, rather than having a full state machine.

Understanding that remediation has been applied is something that the remediating controller needs to track, not a health check.

JoelSpeed commented 4 years ago

I think the problem to solve is how we health check / mark unhealthy / remediate in an idempotent way; as @n1r1 says, we don't want to get into a loop. Currently, MHC signals remediation by deleting the machine, which is irreversible. Adding and removing annotations as the health status changes rather changes the current behaviour.

My concern would be that MHC marks the machine unhealthy with an annotation, the remediating controller starts the remediation process, e.g. for baremetal an IPMI reboot of the machine, and then it somehow needs to know it's already remediated so that it doesn't re-remediate?

Does it remove the annotation? If it does, MHC will probably put it straight back (unless it clears the Machine status completely and re-triggers the new node timeout (nodeStartupTimeout))

Does it add another annotation to say it's remediated? If so, it then needs to remove it again once the machine goes healthy? What if it never goes healthy?

Perhaps MHC needs to have a timeout between remediations: a lastRemediation timestamp, so that it waits for a period between any remediations. The remediation controller can do its remediation, remove the annotation, and then, after this timeout (15 mins from lastRemediation?), the MHC would check whether it's healthy or not again. I think that gives the remediating controller some way to do its work idempotently? (Another thought: lastRemediation could also be an annotation on each machine so that individual machines are only marked for remediation every 15 minutes or so.)

I am concerned this is going to end up being a bit state machiney

I don't think any of this really applies to the KCP model if it's always going to remove the machines, but that won't necessarily be the case if KCP is on baremetal, will it?
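A rough sketch of the lastRemediation back-off floated above; the annotation key and timestamp format are assumptions, not an agreed design:

```go
package remediation

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// Hypothetical annotation key; the window would presumably be configurable on the MHC.
const lastRemediationAnnotation = "cluster.x-k8s.io/last-remediation"

// remediationAllowed sketches the back-off: skip a machine that was remediated
// less than `window` ago, giving the previous remediation time to take effect.
func remediationAllowed(m *clusterv1.Machine, window time.Duration) bool {
	ts, ok := m.Annotations[lastRemediationAnnotation]
	if !ok {
		return true
	}
	last, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		return true // unparseable timestamp: don't block remediation forever
	}
	return time.Since(last) >= window
}
```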

jan-est commented 4 years ago

We have published a MachineHealthCheck enhancement proposal in Google Docs. The document is now open to everyone's comments: CAPI MachineHealthCheck enhancement.

vincepri commented 4 years ago

Moving this out of v0.3.4 for now, if we get the proposal merged in this cycle, we can target the full support in the next version.

/milestone v0.3.x

vincepri commented 4 years ago

/assign @benmoss

vincepri commented 4 years ago

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

k8s-ci-robot commented 4 years ago

@vincepri: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/2836#issuecomment-620116745):

> As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.
>
> Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.