kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Add `allowedRemediations` field to MHC Status #3371

Closed: JoelSpeed closed this issue 3 years ago

JoelSpeed commented 4 years ago

User Story

As a user, I would like to know how many Machines an MHC thinks it can disrupt before short-circuiting, to increase confidence that the MHC will not break my cluster when it needs to make remediations.

Detailed Description

I think we should add a new field to the MHC status called `allowedRemediations` that will give users some insight into how the MHC is going to make decisions when it comes across an unhealthy machine.

We already calculate the maximum number of unhealthy nodes in `isAllowedRemediation`. I would expect the value of this field to be `maxUnhealthy - unhealthy`, except when `unhealthy` is greater than `maxUnhealthy`, in which case we would set it to 0, implying that remediation will be short-circuited for any currently unhealthy, unremediated machines: https://github.com/kubernetes-sigs/cluster-api/blob/b9bcc8c55bea0cad1b22cd872162d06c8687befe/controllers/machinehealthcheck_controller.go#L421-L434

While this can be calculated from entries already in the status, I think it would be clearer for users if it were on the status for them (this would also prevent mismatched expectations around the `maxUnhealthy` field, since it would give a definitive answer for how many machines the MHC thinks it is allowed to remove).

e.g.:

spec:
  maxUnhealthy: 40%
status:
  expectedMachines: 8
  currentHealthy: 6
  allowedRemediations: 1
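
For illustration, a minimal Go sketch of the calculation described above, assuming maxUnhealthy has already been resolved to an absolute number; the function name is hypothetical and this is not the actual controller code:

package mhc

// allowedRemediations mirrors the logic described above: how many more
// machines the MHC could remediate before it would start short-circuiting.
func allowedRemediations(maxUnhealthy, expectedMachines, currentHealthy int) int {
    unhealthy := expectedMachines - currentHealthy
    if unhealthy >= maxUnhealthy {
        // Already at or over the limit; any further remediation would be short-circuited.
        return 0
    }
    // e.g. maxUnhealthy=3, expectedMachines=8, currentHealthy=6 -> 1
    return maxUnhealthy - unhealthy
}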

Anything else you would like to add:

Happy to raise a PR for this, but I thought I should seek feedback on the idea first.

/kind feature

michaelgugino commented 4 years ago

I like this idea. Would it make sense to also indicate how many total nodes are being watched by the MHC, or a count of the healthy nodes? This would give you a little more insight into what it's thinking. For example, if you're expecting the MaxUnavailable to be 3, but it's 0, it might not be clear why. If you also see that the target node count is 0, it's an indication your selector isn't working right.

JoelSpeed commented 4 years ago

Would it make sense to also indicate how many total nodes are being watched by the MHC or a count of the healthy nodes?

We have this already in the form of expectedMachines and currentHealthy

Edit: Added an example in the PR description based on this comment, thanks

michaelgugino commented 4 years ago

We have this already in the form of expectedMachines and currentHealthy

Edit: Added an example in the PR description based on this comment, thanks

Shows what I know ;) Perfect, that looks good to me.

vincepri commented 4 years ago

/priority important-longterm
/milestone v0.3.x
/help

k8s-ci-robot commented 4 years ago

@vincepri: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3371):

> /priority important-longterm
> /milestone v0.3.x
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

JoelSpeed commented 4 years ago

/assign

michaelgugino commented 4 years ago

Maybe I'm overthinking it, but it's coming across as a little confusing now:

spec:
  maxUnhealthy: 40%
status:
  expectedMachines: 8
  currentHealthy: 6
  allowedRemediations: 1

So, this shows I'm down 2 nodes already, but my allowedRemediations is 1. This is confusing. I think the allowedRemediations is actually 3.

Seems like we should change allowedRemediations to maxUnhealthy as a number, and add some messaging.

status:
  expectedMachines: 8
  currentHealthy: 6
  maxUnhealthy: 3
  remediationStatus:
    enabled: true
    reason: "node health within limits"
    transitionTime: "00:00:00"

and

status:
  expectedMachines: 8
  currentHealthy: 4
  maxUnhealthy: 3
  remediationStatus:
    enabled: false
    reason: "node health exceeds limits, cannot remediate additional machines"
    transitionTime: "00:00:00"

Ideally, if we had some mechanism to keep track of in-flight remediations, that would add to the clarity.

JoelSpeed commented 4 years ago

So, this shows I'm down 2 nodes already, but my allowedRemediations is 1. This is confusing. I think the allowedRemediations is actually 3.

This came from: 40% of 8 is 3.2, and we round down to be safe, so 3 is the maximum number of unhealthy machines allowed; there are 2 already, so 1 further remediation is allowed. @vincepri had a similar comment about the naming (https://github.com/kubernetes-sigs/cluster-api/pull/3372#discussion_r458226403), so I'm happy to revise it if we can decide on something. For context, I had taken inspiration from the allowedDisruptions field of the PDB resource, so this should be a familiar concept to people who have used those, because it works on the same logic.
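
To make the arithmetic concrete, here is a small Go sketch using intstr from apimachinery, assuming the percentage is resolved with round-down as described; this is an illustration rather than the controller's actual code:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
    maxUnhealthy := intstr.FromString("40%")
    expectedMachines := 8

    // 40% of 8 = 3.2; roundUp=false rounds down to 3.
    maxUnhealthyCount, err := intstr.GetValueFromIntOrPercent(&maxUnhealthy, expectedMachines, false)
    if err != nil {
        panic(err)
    }

    currentHealthy := 6
    unhealthy := expectedMachines - currentHealthy // 2 machines are already unhealthy

    fmt.Println(maxUnhealthyCount - unhealthy) // prints 1: one further remediation is allowed
}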

Seems like we should change allowedRemediations to maxUnhealthy as a number, and add some messaging.

In a lot of cases, are we not just reflecting the number from the spec then? Whenever someone sets a specific number, this will be redundant.

 remediationStatus:
   enabled: false
   reason: "node health exceeds limits, cannot remediate additional machines"
   transitionTime: "00:00:00"

We could do this, yeah. I might suggest we swap enabled for blocked so that the default case can be false, though. I'm just thinking about what happens on creation with no value set: it would default to false, and without a message that could be confusing.
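
To illustrate the shape being discussed, a rough Go sketch of what such a remediationStatus could look like, with blocked instead of enabled so that the zero value is the safe default; the type and field names here are hypothetical, not what a PR would necessarily use:

package v1alpha3

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineHealthCheckRemediationStatus is a hypothetical shape for the
// remediationStatus idea discussed above.
type MachineHealthCheckRemediationStatus struct {
    // Blocked is true when further remediations would be short-circuited;
    // the zero value (false) is the sensible default on creation.
    Blocked bool `json:"blocked"`

    // Reason explains why remediation is or is not currently blocked.
    Reason string `json:"reason,omitempty"`

    // TransitionTime records when Blocked last changed.
    TransitionTime *metav1.Time `json:"transitionTime,omitempty"`
}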

michaelgugino commented 4 years ago

For context, I had taken inspiration from the allowedDisruptions field of the PDB resource

Yeah, I realize that. But while it's similar, I think they are slightly different. AllowedDisruptions implies allowing the user to do something disruptive, and if the number is too low, they can't. Whereas AllowedRemediations is about fixing, or un-disrupting.

I think some of it might just be the wording of "Allowed Remediations", but I can't really think of a better name. In particular, if the value is 0, that means we're at the max, which is okay AFAIK, which means remediations can take place, but is definitely confusing.

I might suggest we swap enabled for blocked

Yeah, I like this, it's more clear.

JoelSpeed commented 4 years ago

In particular, if the value is 0, that means we're at the max, which is okay AFAIK, which means remediations can take place, but is definitely confusing.

To add to the confusion, maybe(?): yes, I think so, if we are at 0 at the end of a reconciliation, that means we may have made some remediation during that reconcile. But on the next reconcile, if there are any new unhealthy machines, we won't be able to remediate them. If you observe a zero value, you shouldn't expect any more remediations until a machine has become healthy again.

If others like the remediationStatus idea then I can look into implementing it when I'm back at the tail end of next week. I think it makes things clearer; I just don't want to bowl ahead without a few acks.

michaelgugino commented 4 years ago

I think so, if we are at 0 at the end of a reconciliation, that means we may have made some remediation during that reconcile.

Right, which is another contrast with PDBs: PDBs reflect the state of 'as of right now, you cannot make any more disruptions', which I think is different from 'you're currently at 0, but we might be remediating one, or might remediate one of these broken ones in the future provided nothing else breaks.'

I don't think it's something we can clear up with just numbers; I like having the 'blocked' and 'reason' fields to provide some more concrete context to the user.

vincepri commented 4 years ago

I don't think it's something we can clear up with just numbers; I like having the 'blocked' and 'reason' fields to provide some more concrete context to the user.

+1, seems like a good use case to get a condition in place
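
For reference, a rough sketch of what the condition-based approach could look like with the helpers in sigs.k8s.io/cluster-api/util/conditions; the condition type, reason, and message are illustrative placeholders, not necessarily the names the linked PR ends up with:

package controllers

import (
    clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
    "sigs.k8s.io/cluster-api/util/conditions"
)

// setRemediationAllowedCondition sketches how the controller could surface the
// short-circuit decision as a condition on the MachineHealthCheck object,
// assuming MachineHealthCheck implements the conditions Getter/Setter.
func setRemediationAllowedCondition(mhc *clusterv1.MachineHealthCheck, unhealthy, maxUnhealthy int) {
    if unhealthy <= maxUnhealthy {
        conditions.MarkTrue(mhc, "RemediationAllowed")
        return
    }
    conditions.MarkFalse(mhc, "RemediationAllowed", "TooManyUnhealthy",
        clusterv1.ConditionSeverityWarning,
        "remediation is blocked: %d unhealthy machine(s) exceeds maxUnhealthy of %d",
        unhealthy, maxUnhealthy)
}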

JoelSpeed commented 4 years ago

I'll look into updating the PR to use conditions in the next couple of days

vincepri commented 4 years ago

/milestone v0.3.9

vincepri commented 4 years ago

/milestone v0.3.10

fabriziopandini commented 3 years ago

/close
the linked PR merged

k8s-ci-robot commented 3 years ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3371#issuecomment-716620088):

> /close
> the linked PR merged

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.