Closed JoelSpeed closed 3 years ago
I like this idea. Would it make sense to also indicate how many total nodes are being watched by the MHC or a count of the healthy nodes? This would give you a little more insight into what it's thinking. For example, if you're expecting the MaxUnavailable to be 3, but it's 0, it might not be clear why. If you also see that target nodes is 0, it's an indication your selector isn't working right.
Would it make sense to also indicate how many total nodes are being watched by the MHC or a count of the healthy nodes?
We have this already in the form of expectedMachines
and currentHealthy
Edit: Added an example in the PR description based on this comment, thanks
We have this already in the form of
expectedMachines
andcurrentHealthy
Edit: Added an example in the PR description based on this comment, thanks
Shows what I know ;) Perfect, that looks good to me.
/priority important-longterm /milestone v0.3.x /help
@vincepri: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help
command.
/assign
Maybe I'm over thinking it, but it's coming across a little confusing now:
spec:
maxUnhealthy: 40%
status:
expectedMachines: 8
currentHealthy: 6
allowedRemediations: 1
So, this shows I'm down 2 nodes already, but my allowedRemediations is 1. This is confusing. I think the allowedRemediations is actually 3.
Seems like we should change allowedRemediations to maxUnhealthy as a number, and add some messaging.
status:
expectedMachines: 8
currentHealthy: 6
maxUnhealthy: 3
remediationStatus:
enabled: true
reason: "node health within limits"
transitionTime: "00:00:00"
and
status:
expectedMachines: 8
currentHealthy: 4
maxUnhealthy: 3
remediationStatus:
enabled: false
reason: "node health exceeds limits, cannot remediate additional machines"
transitionTime: "00:00:00"
Ideally, if we had some mechanism to keep track of in-flight remediations, that would add to the clarity.
So, this shows I'm down 2 nodes already, but my allowedRemediations is 1. This is confusing. I think the allowedRemediations is actually 3.
This came from: 40% of 8 is 3.2, we round down to be safe, so 3 is the maximum number of unhealthy machines that are allowed, there are 2 already, so 1 further remediation is allowed. @vincepri had a similar comment https://github.com/kubernetes-sigs/cluster-api/pull/3372#discussion_r458226403 about the naming, so happy to revise it if we can decide on something. For context, I had taken inspiration from the allowedDisruptions
field of the PDB resource, so this should be a familiar concept to people who have used those because it works on the same logic
Seems like we should change allowedRemediations to maxUnhealthy as a number, and add some messaging.
In a lot of cases, are we not just reflecting the number from the spec then? Whenever someone sets a specific number this will be redundant
remediationStatus: enabled: false reason: "node health exceeds limits, cannot remediate additional machines" transitionTime: "00:00:00"
We could do this yeah, I might suggest we swap enabled
for blocked
so that the default case can be false
though, just thinking what happens if on creation with no value set, it would default to the false
and without a message it could be confusing
. For context, I had taken inspiration from the
allowedDisruptions
field of the PDB resource
Yeah, I realize that. But, while it's similar, I think they are slightly different. AllowedDisruptions implies allowing the user to do something disruptive, and if the number is too low, they can't. Whereas, AllowedRemediations is about fixing or un-disrupting.
I think some of it might just be the wording of "Allowed Remediations" but I can't really think of a better name. In particular, if the value is 0, that means we're at the max, which is okay AFAIK, which means remediations can take place, but is definitely confusing.
I might suggest we swap enabled for blocked
Yeah, I like this, it's more clear.
In particular, if the value is 0, that means we're at the max, which is okay AFAIK, which means remediations can take place, but is definitely confusing.
To add to confusion maybe(?), yes, I think so, if we are at 0 at the end of a reconciliation, that means we may have made some remediation during that reconcile. But on the next reconcile, if there are any new unhealthy machines, we won't be able to. If you observe a zero value, you shouldn't expect any more remediations until a machine has become healthy again.
If others like the remediationStatus
idea then I can look into implementing that when I'm back at the tail end of next week, I think it makes it clearer, just don't want to bowl ahead without a few acks
I think so, if we are at 0 at the end of a reconciliation, that means we may have made some remediation during that reconcile.
Right, which is another contrast with PDBs, PDBs reflect the state of 'as of right now, you cannot make any more disruptions' which I think is different from 'You're currently at 0, but we might be remediating one, or might remediate one of these broken ones in the future provided nothing else breaks.'
I don't think it's something we can clear up with just numbers, I like having the 'blocked' and 'reason' fields to provide some more concrete context to the user.
I don't think it's something we can clear up with just numbers, I like having the 'blocked' and 'reason' fields to provide some more concrete context to the user.
+1, seems like a good use case to get a condition in place
I'll look into updating the PR to use conditions in the next couple of days
/milestone v0.3.9
/milestone v0.3.10
/close the linked PR merged
@fabriziopandini: Closing this issue.
User Story
As a user I would like to know how many Machines an MHC thinks it can disrupt before short circuiting to increase confidence that the MHC will not break my cluster when it needs to make remediations.
Detailed Description
I think we should add a new field to the MHC status called
allowedRemediations
that will give users some insight into how the MHC is going to make decisions when it comes across an unhealthy machine.We already calculate the maximum number of unhealthy nodes in
isAllowedRemediation
, I would expect the value of this field to bemaxUnhealthy - unhealthy
, except when unhealthy is greater than maxUnhealthy in which case we would set it to 0, implying that short circuiting of the remediation will happen for any currently unhealthy, unremediated machines https://github.com/kubernetes-sigs/cluster-api/blob/b9bcc8c55bea0cad1b22cd872162d06c8687befe/controllers/machinehealthcheck_controller.go#L421-L434While this can be calculated from entries in the status already, I think it would be clearer for users if it were on the status for them (this would also prevent mismatching of expectations of the
maxUnhealthy
field since it would give a definitive answer for how many machines the MHC thinks it's allowed to remove)eg:
Anything else you would like to add:
Happy to raise a PR for this, but thought I should seek feedback first on this idea
/kind feature