medik8s / node-healthcheck-operator

K8s Node Health Check Operator
Apache License 2.0

Fix handling unhealthy condition change #274

Closed slintes closed 9 months ago

slintes commented 9 months ago

Don't delete remediation CRs for nodes that soon match unhealthy conditions: they might just have switched from one unhealthy condition to another, and the timeout of the new condition hasn't expired yet (e.g. from Ready=Unknown to Ready=False).
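
A minimal sketch of the decision described here, assuming simplified stand-in types rather than the operator's actual helpers (the `UnhealthyCondition` struct and `shouldKeepRemediationCR` function below are illustrative only):

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// UnhealthyCondition is an illustrative stand-in for a configured unhealthy
// condition (type, status, duration); it is not the operator's actual type.
type UnhealthyCondition struct {
	Type     corev1.NodeConditionType
	Status   corev1.ConditionStatus
	Duration time.Duration
}

// shouldKeepRemediationCR sketches the intended behavior: keep the existing
// remediation CR as long as any node condition matches the type and status of
// a configured unhealthy condition, even if that condition's duration has not
// expired yet ("soon matching"). The node may just have switched from one
// unhealthy condition to another, e.g. Ready=Unknown -> Ready=False.
func shouldKeepRemediationCR(nodeConditions []corev1.NodeCondition, unhealthyConditions []UnhealthyCondition) bool {
	for _, nc := range nodeConditions {
		for _, uc := range unhealthyConditions {
			// A type and status match is enough; do not require the duration
			// to have expired before deciding to keep the CR.
			if nc.Type == uc.Type && nc.Status == uc.Status {
				return true
			}
		}
	}
	return false
}
```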

Will add a test after PTO

ECOPROJECT-1782

openshift-ci[bot] commented 9 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] commented 9 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/medik8s/node-healthcheck-operator/blob/main/OWNERS)~~ [slintes]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

slintes commented 9 months ago

/test 4.14-openshift-e2e

razo7 commented 9 months ago

/hold

We can add a log message (and an event) for the soonMatchingNodes, so a user would know what is happening to the node that was unhealthy.

/lgtm
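
For illustration, such a log line and event could look roughly like the sketch below, using controller-runtime/client-go style logging and events; the `notifySoonMatchingNode` helper and the `RemediationNotDeleted` reason string are made up here, not something the operator defines:

```go
package sketch

import (
	"github.com/go-logr/logr"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// notifySoonMatchingNode emits a log line and a Kubernetes event so users can
// see why NHC keeps the remediation CR for a node that switched from one
// unhealthy condition to another. Reason and message wording are illustrative.
func notifySoonMatchingNode(logger logr.Logger, recorder record.EventRecorder, node *corev1.Node) {
	logger.Info("keeping remediation CR: node still matches an unhealthy condition whose timeout has not expired yet",
		"node", node.GetName())
	recorder.Eventf(node, corev1.EventTypeNormal, "RemediationNotDeleted",
		"Node %s switched to another unhealthy condition, keeping its remediation CR", node.GetName())
}
```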

razo7 commented 9 months ago

/retest

slintes commented 9 months ago

as stated in the description, this needs a test!

/lgtm cancel

slintes commented 9 months ago

> Node status changes to Foo/False at 12:00:20
> Node status changes to Ready/True at 12:00:25
> At this point I expect NHC to delete the CR. The only unhealthy condition is Foo, which should only take effect at 12:10:20, but IIUC due to this fix the CR will not be deleted.

Not sure about that. Would it really make sense to consider the node as healthy and delete the CR during that (usually short) time? 🤔 Also, I don't see a big difference from the first use case... in the end it's about ignoring timeouts when the node switches from one unhealthy condition to another, no matter whether it's the same condition type or not, IMHO.

mshitrit commented 9 months ago

> Not sure about that. Would it really make sense to consider the node as healthy and delete the CR during that (usually short) time? 🤔

It's a good question. I guess that if we try to follow what the user would expect to be the "correct" behavior, the answer would be yes. I also think this is even more important when a large timeout is configured for another condition type (as in the second use case); ignoring it basically ignores the user's intention.

Having said that, I do see value in reducing deletion/creation of remediations for short timeouts, so I don't have any objection to excluding the deletion of the CR for short timeouts (less than 10s, maybe? 🤔).

razo7 commented 9 months ago

The issue here is when NHC has two unhealthy conditions for the same condition type but with different statuses (e.g. Unknown and False for the Ready condition), and that condition changes from one unhealthy status to another (e.g. Unknown -> False). NHC shouldn't delete the CR, and the node shouldn't be considered healthy, until this condition changes to a "healthy" (or rather non-"unhealthy") status. But when the new status is not another unhealthy status, then the node is healthy (regardless of whether another unhealthy condition was met and hasn't expired), and it is safe to delete the CR. This way we delete the CR at 12:00:25 (in Michael's second use case), and per condition type we consider only the timeout of the first unhealthy condition that is met (and ignore the second one).

For example, with UnhealthyCondition A (Type: Ready, Status: Unknown, Duration: 200s) and UnhealthyCondition B (Type: Ready, Status: False, Duration: 300s), we will only wait the 200-second timeout. According to my observations, the Ready condition changes from Unknown to False, i.e. from A to B, so in the beginning it waits 200 seconds until NHC creates the remediation CR, and not an extra 300 seconds for B. This way we ignore the second timeout (of B).
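
For reference, that example would look roughly like this in a NodeHealthCheck spec (trimmed sketch; selector and remediation template omitted, and the API version should be checked against the current CRD):

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-example
spec:
  unhealthyConditions:
    # A: matched first when Ready turns Unknown; its 200s timeout is the one that counts
    - type: Ready
      status: Unknown
      duration: 200s
    # B: Ready typically flips from Unknown to False afterwards; its 300s timeout is ignored
    - type: Ready
      status: "False"
      duration: 300s
```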

mshitrit commented 9 months ago

/lgtm

As discussed in the team meeting, we also need to make sure that it is well documented that when we have a sequence of several unhealthy conditions (either of the same type or not), NHC only considers the timeout of the first one.

slintes commented 9 months ago

Some e2e tests failed because of AWS limits; I'll wait for lgtm before retriggering.

/hold cancel

mshitrit commented 9 months ago

/retest

slintes commented 9 months ago

/hold cancel

slintes commented 9 months ago

/override e2e-k8s

openshift-ci[bot] commented 9 months ago

@slintes: Overrode contexts on behalf of slintes: e2e-k8s

In response to [this](https://github.com/medik8s/node-healthcheck-operator/pull/274#issuecomment-1881293400):

> /override e2e-k8s

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

slintes commented 9 months ago

/retest