Closed slintes closed 9 months ago
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: slintes
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/test 4.14-openshift-e2e
/hold We can add a log message (and an event) for the soonMatchingNodes, so a user would know what is happening to the node that was unhealthy. /lgtm
/retest
as stated in the description, this needs a test!
/lgtm cancel
Node status changes to Foo/False at 12:00:20 Node status changes to Ready/True at 12:00:25 At this point I expect NHC to delete the CR. The only unhealthy condition is Foo which should only take effect at 12:10:20, but IIUC due this fix the CR will not be deleted.
Not sure about that. Would it really make sense to consider the node as healthy and delete the CR during that (most times) short time? 🤔 Also, I don't see a big difference to the 1st usecase... in the end it's about ignoring timeouts in case the node switches from one unhealthy condition to another one, no matter if it's the same condition type or not IMHO.
Not sure about that. Would it really make sense to consider the node as healthy and delete the CR during that (most times) short time? 🤔
It's a good question. I guess that if we try to follow the what would be the "correct" behavior as far as the user expects the answer would be yes. I also think that this is even more important in case a large timeout is configured for another condition type (as in the second use case), Ignoring it basically ignores the user intention.
Having said that, I do see value in reducing deletion/creation of remediation for short timeouts so I don't have any objection of excluding the deletion of the CR for short timeouts (maybe 10s < :thinking: ) .
The issue here is when NHC has two unhealthy node conditions for the same condition but with different statuses (e.g. Unknown and False statuses for the Ready condition), and when this condition status changes from an unhealthy status to a new unhealthy status (e.g. Unknown ->False). NHC shouldn't delete the CR and the node shouldn't be considered as healthy until this condition status is changed to a new "healthy" (or non- "unhealthy") status. But when this new "healthy status" is not a different unhealthy status condition, then the node is healthy (regardless of whether another unhealthy condition was met, and didn't expire), and it is safe to delete the CR. This way we delete the CR at 12:00:25 (for Michael's second use case), and we will consider the timeout of the first unhealthy condition that is met per condition (and ignore the second one).
For example for UnhealthyConditon A: Type: Ready, Status: Unknown, Duration 200s
, and UnhealthyConditon B: Type: Ready, Status: False, Duration 300s
we will only wait 200 seconds timeout.
Since, according to my observations on the Ready condition it will change from Unknown to False, thus it will change from A to B, and in the beginning, it will wait 200 seconds until NHC creates remediation CR and not an extra 300 seconds for B. So this way we ignore the second timeout (of B).
/lgtm As discussed in the team meeting, we also need to make sure that it is well documented that when we have a sequence of several unhealthy conditions (either of the same type or not) NHC only considers the timeout of the first one.
some e2e tests failed because of AWS limits, will wait for lgtm before retriggering
/hold cancel
/retest
/hold cancel
/override e2e-k8s
@slintes: Overrode contexts on behalf of slintes: e2e-k8s
/retest
Don't delete remediation CRs for nodes which soon match unhealthy conditions, because they might just have switched from one unhealthy condition to another, but the timeout of the new condition didn't expire yet (e.g. from Ready=Unknown to Ready=False).
Will add a test after PTO
ECOPROJECT-1782