Closed slintes closed 2 months ago
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: slintes
/lgtm
known console issue on 4.17
/override ci/prow/4.17-openshift-e2e
4.16 looks like a race between cleanup of NHC test, and running MHC test...
2024-08-06T20:32:23.051827263Z INFO controllers.MachineHealthCheck.resource manager external remediation CR already exists, but it's not owned by us
/test 4.16-openshift-e2e
/override ci/prow/4.17-openshift-e2e
@slintes: Overrode contexts on behalf of slintes: ci/prow/4.17-openshift-e2e
Don't cherry-pick before https://github.com/medik8s/node-healthcheck-operator/pull/343 is merged, to avoid merge conflicts.
/cherry-pick release-0.8
@slintes: new pull request created: #344
Why we need this PR
In https://github.com/medik8s/node-healthcheck-operator/pull/301 we added support for multiple escalating remediations of the same remediation kind.
While investigating a report of false events about skipped control plane remediation, I noticed several issues related to that new feature. They all have the same cause: we only compared the CR name with node names, instead of using the new node name annotation.
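To illustrate the kind of mismatch described above, here is a minimal, self-contained sketch. It is not the operator's actual code: the `remediationCR` type and the `nodeNameOf` and `isOrphaned` helpers are hypothetical, and the annotation key is an assumption made for this example. The idea is that with multiple same-kind remediations the CR name may no longer equal the node name, so the node name should be read from the annotation first, with the CR name as a fallback for older CRs.

```go
package main

import "fmt"

// nodeNameAnnotation is an ASSUMED annotation key, used only for this sketch.
const nodeNameAnnotation = "remediation.medik8s.io/node-name"

// remediationCR is a hypothetical, stripped-down stand-in for the metadata
// of a remediation CR.
type remediationCR struct {
	Name        string
	Annotations map[string]string
}

// nodeNameOf returns the name of the node a CR targets. It prefers the node
// name annotation and falls back to the CR name when the annotation is unset.
func nodeNameOf(cr remediationCR) string {
	if name, ok := cr.Annotations[nodeNameAnnotation]; ok && name != "" {
		return name
	}
	return cr.Name
}

// isOrphaned reports whether the node targeted by the CR no longer exists.
func isOrphaned(cr remediationCR, existingNodes map[string]bool) bool {
	return !existingNodes[nodeNameOf(cr)]
}

func main() {
	nodes := map[string]bool{"worker-1": true}

	// CR named after its node, without an annotation.
	legacy := remediationCR{Name: "worker-1"}
	// CR with a generated name; the node is only known via the annotation.
	generated := remediationCR{
		Name:        "worker-1-abcde",
		Annotations: map[string]string{nodeNameAnnotation: "worker-1"},
	}

	// Comparing only the CR name would miss the second CR.
	fmt.Println("name-only match:", generated.Name == "worker-1")
	fmt.Println("resolved nodes:", nodeNameOf(legacy), nodeNameOf(generated))
	fmt.Println("orphaned:", isOrphaned(generated, nodes))
}
```

A name-only comparison treats the `generated` CR as unrelated to `worker-1`. That could plausibly lead to symptoms like the ones listed in this PR, such as a CR being treated as orphaned or a remediation being miscounted.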
Changes made
The first commit updates the unit tests to reveal 3 of the (at least) 4 issues:
[FAIL] Node Health Check CR Reconciliation with a single escalating remediation with multiple same kind support when an old remediation cr exists [It] an alert flag is set on remediation cr
[FAIL] Node Health Check CR Reconciliation with expected permanent node deletion [It] it should delete orphaned CR when node is deleted
[FAIL] Node Health Check CR Reconciliation control plane nodes when two control plane nodes are unhealthy [It] should remediate one after another
-> this is about the false event
The next 2 commits fix the issues in the code.
Which issue(s) this PR fixes
The false event was mentioned in: ECOPROJECT-2057
Test plan
As mentioned above, the first commit updates the unit tests to reveal the issues.