medik8s / self-node-remediation

Automatic repair for unhealthy Kubernetes nodes
https://www.medik8s.io/
Apache License 2.0
45 stars 17 forks source link

Improve safe time to reboot handling and more #214

Closed slintes closed 4 months ago

slintes commented 4 months ago

Why we need this PR

There are several issues in the current implementation of handling the safe time to reboot. Context: the safe time reboot determines how long we wait until we assume that the node rebooted and no workloads are running anymore, before continuing with the remediation process and accelerating workload rescheduling by deleting pods, or using the ungraceful node shutdown feature by applying the OutOfService taint. That time can be set in the SNRConfig (SafeTimeToAssumeNodeRebootedSeconds). We also calculate a minimum, which depends on other values in the config, as well as cluster size and watchdog timeout. The minimum time is calculated by all SNR agents on pod start, stored by unhealthy node's agent on the SNR CR during remediation, and then used by both the unhealthy node's agent and the manager for further remediation.

The issues are:

a) the agents crashloop in case the calculated time is lower than the specified time. b) our own default value can be too low. c) the calculated time might not be accurate anymore during remediation because of cluster size change. d) the unhealthy node's agent might not be able to store it's calculated time during remediation, so the manager won't use it it and fallback to the specified value. e) even when the agent would be able to store the value, there is a race condition with the manager which wants to store the specified value. f) both unhealthy agent and the manager are running most of the remediation code in parallel, which leads to unneeded conflicts on resource updates, and hard to understand and maintain code execution flows.

Changes made

These changes fix a+c+d+e:

This fixes a+b:

This fixes f:

More fixes done during development:

Which issue(s) this PR fixes

Fixes ECOPROJECT-1875 and more Supersedes https://github.com/medik8s/self-node-remediation/pull/197

Test plan

openshift-ci[bot] commented 4 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/test 4.15-openshift-e2e

slintes commented 4 months ago

/retest

is one of the new tests flaky...?

/test 4.12-test /test 4.13-test /test 4.14-test /test 4.15-test /test 4.16-test

slintes commented 4 months ago

crap, retest was wrong for the github workflow of course 🙈

clobrano commented 4 months ago

/lgtm Keeping the "hold" because there are other threads open /hold

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/medik8s/self-node-remediation/blob/main/OWNERS)~~ [clobrano,slintes] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
slintes commented 4 months ago

all discussions resolved

/hold cancel