medik8s / fence-agents-remediation

Kubernetes Operator for providing high availability between nodes by automatically remediating them using well-known fence-agents.
https://www.medik8s.io/
Apache License 2.0

[WIP] Support off action #124

Open k-keiichi-rh opened 9 months ago

k-keiichi-rh commented 9 months ago

This PR adds support for the off action.

The following is the FAR workflow with off action:

  1. FAR adds NoExecute taint to the failed node
  2. FAR powers off the failed node via the Fence Agent
  3. FAR deletes workloads in the failed node
  4. [User Intervention] Admins turn the failed node back on after confirming that it has recovered.
  5. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoExecute taint added in Step 1 is removed, and the node becomes schedulable again
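
For readers who want to see the ordering of these steps in code, here is a minimal in-memory sketch in Go. It is not the code in this PR: `Node`, `Fencer`, `fakeFencer`, `RemediateOff`, and `CleanupAfterRecovery` are all hypothetical names made up for illustration, and no Kubernetes client or real fence agent is involved.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical in-memory stand-in for a cluster node.
type Node struct {
	Name      string
	Tainted   bool     // models the NoExecute taint
	PoweredOn bool     // models the node's power state
	Workloads []string // models pods running on the node
}

// Fencer is a hypothetical abstraction over a fence agent's off action.
type Fencer interface {
	PowerOff(nodeName string) error
}

// fakeFencer records calls instead of talking to real hardware.
type fakeFencer struct{ calls []string }

func (f *fakeFencer) PowerOff(nodeName string) error {
	f.calls = append(f.calls, nodeName)
	return nil
}

// RemediateOff models steps 1-3: taint, power off, then delete workloads.
// Workloads are only deleted once the power-off has succeeded.
func RemediateOff(n *Node, f Fencer) error {
	n.Tainted = true // step 1: add the NoExecute taint
	if err := f.PowerOff(n.Name); err != nil { // step 2: fence the node
		return fmt.Errorf("power off %s: %w", n.Name, err)
	}
	n.PoweredOn = false
	n.Workloads = nil // step 3: delete workloads on the failed node
	return nil
}

// CleanupAfterRecovery models step 5: once the node is back (step 4 is a
// manual power-on by an admin), the taint is removed.
func CleanupAfterRecovery(n *Node) error {
	if !n.PoweredOn {
		return errors.New("node is still powered off; waiting for admin")
	}
	n.Tainted = false
	return nil
}

func main() {
	n := &Node{Name: "worker-1", PoweredOn: true, Workloads: []string{"pod-a"}}
	f := &fakeFencer{}

	if err := RemediateOff(n, f); err != nil {
		fmt.Println("remediation failed:", err)
		return
	}
	fmt.Printf("after remediation: %+v\n", *n)

	n.PoweredOn = true // step 4: admin turns the node on manually
	if err := CleanupAfterRecovery(n); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Printf("after cleanup: %+v\n", *n)
}
```

The point of the sketch is only the ordering: the taint goes on before the power-off, workloads are removed only after the power-off succeeds, and the taint stays in place until the node has been turned back on.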

In step 4, if users want to troubleshoot the failed node, they need to manually add the proper taints before turning it on. The documentation PR is tracked in https://issues.redhat.com/browse/ECOPROJECT-1756.

ECOPROJECT-1471

openshift-ci[bot] commented 9 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: k-keiichi-rh

Once this PR has been reviewed and has the lgtm label, please assign razo7 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/medik8s/fence-agents-remediation/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci[bot] commented 9 months ago

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mshitrit commented 9 months ago

Hi @k-keiichi-rh, since this PR is still a WIP I've converted it to a "draft" PR. We usually create our PRs as drafts in order to save cloud resources (a draft PR doesn't run e2e tests automatically).

We try to follow this process:

Let me know if that makes sense.

mshitrit commented 9 months ago

/test 4.14-openshift-e2e

openshift-ci[bot] commented 9 months ago

@k-keiichi-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.14-openshift-e2e | c5eb86063da8826f23cabcb002ac34b99b732960 | link | true | `/test 4.14-openshift-e2e` |

Full PR test history. Your PR dashboard.

razo7 commented 9 months ago

/test 4.13-openshift-e2e