medik8s / fence-agents-remediation

Kubernetes Operator for providing high availability between nodes by automatically remediating them using well-known fence-agents.
https://www.medik8s.io/
Apache License 2.0

Fence Agents Remediation (FAR)

fence-agents-remediation (FAR) is a Kubernetes operator that uses well-known agents to fence and remediate unhealthy nodes. The remediation includes rebooting the unhealthy node using a fence agent, and then evicting workloads from the unhealthy node. The operator is recommended when a node becomes unhealthy and we want to remediate it by completely isolating the node from the cluster and helping to recover its workloads. Isolation is needed because we can’t “trust” the unhealthy node, and it prevents the node from accessing shared resources like RWO volumes. Recovering the workloads on other nodes speeds up their return to service and limits their downtime.

FAR is one of the remediator operators by Medik8s, alongside Self Node Remediation and Machine Deletion Remediation, which were designed to run with the Node HealthCheck Operator (NHC). NHC detects an unhealthy node and creates a remediation Custom Resource (CR). It is recommended to use FAR with NHC for an easier and smoother experience by fully automating the remediation process, but FAR can also be used as a standalone remediator by more experienced users. Moreover, like other Medik8s operators, FAR was generated using the operator-sdk, and it supports Operator Lifecycle Manager (OLM).

About Fence Agents

FAR uses a fence agent to fence a Kubernetes node. Generally, fencing is the process of bringing an unresponsive or unhealthy computer into a safe state and isolating it. A fence agent is a software "driver" that prevents nodes from destroying data on shared storage by isolating corrupted nodes. Isolation with FAR is mostly power-based fencing, which enables power-cycling, resetting, or turning off the computer.

FAR uses some of the fence agents from the upstream repository by the ClusterLabs group, for example fence_ipmilan for Intelligent Platform Management Interface (IPMI) environments or fence_aws for the Amazon Web Services (AWS) platform. These upstream fence agents are Python scripts that isolate a corrupted node from the rest of the cluster using power-based fencing. When a node is switched off, it cannot corrupt any data on shared storage. The fence agents use command-line arguments rather than configuration files. To better understand the parameters, you can view the fence agent's metadata (e.g., fence_ipmilan -o metadata).

Advantages

How does FAR work?

The operator watches for new or deleted CRs called FenceAgentsRemediation (or far), which trigger remediation for the node named by the CR. When the CR name doesn't match a node in the cluster, the CR won't trigger any remediation by FAR. Remediation includes adding a taint on the node, rebooting the node with a fence agent, and finally deleting the remaining workloads.

FAR remediates by simply rebooting the unhealthy node and moving any remaining workloads to other nodes, so they can continue running and be isolated from the unhealthy node. The reboot is done by executing a fence agent for the unhealthy node, while evicting the workloads from this node is achieved by tainting the node and deleting the workloads. FAR's unique taint, medik8s.io/fence-agents-remediation, has a NoExecute effect, so any pods that don't tolerate this taint are evicted immediately, and they won't be scheduled on the node again after it has been rebooted as long as the taint remains (the taint is removed on FenceAgentsRemediation CR deletion). Deleting the workloads speeds up Kubernetes rescheduling of the remaining pods (most likely stateful pods) that are no longer running.
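
To make the eviction rule concrete, here is a minimal Python sketch of how a pod's tolerations can be checked against the FAR taint. It follows the general Kubernetes toleration-matching rules (key, operator, value, effect) and ignores tolerationSeconds. The function names are hypothetical and the empty taint value is an assumption, so treat it as an illustration rather than FAR's or Kubernetes' actual code.

```python
# Taint key and effect as described above; the empty value is an assumption.
FAR_TAINT = {
    "key": "medik8s.io/fence-agents-remediation",
    "value": "",
    "effect": "NoExecute",
}


def tolerates(toleration, taint):
    """Return True if a single toleration matches the given taint."""
    effect = toleration.get("effect", "")
    # An empty effect in a toleration matches every taint effect.
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Operator Equal: both key and value have to match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def is_evicted(pod_tolerations, taint=FAR_TAINT):
    """A pod is evicted when none of its tolerations matches the taint."""
    return not any(tolerates(t, taint) for t in pod_tolerations)
```

For example, a pod with no tolerations is evicted, while a pod that carries a toleration with operator Exists for the FAR taint key stays on the node.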

FAR includes the FenceAgentsRemediationTemplate (or fartemplate) Custom Resource Definition (CRD), which describes how to create a FenceAgentsRemediation CR. The template has the same fields as the far CR (e.g., agent name), and it is used by another operator or mechanism (e.g., NHC) to create the remediation CR automatically. The other operator is responsible for creating (and eventually deleting) the FenceAgentsRemediation CR with the name of the unhealthy node. FAR can also be used manually, without a fartemplate and an additional operator (see standalone FAR).

Operator Workflow

Prerequisites

Workflow

  1. FAR adds a NoExecute taint to the failed node => Ensures that no workloads are executed on the failed node after it reboots, and any stateless pods (that can’t tolerate the FAR NoExecute taint) are evicted immediately
  2. FAR reboots the failed node via the fence agent => After rebooting, there are no workloads on the failed node
  3. FAR forcefully deletes the pods on the failed node => The scheduler understands that it can schedule the failed pods on a different node
  4. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoExecute taint from Step 1 is removed, and the node becomes schedulable again
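
The ordering of these steps can be sketched with a small in-memory model. This is a toy simulation in Python, not the operator's reconcile code: the Node class, the fence callable, and the step names are hypothetical, and stopping before pod deletion when fencing fails is an assumption.

```python
from dataclasses import dataclass, field

FAR_TAINT_KEY = "medik8s.io/fence-agents-remediation"


@dataclass
class Node:
    """Toy stand-in for a cluster node (hypothetical, not a Kubernetes type)."""
    name: str
    taints: set = field(default_factory=set)
    pods: list = field(default_factory=list)


def remediate(cr_name, nodes, fence):
    """Run steps 1-3 for the node named by the CR and return the steps taken."""
    node = nodes.get(cr_name)
    if node is None:
        return []  # the CR name matches no node: nothing is remediated
    steps = []
    node.taints.add(FAR_TAINT_KEY)  # step 1: add the NoExecute taint
    steps.append("taint")
    if not fence(node.name):  # step 2: reboot through the fence agent
        return steps  # assumption: pods are left alone if fencing fails
    steps.append("fence")
    node.pods.clear()  # step 3: delete the remaining pods
    steps.append("delete-pods")
    return steps


def on_cr_deleted(cr_name, nodes):
    """Step 4: deleting the CR removes the taint again."""
    node = nodes.get(cr_name)
    if node is not None:
        node.taints.discard(FAR_TAINT_KEY)
```

Calling remediate("worker-1", nodes, fence) with a fence callable that returns True walks through all three steps, and on_cr_deleted("worker-1", nodes) models the taint removal that follows the CR deletion.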

FenceAgentsRemediation CR Status

The FenceAgentsRemediation CR status includes three conditions: Processing, FenceAgentActionSucceeded, and Succeeded. Each condition has a status (true/false/unknown), a message, and a reason which indicates the state of the condition until it is met. Using these conditions we can better understand the state of the CR and whether an error occurred. For example, see the FenceAgentsRemediation CR status below and the state of its conditions after a successful remediation.

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: NODE_NAME
spec: 
.
.
.
status:
  conditions:
    - type: Processing
      message: >-
        The unhealthy node was fully remediated (it was tainted, fenced using
        the fence agent and all the node resources have been deleted)
      reason: RemediationFinishedSuccessfully
      status: 'False'
    - type: FenceAgentActionSucceeded
      message: >-
        FAR taint was added and the fence agent command has been created and
        executed successfully
      reason: FenceAgentSucceeded
      status: 'True'   
    - type: Succeeded
      message: >-
        The unhealthy node was fully remediated (it was tainted, fenced using
        the fence agent and all the node resources have been deleted)
      reason: RemediationFinishedSuccessfully
      status: 'True'
  lastUpdateTime: '2024-01-30T10:49:46Z'
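
As a rough illustration of how these conditions can be read programmatically, the following Python sketch summarizes a list of conditions shaped like the example above. Only the successful combination is taken from the example; the other labels returned by summarize (and the function names themselves) are assumptions made for illustration.

```python
def condition_status(conditions, cond_type):
    """Return the status string of the named condition, or 'Unknown'."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


def summarize(conditions):
    """Collapse the three FAR conditions into one coarse label."""
    if condition_status(conditions, "Succeeded") == "True":
        return "succeeded"
    processing = condition_status(conditions, "Processing")
    if processing == "True":
        return "processing"
    if processing == "False":
        # Assumption: not processing and not succeeded means it stopped early.
        return "stopped-without-success"
    return "unknown"
```

With the conditions from the example status (Processing 'False', FenceAgentActionSucceeded 'True', Succeeded 'True'), summarize returns "succeeded".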

FAR Remediation Events

The operator emits remediation events on the node and the remediation CR for a better understanding of the remediation process. Two important remediation events are FenceAgentSucceeded and RemediationFinished, which signify that the fence agent command succeeded and that the remediation was completed. All the remediation events of FAR (as well as of other Medik8s operators) have a message that begins with [remediation]. Therefore, to easily filter these events, run oc get events -A | awk '/\[remediation\]/ || NR==1' to get any remediation event, or oc get events -A | awk '/\[remediation\]/ && /worker-1/ || NR==1' to get any remediation event for a node and CR named worker-1.
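
If awk is not at hand, the same filtering logic can be reproduced in a few lines of Python. This is a minimal sketch that mirrors the two awk expressions above (keep the header row, keep lines containing [remediation], optionally also require a name); the function name is hypothetical.

```python
def filter_remediation_events(lines, name=None):
    """Keep the header row plus every line that mentions [remediation].

    When name is given, a line must also contain that name, which mirrors
    the '/\\[remediation\\]/ && /worker-1/ || NR==1' awk expression.
    """
    kept = []
    for index, line in enumerate(lines):
        is_header = index == 0  # awk's NR==1
        matches = "[remediation]" in line and (name is None or name in line)
        if is_header or matches:
            kept.append(line)
    return kept
```

The function takes the output of oc get events -A split into lines, so it can be fed from sys.stdin or from a captured subprocess output.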

Installation

There are three ways to install the operator:

Deploy the latest version

After every PR is merged to the main branch, the images are built and pushed to quay.io (by the post-submit job). To deploy FAR using these images you need:

Note: Installing FAR on a new namespace (olm in the commands below) requires setting some labels on the namespace prior to installing FAR:

kubectl label --overwrite ns olm security.openshift.io/scc.podSecurityLabelSync=false
kubectl label --overwrite ns olm pod-security.kubernetes.io/enforce=privileged

Deploy from the Kubernetes community

Go to OperatorHub, click on Install, and follow the instructions on how to install the operator on Kubernetes.

Build and deploy from sources

Usage

FAR is recommended for use with NHC to automate high availability for unhealthy nodes, since NHC detects unhealthy nodes and can create an external remediation CR, e.g., a FenceAgentsRemediation CR, for them. This automated approach makes NHC responsible for creating and deleting the FenceAgentsRemediation CRs. FAR can also act as a standalone remediator, but then an advanced administrator must identify the nodes' health in order to create (and eventually delete) these CRs.

Either way, the user must be familiar with the fence agent to be used: its parameters and any other requirements on the cluster (e.g., fence_ipmilan needs machines that support IPMI).

FAR with NHC

Example FenceAgentsRemediationTemplate CR

The fartemplate CR is created by the administrator, and NHC can use it for creating a remediation CR, e.g., FenceAgentsRemediation. For a better understanding, see the example of a dummy fartemplate object below:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fenceagentsremediationtemplate-default
  namespace: default
spec:
  template: {}

Note: The FenceAgentsRemediationTemplate CR must be created in the same namespace in which the FAR operator has been installed.

The following NodeHealthCheck is configured to use the example fenceagentsremediationtemplate-default template above:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  remediationTemplate:
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    name: fenceagentsremediationtemplate-default
    namespace: default

NHC creates a FenceAgentsRemediation CR from the fartemplate after it detects an unhealthy node (according to NHC's unhealthy conditions). The FenceAgentsRemediation CRs are deleted by NHC after it detects that the node is healthy again.
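
The relationship between the template and the CR that NHC creates can be sketched as a small Python function. This is an illustration only, under two assumptions: that the fields for the remediation CR live under spec.template.spec of the template, and that the CR is created in the template's namespace. The function name is hypothetical and this is not NHC's actual code.

```python
import copy


def remediation_from_template(template, node_name):
    """Build a FenceAgentsRemediation manifest (as a dict) for one node."""
    inner = template.get("spec", {}).get("template", {})
    return {
        "apiVersion": template["apiVersion"],
        "kind": "FenceAgentsRemediation",
        "metadata": {
            # The CR is named after the unhealthy node.
            "name": node_name,
            # Assumption: the CR lives next to the template.
            "namespace": template["metadata"]["namespace"],
        },
        # Deep copy so later edits to the CR never touch the template.
        "spec": copy.deepcopy(inner.get("spec", {})),
    }
```

Applied to the dummy template above with node name worker-1, the result is a FenceAgentsRemediation manifest named worker-1 in the default namespace with an empty spec.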

Standalone FAR

Example FenceAgentsRemediation CR

The FenceAgentsRemediation CR is created by the administrator and is used to trigger the fence agent on a specific node. The CR includes an agent field for the fence agent name, a sharedparameters field with the parameters that are shared by all nodes, and a nodeparameters field with the parameters that are specific to each fenced node. For a better understanding, see the example FenceAgentsRemediation CR for node worker-1 below (it is also available as the sample FAR):

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: worker-1
spec:
  agent: fence_ipmilan
  retrycount: 5
  retryinterval: "5s"
  timeout: "60s"
  sharedparameters:
    --username: "admin"
    --password: "password"
    --lanplus: ""
    --action: "reboot"
    --ip: "192.168.111.1"
  nodeparameters:
    --ipport:
      master-0: "6230"
      master-1: "6231"
      master-2: "6232"
      worker-0: "6233"
      worker-1: "6234"
      worker-2: "6235"
  remediationStrategy: ResourceDeletion
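
To show how sharedparameters and nodeparameters could combine into a single fence agent invocation, here is a minimal Python sketch. The flag=value formatting, the handling of empty values as bare flags, and the function name are assumptions made for illustration; the operator's real argument building may differ.

```python
def build_fence_command(spec, node_name):
    """Return the argument list for the fence agent of one node."""
    args = [spec["agent"]]
    # Shared parameters apply to every node; an empty value is a bare flag.
    for flag, value in spec.get("sharedparameters", {}).items():
        args.append(f"{flag}={value}" if value else flag)
    # Node parameters map each flag to per-node values; pick this node's one.
    for flag, per_node in spec.get("nodeparameters", {}).items():
        if node_name in per_node:
            args.append(f"{flag}={per_node[node_name]}")
    return args
```

For the example CR above and node worker-1, the sketch produces fence_ipmilan --username=admin --password=password --lanplus --action=reboot --ip=192.168.111.1 --ipport=6234.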

Tests

Run code checks and unit tests

Run make test

Run e2e tests

  1. Deploy the operator as explained above
  2. (Only for AWS platforms) Run make ocp-aws-credentials to add a sufficient CredentialsRequest.
  3. Export the operator installed namespace (e.g., openshift-workload-availability) before running the e2e test: export OPERATOR_NS=openshift-workload-availability && make test-e2e

Run Scorecard tests

Run make test-scorecard on a running Kubernetes cluster to statically validate the operator bundle directory using Scorecard.

Troubleshooting

  1. Watch the value, message, and reason of the FenceAgentsRemediation CR status conditions to better understand whether the fence agent action succeeded and the remediation completed.
  2. Watch for the remediation events emitted on the FenceAgentsRemediation CR or the remediated node to follow the remediation process more easily.
  3. Investigate the logs of the manager container in FAR’s pod (kubectl logs -n <INSTALLED_NAMESPACE> --selector='app.kubernetes.io/name=fence-agents-remediation-operator' -c manager).
  4. Use the Medik8s team's must-gather (for OCP only) by running oc adm must-gather --image=quay.io/medik8s/must-gather. It collects related debug information for FAR and the rest of the Medik8s team's operators.

Help

Feel free to join our Google group to get more info - Medik8s Google Group.