medik8s / fence-agents-remediation

Kubernetes Operator for providing high availability between nodes by automatically remediating them using well-known fence-agents.
https://www.medik8s.io/
Apache License 2.0

Unable to fence nodes with `fence_azure_arm` agent #90

Closed jcanocan closed 11 months ago

jcanocan commented 11 months ago

Hi!

I'm currently experimenting with FAR on Azure VMs. I've been able to install NHC and FAR in an OCP 4.13 cluster, create the FAR template, and start the remediation process. This is the FAR template I'm currently using:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fenceagentsremediationtemplate-default
  namespace: openshift-operators
spec:
  template:
    spec:
      sharedparameters:
        '--action': reboot
        '-l': ea6bxxx
        '-p': y~xxx
        '--resourceGroup': jcano-cluster-mfxww-rg
        '--tenantId': 60xxx
        '--subscriptionId': 89xxx
      nodeparameters:
        '--plug=':
          jcano-cluster-mfxww-master-0: jcano-cluster-mfxww-master-0
          jcano-cluster-mfxww-master-1: jcano-cluster-mfxww-master-1
          jcano-cluster-mfxww-master-2: jcano-cluster-mfxww-master-2
          jcano-cluster-mfxww-worker-germanywestcentral1-b58kw: jcano-cluster-mfxww-worker-germanywestcentral1-b58kw
          jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd: jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd
          jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5: jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5
      agent: fence_azure_arm

I've tried the fence_azure_arm tool standalone, locally, to restart a faulty VM hosting an OCP node. To do so, I stopped the kubelet process to bring a node into an unhealthy state. It worked, but it requires a tiny modification; see: https://github.com/Azure/azure-sdk-for-python/issues/30983#issuecomment-1647081509

However, it does not work with the FAR operator, which logs the following errors:

2023-10-10T15:08:07.128294848Z  INFO    controllers.FenceAgentsRemediation  Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.128341449Z  INFO    controllers.FenceAgentsRemediation  Check FAR CR's name
2023-10-10T15:08:07.138883921Z  INFO    controllers.FenceAgentsRemediation  Finalizer was added {"CR Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.138914222Z  INFO    controllers.FenceAgentsRemediation  Updating Status Condition   {"processingConditionStatus": "True", "fenceAgentActionSucceededConditionStatus": "Unknown", "succededConditionStatus": "Unknown", "reason": "RemediationStarted", "LastUpdateTime": "2023-10-10 15:08:07.138913322 +0000 UTC m=+23184.695547222"}
2023-10-10T15:08:07.151777431Z  INFO    controllers.FenceAgentsRemediation  Finish FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151923434Z  INFO    controllers.FenceAgentsRemediation  Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151954534Z  INFO    controllers.FenceAgentsRemediation  Check FAR CR's name
2023-10-10T15:08:07.152025935Z  INFO    controllers.FenceAgentsRemediation  Try adding FAR (Medik8s) remediation taint  {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170359134Z  INFO    taints  Taint was added {"taint effect": "NoExecute", "taint list": [{"key":"node.kubernetes.io/unreachable","effect":"NoSchedule","timeAdded":"2023-10-10T15:03:06Z"},{"key":"node.kubernetes.io/unreachable","effect":"NoExecute","timeAdded":"2023-10-10T15:03:12Z"},{"key":"medik8s.io/fence-agents-remediation","effect":"NoExecute","timeAdded":"2023-10-10T15:08:07Z"}]}
2023-10-10T15:08:07.170395735Z  INFO    controllers.FenceAgentsRemediation  Fetch FAR's pod
2023-10-10T15:08:07.170512137Z  INFO    controllers.FenceAgentsRemediation  Combine fence agent parameters  {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170539037Z  INFO    controllers.FenceAgentsRemediation  Execute the fence agent {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.340974815Z  ERROR   executer    Failed to run exec command  {"stdout": "", "stderr": "time=\"2023-10-10T15:08:07Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"fence_azure_arm\\\": executable file not found in $PATH\"\n", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/pkg/cli.executer.Execute
    /remote-source/app/pkg/cli/cliexecuter.go:92
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
    /remote-source/app/controllers/fenceagentsremediation_controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.341030816Z  ERROR   controllers.FenceAgentsRemediation  Fence Agent response was a failure  {"CR's Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
    /remote-source/app/controllers/fenceagentsremediation_controller.go:206
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.350733575Z  INFO    controllers.FenceAgentsRemediation  Finish FenceAgentsRemediation Reconcile

It looks like FAR is not able to find the fence_azure_arm tool in its PATH.

Thanks in advance!

clobrano commented 11 months ago

Hey @jcanocan,

> it worked but requires a tiny modification, see: https://github.com/Azure/azure-sdk-for-python/issues/30983#issuecomment-1647081509

Thank you for pointing this out, really appreciated!

> It looks like FAR is not able to find the fence_azure_arm tool in its PATH.

I think fence_azure_arm is not installed in FAR's image. Currently the image installs fence-agents-all (and the AWS agent), but that does not seem to include the Azure one:

https://github.com/medik8s/fence-agents-remediation/blob/b2d3419a73a73231b70e46eb4fb28b39194609a6/Dockerfile#L39C1-L42C24

 ➤  docker run --rm -it quay.io/clobrano/fence-agents-remediation-fencing-agents bash
[root@4f3bb118da07 /]# fence_a 
fence_amt_ws    fence_apc       fence_apc_snmp  fence_aws       
[root@4f3bb118da07 /]# fence_ 
fence_amt_ws           fence_brocade          fence_eaton_snmp       fence_hpblade          fence_ilo2             fence_ilo5             fence_imm              fence_kdump            fence_rsb              fence_vmware_soap
fence_apc              fence_cisco_mds        fence_emerson          fence_ibmblade         fence_ilo3             fence_ilo5_ssh         fence_intelmodular     fence_mpath            fence_sbd              fence_wti
fence_apc_snmp         fence_cisco_ucs        fence_eps              fence_idrac            fence_ilo3_ssh         fence_ilo_moonshot     fence_ipdu             fence_redfish          fence_scsi             fence_xvm
fence_aws              fence_compute          fence_evacuate         fence_ifmib            fence_ilo4             fence_ilo_mp           fence_ipmilan          fence_rhevm            fence_virt             
fence_bladecenter      fence_drac5            fence_heuristics_ping  fence_ilo              fence_ilo4_ssh         fence_ilo_ssh          fence_ipmilanplus      fence_rsa              fence_vmware_rest      
[root@4f3bb118da07 /]# fence_
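
The check above relies on tab completion. The same check can be scripted; the following is a minimal sketch, not something FAR ships. The helper name `agent_available` is hypothetical, and it only uses the POSIX `command -v` builtin to test whether a name resolves to an executable on `PATH`.

```shell
#!/bin/sh
# Hypothetical helper (not part of FAR): succeeds if the given
# fence agent name resolves to an executable on PATH.
agent_available() {
    command -v "$1" >/dev/null 2>&1
}

# Example: report whether the Azure agent is present in this image.
if agent_available fence_azure_arm; then
    echo "fence_azure_arm found"
else
    echo "fence_azure_arm missing from PATH"
fi
```

Run inside the agents image (for example from the `docker run ... bash` session shown above), this should report the agent as missing until the package providing it is installed.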

jcanocan commented 11 months ago

Thanks for answering back! I'm glad to help :blush:

Regarding https://github.com/Azure/azure-sdk-for-python/issues/30983#issuecomment-1647081509: looks like they are not motivated to make the change. Moreover, it will take some time to land. So, what do you think about including the following command right after the fence-azure-arm package installation?

RUN sed -i 's/\"instanceView\"/expand=\"instanceView\"/' /usr/sbin/fence_azure_arm 

I would agree that it's not a very clean solution, just a workaround. Nevertheless, it will allow the fence agent to work.
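
To see what the substitution does, here is a minimal sketch that applies an equivalent `sed` expression to a sample line read from stdin. The sample line and both function names are assumptions made for illustration; they are not copied from `fence_azure_arm`. The second variant adds a guard, which is not part of the proposal above, so that running the patch twice does not insert `expand=` twice.

```shell
#!/bin/sh
# Equivalent substitution to the RUN line, reading from stdin instead
# of editing /usr/sbin/fence_azure_arm in place.
patch_instance_view() {
    sed 's/"instanceView"/expand="instanceView"/'
}

# Guarded variant (an addition, not from the thread): skip lines that
# already contain expand="instanceView", so the patch is idempotent.
patch_instance_view_once() {
    sed '/expand="instanceView"/!s/"instanceView"/expand="instanceView"/'
}

# Hypothetical call site, for illustration only.
sample='status = client.virtual_machines.get(group, name, "instanceView")'

# Prints the sample line with the positional argument turned into
# the keyword argument expand="instanceView".
printf '%s\n' "$sample" | patch_instance_view
```

The unguarded form is fine for a one-shot image build. The guarded form may be preferable if the same layer or script could run against an already patched file.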

clobrano commented 11 months ago

> Looks like they are not motivated to make the change.

It seems they need to propagate the request to the right people :)

> I would agree that it's not a very clean solution, just a workaround. Nevertheless, it will allow the fence agent to work.

We actually want to decouple the operator's image from the one containing the agents so that one could use an image with a specific fencing agent and the related quirks to make it work.

razo7 commented 11 months ago

First of all, thanks Javier for raising the idea of using the Azure fence agent!

> Looks like they are not motivated to make the change.

Yes, how about opening a PR with the above fix against the https://github.com/ClusterLabs/fence-agents/tree/main repo? They are available on their mailing list if you want to discuss it beforehand.

jcanocan commented 11 months ago

> We actually want to decouple the operator's image from the one containing the agents so that one could use an image with a specific fencing agent and the related quirks to make it work.

Thanks for letting me know. Sounds nice :)

> First of all, thanks Javier for raising the idea of using the Azure fence agent!
>
> > Looks like they are not motivated to make the change.
>
> Yes, how about opening a PR with the above fix against the https://github.com/ClusterLabs/fence-agents/tree/main repo? They are available on their mailing list if you want to discuss it beforehand.

Thanks for the suggestion. I misinterpreted the wording in https://github.com/Azure/azure-sdk-for-python/issues/30983#issuecomment-1647081509, and I just realized that the Azure fence agent is independent of https://github.com/Azure/azure-sdk-for-python. Apologies for the confusion. I will try to post a PR fixing this issue in the fence agent.

Meanwhile, I will learn how to build the operator locally and deploy it in an OCP cluster.

jcanocan commented 11 months ago

Posted https://github.com/ClusterLabs/fence-agents/pull/562. Just in case you are curious :)