metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

BMO reports a RegistrationError when managing iDRAC BMC hosts #1339

Closed: fracappa closed this issue 11 months ago

fracappa commented 1 year ago

Issue Details

I'm currently using Metal3 to manage my Dell servers equipped with iDRAC BMCs. I've successfully set up a Kubernetes cluster using kubeadm and deployed the Bare Metal Operator (BMO) via the provided deploy.sh script. This is what my BareMetalHost manifest looks like:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: fall-bmh
spec:
  online: true
  bootMACAddress: <MAC-of-a-physical-NIC>
  bootMode: UEFI
  bmc:
    address: idrac-redfish://<iDRAC-IP>:443/redfish/v1/Systems/System.Embedded.1
    credentialsName: bmc-credentials
    disableCertificateVerification: true
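
For context, BMO expects the Secret referenced by `credentialsName` to contain `username` and `password` keys with base64-encoded values. A minimal sketch of such a Secret (values are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bmc-credentials
type: Opaque
data:
  username: <base64-encoded-username>  # e.g. echo -n 'root' | base64
  password: <base64-encoded-password>
```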

However, when attempting to register my machines by creating custom BareMetalHost (bmh) resources, I encounter a RegistrationError. Reviewing the logs of the baremetal-operator-controller-manager, I found this message:

{"level":"info","ts":"2023-08-24T13:52:32Z","logger":"controllers.BareMetalHost","msg":"publishing event","baremetalhost":{"name":"fall-bmh","namespace":"metal3"},"reason":"RegistrationError","message":"<Node-ID> failed verify step clear_job_queue with unexpected error: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)"}
{"level":"info","ts":"2023-08-24T13:52:32Z","logger":"controllers.BMCEventSubscription","msg":"start","bmceventsubscription":{"name":"fall-bmh","namespace":"metal3"}}
{"level":"info","ts":"2023-08-24T13:52:32Z","logger":"controllers.BMCEventSubscription","msg":"done","bmceventsubscription":{"name":"fall-bmh","namespace":"metal3"}}
{"level":"info","ts":"2023-08-24T13:52:32Z","logger":"controllers.BareMetalHost","msg":"done","baremetalhost":{"name":"fall-bmh","namespace":"metal3"},"provisioningState":"registering","requeue":false,"after":425.166795068}

Expectation

It seems the BMO receives a poorly formatted message from iDRAC. However, accessing the URL https://<iDRAC-IP>:443/redfish/v1/Systems/System.Embedded.1 directly in my browser yields correctly formatted JSON. Therefore, I believe iDRAC is reachable and behaving correctly, which rules out networking issues.

Additional information

Although I'm not sure whether it's relevant, I'm operating a single-node Kubernetes cluster created with kubeadm, with the control-plane taint removed via `kubectl taint nodes --all node-role.kubernetes.io/control-plane-` (the trailing `-` removes the taint so workloads can schedule).

Additionally, I logged into the ironic container and issued a curl against the same resource, https://<iDRAC-IP>:443/redfish/v1/Systems/System.Embedded.1, successfully receiving the expected JSON.
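
For reference, a sketch of that check (the namespace and pod selector are assumptions based on a default deploy; adjust to your setup):

```sh
# run curl from inside the ironic container against the same Redfish resource
# (namespace and label selector are assumptions; adjust to your deployment)
IRONIC_POD=$(kubectl -n baremetal-operator-system get pods -l name=ironic -o name | head -n1)
kubectl -n baremetal-operator-system exec -it "$IRONIC_POD" -c ironic -- \
  curl -sk "https://<iDRAC-IP>:443/redfish/v1/Systems/System.Embedded.1"
```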

Environment

/kind bug

metal3-io-bot commented 1 year ago

This issue is currently awaiting triage. If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Sunnatillo commented 1 year ago

Please take a look at this issue. @kashifest, @lentzi90 @Rozzii @dtantsur

elfosardo commented 1 year ago

FYI, we've been discussing this on Slack: https://kubernetes.slack.com/archives/CHD49TLE7/p1693206310793309

fracappa commented 1 year ago

Hi there. I've been dealing with the same problem for a week now and haven't been able to find a solution.

The only clue I've noticed points to a potential network misconfiguration.

More specifically, I'm getting this error from the ironic-inspector container within the ironic pod:

2023-08-30 08:01:58.702 1 ERROR ironic_inspector.conductor.manager [-] The periodic ironic_inspector.conductor.manager.sync_with_ironic failed with: Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 290, in run
    work()
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 64, in __call__
    return self.callback(*self.args, **self.kwargs)
  File "/usr/lib/python3.9/site-packages/futurist/periodics.py", line 178, in decorator
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 233, in sync_with_ironic
    ironic_node_uuids = {node.id for node in ironic_nodes}
  File "/usr/lib/python3.9/site-packages/ironic_inspector/conductor/manager.py", line 233, in <setcomp>
    ironic_node_uuids = {node.id for node in ironic_nodes}
  File "/usr/lib/python3.9/site-packages/openstack/resource.py", line 2077, in list
    exceptions.raise_from_response(response)
  File "/usr/lib/python3.9/site-packages/openstack/exceptions.py", line 263, in raise_from_response
    raise cls(
openstack.exceptions.HttpException: HttpException: 401: Client Error for url: http://<IP>:6385/v1/nodes?fields=uuid, Incorrect username or password
: None: None
2023-08-30 08:01:58.702 1 ERROR ironic_inspector.conductor.manager NoneType: None
2023-08-30 08:01:58.702 1 ERROR ironic_inspector.conductor.manager 

There seems to be an authentication error on the Ironic endpoint when accessing the /v1/nodes subresource, but I don't know anything about credentials; at least, I didn't set any. I'm not sure this is related to the root issue, though. It shouldn't be a connectivity problem, since I can curl <ironic-ip>:port/v1/ and correctly receive a response.
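
One caveat on that connectivity check: Ironic's version-discovery documents (`/` and `/v1/`) are served without authentication, while `/v1/nodes` requires credentials, so a successful curl of `/v1/` does not rule out an auth problem:

```sh
# version discovery is unauthenticated; the node list is not
curl "http://<ironic-ip>:6385/v1/"        # succeeds even without credentials
curl "http://<ironic-ip>:6385/v1/nodes"   # 401 unless valid credentials are supplied
```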

Furthermore, I've noticed my ironic pod takes the IP address of a different NIC than the provisioning interface I attached it to. For example, I have eno1 and eno2: eno1 has a static public IP used to SSH into the server, and eno2 has a custom static IP that I use for the Ironic endpoint (e.g. 172.22.0.1). I don't know whether this could be the problem.
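
As a debugging aid, the interface and IP the Ironic pod binds to are driven by its environment; in the stock manifests these come from a ConfigMap (the ConfigMap name and namespace below are assumptions based on the default deploy.sh output):

```sh
# inspect the provisioning-network settings handed to the ironic pod
# (names are assumptions; adjust to your deployment)
kubectl -n baremetal-operator-system get configmap ironic-bmo-configmap -o yaml \
  | grep -E 'PROVISIONING_INTERFACE|PROVISIONING_IP|DHCP_RANGE'
```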

Thanks to anybody who can help me.

fracappa commented 1 year ago

I redeployed the k8s cluster with kubeadm without passing a custom configuration (--config=config.yaml), and something changed.

Now the ironic-dnsmasq container seems to be trying to assign an IP address to the target host through the provisioning interface. Its logs show the following:

dnsmasq-dhcp: 925516945 DHCPDISCOVER(enp0s31f6) <MAC-address>
dnsmasq-dhcp: 925516945 tags: enp0s31f6
dnsmasq-dhcp: 925516945 DHCPOFFER(enp0s31f6) 172.23.0.33 <MAC-address>
dnsmasq-dhcp: 925516945 requested options: 1:netmask, 3:router, 12:hostname, 15:domain-name, 
dnsmasq-dhcp: 925516945 requested options: 6:dns-server, 26:mtu, 33:static-route, 121:classless-static-route, 
dnsmasq-dhcp: 925516945 requested options: 119:domain-search, 42:ntp-server, 120:sip-server
dnsmasq-dhcp: 925516945 bootfile name: /undionly.kpxe
dnsmasq-dhcp: 925516945 server name: 172.23.0.1
dnsmasq-dhcp: 925516945 next server: 172.23.0.1
dnsmasq-dhcp: 925516945 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 925516945 sent size:  4 option: 54 server-identifier  172.23.0.1
dnsmasq-dhcp: 925516945 sent size:  4 option: 51 lease-time  1h
dnsmasq-dhcp: 925516945 sent size:  4 option: 58 T1  30m
dnsmasq-dhcp: 925516945 sent size:  4 option: 59 T2  52m30s
dnsmasq-dhcp: 925516945 sent size:  4 option:  1 netmask  255.255.255.0
dnsmasq-dhcp: 925516945 sent size:  4 option: 28 broadcast  172.23.0.255
dnsmasq-dhcp: 925516945 available DHCP range: 172.23.0.10 -- 172.23.0.100
dnsmasq-dhcp: 925516945 client provides name: fall
dnsmasq-dhcp: 3988913982 available DHCP range: 172.23.0.10 -- 172.23.0.100
dnsmasq-dhcp: 3988913982 client provides name: fall
dnsmasq-dhcp: 3988913982 DHCPDISCOVER(enp0s31f6) <MAC-address>
dnsmasq-dhcp: 3988913982 tags: enp0s31f6
dnsmasq-dhcp: 3988913982 DHCPOFFER(enp0s31f6) 172.23.0.33 <MAC-address>

After a while I get a timeout (note that the repeated DHCPOFFERs above are never followed by a DHCPREQUEST, so the host apparently never accepts the offer), which is then translated into an error during the registration phase of my BareMetalHost resource.

Does somebody have any clue why this happens?

Rozzii commented 1 year ago

/cc @Rozzii

I will also take a look later. IMO this isn't a bug; it looks like a credential misconfiguration. I will remove the bug label and add the question label instead.

Rozzii commented 11 months ago

I have no idea about this; I still think it is a misconfiguration.

/help

metal3-io-bot commented 11 months ago

@Rozzii: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/metal3-io/baremetal-operator/issues/1339):

> I have no idea about this I still think this is a misconfiguration.
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fracappa commented 11 months ago

The problem was that I didn't base64-encode the BMC credentials in the Secret, so Ironic ended up decoding plaintext values, which caused the encoding error. Encoding the credentials in base64 solved the problem.
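
For anyone hitting the same RegistrationError: values under `data:` in a Kubernetes Secret must be base64-encoded, and per the resolution above, plaintext values there are what led to the 'latin-1' codec error. A minimal sketch of the fix (credential values are placeholders):

```sh
# base64-encode the plaintext credentials (-n avoids encoding a trailing newline)
echo -n 'myuser' | base64        # bXl1c2Vy
echo -n 'mypassword' | base64    # bXlwYXNzd29yZA==

# or let kubectl do the encoding when creating the Secret:
kubectl -n metal3 create secret generic bmc-credentials \
  --from-literal=username=myuser \
  --from-literal=password=mypassword
```

The encoded strings go under `data:` in the `bmc-credentials` Secret referenced by the BareMetalHost manifest above.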