Closed alexkasatikov closed 8 months ago
As an idea, something like this could be used instead:
lspci -s $(ethtool -i $iname | grep bus-info | awk '{print $2}') | cut -d ':' -f 3
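A slightly more defensive take on the one-liner above, as a sketch only. It assumes that ethtool -i prints a "bus-info:" line and that lspci -s accepts that address, as is usual on Linux rescue systems. The function name nic_model is hypothetical and not part of caph.

```shell
# Hypothetical helper (not from caph): print the PCI model string of ONE interface.
# Assumes `ethtool -i IFACE` prints a "bus-info: <pci address>" line.
nic_model() {
    iface=$1
    # Look up the PCI address that backs this interface.
    bus=$(ethtool -i "$iface" | awk '$1 == "bus-info:" {print $2}')
    # Virtual interfaces may have no PCI address; fail instead of guessing.
    [ -n "$bus" ] || return 1
    # Query only that device and strip the "<slot> <class>: " prefix,
    # so a single model line comes back even on multi-NIC hosts.
    lspci -s "$bus" | sed 's/^[^ ]* [^:]*: //'
}

# Usage (on a real host): nic_model eth0
```

Compared with cut -d ':' -f 3, the sed expression does not leave a leading space and does not truncate a model name that itself contains a colon.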
@guettli please have a look here
I ran into a similar error on an RX220 host with the following versions:
Log:
{"level":"ERROR","time":"2023-10-06T17:33:42.789Z","file":"controller/controller.go:324","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","HetznerBareMetalHost":{"name":"bm-arm-01","namespace":"default"},"namespace":"default","name":"bm-arm-01","reconcileID":"3f8d7595-bc7f-4174-97d0-b7b49efbc96d","error":"failed to reconcile HetznerBareMetalHost default/bm-arm-01: action \"registering\" failed: failed to get hardware details: failed to obtain hardware details Nics: failed to unmarshal {\"name\":\"eth0\",\"model\":\"Intel Corporation I350 Gigabit Network Connection (rev 01)}. Original ssh output name=\"eth0\" model=\"Intel Corporation I350 Gigabit Network Connection (rev 01)\nIntel Corporation I350 Gigabit Network Connection (rev 01)\" mac=\"88:88:88:88:88:88\" ip=\"111.111.111.11/26\" speedMbps=\"1000\"\nname=\"eth0\" model=\"Intel Corporation I350 Gigabit Network Connection (rev 01)\nIntel Corporation I350 Gigabit Network Connection (rev 01)\" mac=\"88:88:88:88:88:88\" ip=\"2a01:2a01:2a01:2a01::2/64\" speedMbps=\"1000\": unexpected end of JSON input","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
Above log line, pretty printed:
❯ xclip -o | yq -P
level: ERROR
time: "2023-10-06T17:33:42.789Z"
file: controller/controller.go:324
message: Reconciler error
controller: hetznerbaremetalhost
controllerGroup: infrastructure.cluster.x-k8s.io
controllerKind: HetznerBareMetalHost
HetznerBareMetalHost:
  name: bm-arm-01
  namespace: default
namespace: default
name: bm-arm-01
reconcileID: 3f8d7595-bc7f-4174-97d0-b7b49efbc96d
error: |-
  failed to reconcile HetznerBareMetalHost default/bm-arm-01: action "registering" failed:
  failed to get hardware details: failed to obtain hardware details Nics:
  failed to unmarshal {"name":"eth0","model":"Intel Corporation I350 Gigabit Network Connection (rev 01)}.
  Original ssh output name="eth0" model="Intel Corporation I350 Gigabit Network Connection (rev 01)
  Intel Corporation I350 Gigabit Network Connection (rev 01)" mac="88:88:88:88:88:88" ip="111.111.111.11/26" speedMbps="1000"
  name="eth0" model="Intel Corporation I350 Gigabit Network Connection (rev 01)
  Intel Corporation I350 Gigabit Network Connection (rev 01)" mac="88:88:88:88:88:88" ip="2a01:2a01:2a01:2a01::2/64"
  speedMbps="1000": unexpected end of JSON input
stacktrace: |-
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      /src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      /src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      /src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
Is there any work in progress regarding this issue? If not, would you accept a pull request?
@benedikt-bartscher yes, a PR is welcome. BTW, do you have an idea how to reproduce this error? Is there a way to create a second (fake) network interface somehow? Then I could validate your PR manually.
Hey @guettli, thanks for your response. You could alias lspci to echo some test data. I am not aware of any other trick which results in a "fake" NIC appearing in lspci/ethtool. Aren't your e2e tests sponsored by Hetzner? Maybe they can provide you with a server with 2 NICs. If not, I could provide you with one of our machines for some coding/testing for free.
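The alias idea could be sketched as a shell function that shadows lspci in the current shell. The two device lines are made-up test data modeled on the model strings quoted elsewhere in this thread, and the PCI slot addresses are assumptions.

```shell
# Shadow lspci with a shell function that prints canned test data.
# The slot addresses (01:00.0, 01:00.1) are invented for illustration.
lspci() {
    printf '%s\n' \
        '01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)' \
        '01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)'
}

# Model extraction pipeline in the style quoted in this thread
# (whitespace class written portably).
models=$(lspci | grep net | awk '{$1=$2=$3=""; print $0}' | sed 's/^[[:space:]]*//')

# Count the lines: anything above 1 reproduces the multi-NIC situation.
model_lines=$(printf '%s\n' "$models" | grep -c '')
echo "model lines: $model_lines"
```

The function only affects the shell in which it is defined; unset -f lspci restores the real command.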
They will, no problem. If you can open a PR, we will be able to test it as well!
Hello guys.
Unfortunately, we encountered the same issue while deploying a Kubernetes cluster on baremetal servers from Hetzner with the cluster-api-provider-hetzner.
We have a server of type AX41-NVMe with a single network interface, and the technical details of the server are successfully obtained, and the subsequent bootstrap completes successfully.
However, we also have different servers of types EX130-R/EX130-S, which have two network interfaces:
root@rescue ~ # lspci | grep net | awk '{$1=$2=$3=""; print $0}' | sed "s/^[ \t]*//"
Intel Corporation Ethernet Controller X550 (rev 01)
Intel Corporation Ethernet Controller X550 (rev 01)
Similar to the example from @alexkasatikov, we have logs from caph-controller-manager:
{
"level": "ERROR",
"time": "2024-03-10T19:02:52.056Z",
"file": "controller/controller.go:329",
"message": "Reconciler error",
"controller": "hetznerbaremetalhost",
"controllerGroup": "infrastructure.cluster.x-k8s.io",
"controllerKind": "HetznerBareMetalHost",
"HetznerBareMetalHost": {
"name": "infra-dev-02-worker-bm-2332683",
"namespace": "default"
},
"namespace": "default",
"name": "infra-dev-02-worker-bm-2332683",
"reconcileID": "5b2d4c5a-f010-42df-8532-8c1388861c86",
"error": "failed to reconcile HetznerBareMetalHost default/infra-dev-02-worker-bm-2332683: action
\"registering\" failed: failed to get hardware details: failed to obtain hardware details Nics: failed to
unmarshal {\"name\":\"eth0\",\"model\":\"Intel Corporation Ethernet Controller X550 (rev 01)}. Original ssh
output name=\"eth0\" model=\"Intel Corporation Ethernet Controller X550 (rev 01)\\nIntel Corporation
Ethernet Controller X550 (rev 01)\" mac=\"a8:a1:59:fb:c4:db\" ip=\"37.27.63.175/26\"
speedMbps=\"1000\"\\nname=\"eth0\" model=\"Intel Corporation Ethernet Controller X550 (rev 01)\\nIntel
Corporation Ethernet Controller X550 (rev 01)\" mac=\"a8:a1:59:fb:c4:db\" ip=\"2a01:4f9:3081:310e::2/64\"
speedMbps=\"1000\": unexpected end of JSON input",
"stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.
(*Controller).reconcileHandler\\n\\tsigs.k8s.io/controller-
runtime@v0.16.3/pkg/internal/controller/controller.go:329\\nsigs.k8s.io/controller-
runtime/pkg/internal/controller.(*Controller).processNextWorkItem\\n\\tsigs.k8s.io/controller-
runtime@v0.16.3/pkg/internal/controller/controller.go:266\\nsigs.k8s.io/controller-
runtime/pkg/internal/controller.(*Controller).Start.func2.2\\n\\tsigs.k8s.io/controller-
runtime@v0.16.3/pkg/internal/controller/controller.go:227"
}
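The error text above is consistent with the two-line model string splitting each key=value record across two lines, so that a line-by-line consumer sees a first line whose model value is never closed. A small sketch of that effect, assuming the record layout shown in the "Original ssh output" part of the log. The MAC and IP are placeholders, and how caph actually parses the output is not shown in this thread.

```shell
# Two-line model string, as lspci reports it for a host with two identical NICs.
model='Intel Corporation Ethernet Controller X550 (rev 01)
Intel Corporation Ethernet Controller X550 (rev 01)'

# Build one record in the layout seen in the log (placeholder MAC and IP).
record=$(printf 'name="eth0" model="%s" mac="00:00:00:00:00:00" ip="192.0.2.10/26" speedMbps="1000"' "$model")

# A line-oriented reader only sees the first physical line of the record.
first_line=$(printf '%s\n' "$record" | head -n 1)

# Count double quotes on that line: an odd count means a value was left open.
quotes=$(printf '%s' "$first_line" | tr -cd '"' | wc -c)
if [ $((quotes % 2)) -eq 1 ]; then
    echo "unterminated value in: $first_line"
fi
```

The truncated first line matches the fragment reported after "failed to unmarshal" in the log, where the model string also lacks its closing quote.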
This turned out to be a significant issue for us, as our production cluster building process encountered this problem. We would greatly appreciate it if you could find a way to fix this problem.
Environment:
@Lenikas is it possible to schedule a call for further debugging?
@guettli please have a look into this in the upcoming week.
@Lenikas can you please post the output of these commands:
ip a
ethtool "*"
lspci
thank you!
Hello @guettli, thank you for replying!
This is the output from a server of type EX130-R:
If it's important, we use custom server versions with various options. If needed, I can probably provide configuration options.
Hello @batistein, @guettli!
If relevant, we can schedule a meeting. Alternatively, we can suggest transitioning our communication to a different platform if it's more convenient for you. Additionally, we can grant you SSH access to the server for debugging purposes.
How long do you think it might take to resolve the issue? It's important for our team to understand this to plan our next steps. Unfortunately, our team lacks sufficient expertise in Go to quickly resolve this issue.
If you need any further information, we're ready to provide it.
Thank you!
@Lenikas please send me an email at info@syself.com
@Lenikas we created a draft which should make the error go away.
Do you need the NIC data which gets gathered by the script? Because at the moment the script nic-info.sh does not work reliably. But I guess you don't need these values, and you just want the provisioning to succeed.
@guettli Yes, at the moment, we simply need a fix to ensure that provisioning completes successfully.
However, we are unsure where this information may be needed in the future. Perhaps you have some ideas, or is it related to some functionality of the cluster-api-provider-hetzner?
Thank you for the responsive communication!
@Lenikas the PR is merged, you can test the new caph image by updating the caph deployment in your management cluster.
Image: ghcr.io/syself/caph-staging:sha-c6fd5bb
Please tell us if this works for you. Thank you.
@Lenikas we just released a new version of caph. It should now be usable with clusterctl as well.
@guettli Hello, I apologize for the delayed response.
Yes, I have checked the built image, it works. The provisioning completes successfully, and the nodes are added to the cluster.
Thank you so much!
/kind bug
What steps did you take and what happened: I'm trying to set up a k8s cluster with only one node using the hetzner-baremetal-control-planes flavor. After generating the cluster and adding a HetznerBareMetalHost, I don't see any details about the host hardware when running kubectl describe hetznerbaremetalhost. Here is the log from caph-controller-manager:
{ "level": "ERROR", "time": "2023-09-21T11:18:53.496Z", "file": "controller/controller.go:324", "message": "Reconciler error", "controller": "hetznerbaremetalhost", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HetznerBareMetalHost", "HetznerBareMetalHost": { "name": "de1459", "namespace": "de-dev" }, "namespace": "de-dev", "name": "de1459", "reconcileID": "9283fd2c-9da9-4274-aaae-ffbea85dbf64", "error": "failed to reconcile HetznerBareMetalHost de-dev/de1459: action \"registering\" failed: failed to get hardware details: failed to obtain hardware details Nics: failed to unmarshal {\"name\":\"eth0\",\"model\":\"Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)}. Original ssh output name=\"eth0\" model=\"Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)\nIntel Corporation I350 Gigabit Network Connection (rev 01)\" mac=\"f0:2f:74:94:a2:41\" ip=\"162.55.151.48/26\" speedMbps=\"1000\"\nname=\"eth0\" model=\"Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)\nIntel Corporation I350 Gigabit Network Connection (rev 01)\" mac=\"f0:2f:74:94:a2:41\" ip=\"2a01:4f8:262:265f::2/64\" speedMbps=\"1000\": unexpected end of JSON input", "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/src/cluster-api-provider-hetzner/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226" }
What did you expect to happen: Reconciliation completed successfully
Anything else you would like to add: I assume that's due to this line: https://github.com/syself/cluster-api-provider-hetzner/blob/v1.0.0-beta.22/pkg/services/baremetal/client/ssh/ssh_client.go#L144 When executed on the host, it returns 2 lines:
and the script output is like that:
Environment:
OS (from /etc/os-release): debian 12