tinkerbell / hook

In-memory Operating System Installation Environment for Executing Tinkerbell Workflows
Apache License 2.0

HookOS stuck during EKSA bare metal cluster creation with Dell PowerEdge XE8640 #243

Closed ygao-armada closed 1 month ago

ygao-armada commented 1 month ago

I am trying to create an EKSA bare metal cluster on a Dell PowerEdge XE8640, and the boots logs get stuck at this point:

{"level":"info","ts":1728196035.7770376,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: Next server: 10.10.0.100\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.7770746,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: Filename: http://10.10.0.100/auto.ipxe\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.7771587,"caller":"job/job.go:145","msg":"discovering from ip","service":"github.com/tinkerbell/boots","ip":"10.10.0.108"}
{"level":"info","ts":1728196035.777499,"caller":"httplog/httplog.go:37","msg":"","service":"github.com/tinkerbell/boots","pkg":"http","event":"ss","method":"GET","uri":"/auto.ipxe","client":"10.10.0.108","duration":0.000492178,"status":200}
{"level":"info","ts":1728196035.783313,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: http://10.10.0.100/auto.ipxe... ok\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.8207734,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: auto.ipxe : 1082 bytes [script]\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.8364809,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: Tinkerbell Boots iPXE\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.8547955,"caller":"job/job.go:145","msg":"discovering from ip","service":"github.com/tinkerbell/boots","ip":"10.10.0.108"}
{"level":"info","ts":1728196035.854997,"caller":"httplog/httplog.go:37","msg":"","service":"github.com/tinkerbell/boots","pkg":"http","event":"ss","method":"POST","uri":"/phone-home","client":"10.10.0.108","duration":0.000282288,"status":200}
{"level":"info","ts":1728196035.861805,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: http://10.10.0.100/phone-home... ok\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196035.937471,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: https://anywhere-assets.eks.amazonaws.com/releases/bundles/59/artifacts/hook/9d54933a03f2f4c06322969b06caa18702d17f66/vmlinuz-x\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196037.657374,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: 86_64... ok\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196038.013281,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: https://anywhere-assets.eks.amazonaws.com/releases/bundles/59/artifacts/hook/9d54933a03f2f4c06322969b06caa18702d17f66/initramfs\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}
{"level":"info","ts":1728196052.4838269,"caller":"syslog/receiver.go:113","msg":"host=10.10.0.108 facility=kern severity=INFO app-name=eksa-cp01 msg=\" ipxe: -x86_64... ok\"","service":"github.com/tinkerbell/boots","pkg":"syslog"}

In the iDRAC Virtual Console, the last messages are:

4933a03f2f...d17f66/initramfs-x86_64... ok
EFI stub: Loaded initrd from command line option

jacobweinstock commented 1 month ago

Hey @ygao-armada, I believe we resolved this via Slack. If not, please do reopen. The resolution was to enable the NIC drivers in the kernel menuconfig: CONFIG_MLX5_CORE (https://www.kernelconfig.io/config_mlx5_core?arch=x86&kernelversion=6.6.54) and CONFIG_NET_VENDOR_NVIDIA (https://www.kernelconfig.io/config_net_vendor_nvidia?arch=x86&kernelversion=6.6.54).
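
For anyone following along, one way to confirm that the two options linked above actually ended up enabled is to scan the kernel `.config` produced by menuconfig. This is only a rough sketch: the helper names and the sample text are made up for illustration, and it assumes the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` layout of a `.config` file.

```python
import re

# The two options named in the kernelconfig.io links above.
REQUIRED = ("CONFIG_MLX5_CORE", "CONFIG_NET_VENDOR_NVIDIA")


def enabled_options(config_text):
    """Return {symbol: value} for every option set to y (built in) or m (module)."""
    pattern = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=([ym])$", re.MULTILINE)
    return dict(pattern.findall(config_text))


def missing_options(config_text, required=REQUIRED):
    """Return the required symbols that are not enabled in the given .config text."""
    enabled = enabled_options(config_text)
    return [name for name in required if name not in enabled]


# Hypothetical .config excerpt in which the Mellanox core driver is still disabled.
sample = """\
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_MLX5_CORE is not set
"""
print(missing_options(sample))  # prints ['CONFIG_MLX5_CORE']
```

To check a real build, read the `.config` file from the kernel build directory and pass its contents to `missing_options`. An empty list means both options are enabled.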

Thanks!

akshay8043 commented 1 week ago

Hi,

I am facing the same issue. Can you suggest how this was done?

We have Intel E810 NICs on our worker nodes, and we are seeing the same problem.

The control plane node, which has a Broadcom NIC, deploys fine, but the worker node with the Intel NIC isn't working.

ygao-armada commented 1 week ago

You need to build a custom HookOS with the drivers for your devices included (in my case, the ones Jacob mentioned above), following: https://anywhere.eks.amazonaws.com/docs/getting-started/baremetal/customize/bare-custom-hookos/

akshay8043 commented 6 days ago

@ygao-armada Let me have a look at that.

Out of curiosity, though, I believe you got a different resolution. Quoting Jacob: _I believe we resolved this via Slack. If not please do reopen. The resolution was to enable NIC drivers in the kernel menuconfig. This one, https://www.kernelconfig.io/config_mlx5_core?arch=x86&kernelversion=6.6.54 and this one, https://www.kernelconfig.io/config_net_vendor_nvidia?arch=x86&kernelversion=6.6.54._

If our issues are the same, could you let me know why your solution differs from the one suggested to me? I would like to learn more about the potential issues that might occur on our systems.

ygao-armada commented 6 days ago

@akshay8043 Oh, thanks for the help, especially from @jacobweinstock. In the custom HookOS you may need to enable more options. Here is my list:

- Mellanox 5th generation network adapters (ConnectX series) core driver
- Mellanox Technologies Innova support
- Mellanox 5th generation network adapters (ConnectX series) Ethernet support
- Mellanox 5th generation network adapters (connectX series) IPoIB offloads support
- Mellanox Technologies subfunction device support using auxiliary device
- NVIDIA devices
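
The entries above are menuconfig prompts, not symbol names. The sketch below pairs each prompt with the Kconfig symbol I would expect it to correspond to in a 6.6-series kernel and prints the lines you would then look for in the resulting `.config`. The mapping is an assumption on my part and was not given in this thread, so please verify each symbol against your own kernel tree. `render_fragment` is a hypothetical helper.

```python
# Assumed mapping from the menuconfig prompts listed above to Kconfig symbols.
# Verify against the Kconfig files of the kernel you are actually building.
PROMPT_TO_SYMBOL = {
    "Mellanox 5th generation network adapters (ConnectX series) core driver":
        "CONFIG_MLX5_CORE",
    "Mellanox Technologies Innova support":
        "CONFIG_MLX5_FPGA",
    "Mellanox 5th generation network adapters (ConnectX series) Ethernet support":
        "CONFIG_MLX5_CORE_EN",
    "Mellanox 5th generation network adapters (connectX series) IPoIB offloads support":
        "CONFIG_MLX5_CORE_IPOIB",
    "Mellanox Technologies subfunction device support using auxiliary device":
        "CONFIG_MLX5_SF",
    "NVIDIA devices":
        "CONFIG_NET_VENDOR_NVIDIA",
}


def render_fragment(symbols):
    """Render one 'SYMBOL=y' line per symbol, in the order given."""
    return "".join(f"{symbol}=y\n" for symbol in symbols)


# Print the expected .config lines for the whole list.
print(render_fragment(PROMPT_TO_SYMBOL.values()), end="")
```

Everything is rendered as built in (`=y`) purely to keep the sketch simple. Whether a loadable module would also work in HookOS is not something this thread settles.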

ygao-armada commented 6 days ago

It is already working for me.