ipxe / ipxe

iPXE network bootloader
https://ipxe.org

Issue with 25 gigabit Intel 810 seemingly caused by cad1cc6 (100 gigabit driver) #1115

Open nshalman opened 10 months ago

nshalman commented 10 months ago

I have lots of reports of weird behavior from machines with Intel E810 NICs. (Specifically, the machine where I did my testing reports its NIC as "Intel Ethernet Controller E810-XXV for SFP".)

I have bisected the issue down to cad1cc6 ("[intelxl] Add driver for Intel 100 Gigabit Ethernet NICs")

Notably, these are 25 gigabit cards, not 100 gigabit.

❯ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [9062544f6a0c69c249b90d21a08d05518aafc2ec] [efi] Disable EFI watchdog timer when shutting down to boot an OS
git bisect good 9062544f6a0c69c249b90d21a08d05518aafc2ec
# status: waiting for bad commit, 1 good commit known
# bad: [fa62213231a882eb6bbcefa7ad1106bdb9aaeae2] [smbios] Support scanning for the 64-bit SMBIOS3 entry point
git bisect bad fa62213231a882eb6bbcefa7ad1106bdb9aaeae2
# bad: [68734b9a4dafa540e5333d7af3849b59a10f7a93] [efi] Bind to only the topmost instance of the SNP or NII protocols
git bisect bad 68734b9a4dafa540e5333d7af3849b59a10f7a93
# bad: [856ffe000e79a1af24ea11301447dd70b8d54ac2] [ena] Limit submission queue fill level to completion queue size
git bisect bad 856ffe000e79a1af24ea11301447dd70b8d54ac2
# good: [7e9631b60fdcb02f05a80983ca68c10f26e4ab33] [utf8] Add UTF-8 accumulation self-tests
git bisect good 7e9631b60fdcb02f05a80983ca68c10f26e4ab33
# good: [1b61c2118ca54a8d9ad71cc402e7c9f6094f4ec6] [intelxl] Fix invocation of intelxlvf_admin_queues()
git bisect good 1b61c2118ca54a8d9ad71cc402e7c9f6094f4ec6
# good: [99242bbe2ead2d36eff65aefc2251e822cc4b2c6] [intelxl] Always issue "clear PXE mode" admin queue command
git bisect good 99242bbe2ead2d36eff65aefc2251e822cc4b2c6
# bad: [cad1cc6b449b63415ffdad8e12f13df4256106fb] [intelxl] Add driver for Intel 100 Gigabit Ethernet NICs
git bisect bad cad1cc6b449b63415ffdad8e12f13df4256106fb
# good: [06467ee70fd4750ecd2ae324f66055ff261cb713] [intelxl] Defer fetching MAC address until after opening admin queue
git bisect good 06467ee70fd4750ecd2ae324f66055ff261cb713
# good: [6871a7de705b6f6a4046f0d19da9bcd689c3bc8e] [intelxl] Use admin queue to set port MAC address and maximum frame size
git bisect good 6871a7de705b6f6a4046f0d19da9bcd689c3bc8e
# first bad commit: [cad1cc6b449b63415ffdad8e12f13df4256106fb] [intelxl] Add driver for Intel 100 Gigabit Ethernet NICs

Please let me know what additional debugging information I can collect to further diagnose this issue.

nshalman commented 10 months ago

Reverting that commit (specifically as performed in https://github.com/nshalman/ipxe/commit/841d1cda0072e4259d09e36ac4df1fc0914cdb7a ) does seem to alleviate the issue in my initial testing.

NiKiZe commented 10 months ago

What if you drop the PCI ID of your NIC from the sources? And do you have any examples of the kind of behaviour you are seeing?

Which build target are you using? I would assume EFI and ipxe.efi? If that is the case, have you tried the snponly.efi or snp.efi binaries?
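For anyone wanting to try the alternative binaries mentioned above, a minimal sketch of the build invocations follows. It assumes the commands are run from the iPXE `src/` directory and that the platform is x86_64 EFI (the architecture is an assumption here, matching the build command posted later in this thread). The `build_cmds` helper and the `RUN` variable are hypothetical names used only for illustration; by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: list (or run) builds of the EFI targets discussed above.
# Assumption: current directory is the iPXE src/ directory, x86_64 EFI platform.

# RUN defaults to "echo" so the commands are only printed.
# Invoke with RUN= (set but empty) to actually run make.
RUN=${RUN-echo}

build_cmds() {
    # ipxe.efi    : all native iPXE drivers
    # snponly.efi : uses only the firmware SNP/NII driver of the boot device
    # snp.efi     : uses the firmware SNP/NII drivers of all devices
    for target in ipxe.efi snponly.efi snp.efi; do
        $RUN make "bin-x86_64-efi/${target}"
    done
}

build_cmds
```

If snponly.efi behaves correctly where ipxe.efi does not, that would point at the native driver rather than at the network setup.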

foyerunix commented 3 months ago

Hello,

In our use case we use iPXE to boot servers with Intel E810 25Gb SFP NICs connected to switches with LACP enabled. Our switches don't support any LACP fallback mode, so if iPXE doesn't establish the LACP session, the install fails.

With the current iPXE code we cannot establish an LACP session, so the install fails. If we disable LACP on the switches, the install completes as expected.

I can confirm that with the following line commented out, the installation completes with LACP enabled:

https://github.com/ipxe/ipxe/blob/d2d194bc60f012569fa95ed54693cb6663beb5ce/src/drivers/net/ice.c#L963

Best Regards.

redat00 commented 2 weeks ago

Hi all!

We also faced issues with the exact same card (E810-XXV-SFP), connected to a 25G link. We can confirm that removing the commit in question (by removing the code) and rebuilding solves the issue. I suppose that removing the card from the ice.c mapping (based on the PCI ID) would also solve it.

The issue we encounter is the following:

  1. Our network card is able to complete DHCP and retrieve ipxe.efi over TFTP.
  2. Then, once iPXE starts, if we let it perform DHCP by itself (running the dhcp command in the embedded script), the network card gets completely disconnected from our network. Running tcpdump on the DHCP server, we only see a DHCPDISCOVER and a DHCPOFFER. Even the BMC, which is bridged over the network card's interface, gets disconnected.
  3. After a few seconds, it comes back online, but of course reports that DHCP was unsuccessful.

We're building iPXE using the following command:

make EMBED=script.ipxe bin-x86_64-efi/ipxe.efi -j 8
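The contents of script.ipxe were not posted, so the following is only a sketch of what a minimal embedded script for reproducing the steps above might look like. Everything in it beyond the `dhcp` command is an assumption: the `ifstat` calls and the fall-through to the iPXE shell are added purely to make the link state visible on the console.

```shell
#!/bin/sh
# Sketch only: write a hypothetical script.ipxe for reproducing the DHCP failure.
# The real script used in this report was not shown; this content is assumed.
cat > script.ipxe <<'EOF'
#!ipxe
# Show interface and link state before attempting DHCP
ifstat
# Run DHCP from iPXE itself; print a message instead of exiting on failure
dhcp || echo DHCP failed
# Show interface state again so a link drop is visible
ifstat
# Drop to the iPXE shell for manual inspection
shell
EOF

# Then, from the iPXE src/ directory, build as in the command above:
#   make EMBED=script.ipxe bin-x86_64-efi/ipxe.efi -j 8
```

Dropping to the shell rather than exiting keeps iPXE running after the failure, so the interface state can be inspected once the link comes back.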

If you need any more information, just let us know, we'll be more than happy to assist.