tinkerbell / hook

In-memory Operating System Installation Environment for Executing Tinkerbell Workflows
Apache License 2.0

Intel I225-LM not detected (old kernel issue most likely) #90

Closed jmpolom closed 2 years ago

jmpolom commented 3 years ago

Expected Behaviour

Hook detects and loads the igc module on systems with Intel I225-LM NICs present.

Current Behaviour

Hook boots but seemingly fails to detect Intel I225-LM devices and the igc module does not get loaded. This results in no network connectivity if a system is connected to the provisioning network via an interface with this chipset.
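One way to confirm this symptom from a shell inside the booted environment is to look in sysfs for the NIC and see whether any driver is bound to it. The following is a minimal sketch, not something taken from this thread: the function name `find_i225` is hypothetical, and the PCI device IDs (0x15f2 for I225-LM, 0x15f3 for I225-V) are assumptions that should be checked against `lspci -nn` on the affected machine.

```python
import os
from pathlib import Path

INTEL_VENDOR = "0x8086"
# Assumed PCI device IDs for I225-LM and I225-V; verify with `lspci -nn`.
I225_DEVICE_IDS = {"0x15f2", "0x15f3"}


def find_i225(sysfs_root="/sys/bus/pci/devices"):
    """Return a list of (pci_address, driver_name_or_None) for I225 devices."""
    results = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return results
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            device = (dev / "device").read_text().strip().lower()
        except OSError:
            # Entry without readable ID files; skip it.
            continue
        if vendor != INTEL_VENDOR or device not in I225_DEVICE_IDS:
            continue
        # A bound device has a "driver" symlink pointing at the driver directory.
        driver_link = dev / "driver"
        driver = None
        if driver_link.is_symlink():
            driver = os.path.basename(os.readlink(driver_link))
        results.append((dev.name, driver))
    return results
```

A result such as `("0000:02:00.0", None)` would mean the device is visible on the PCI bus but no driver has claimed it, which matches the behaviour described above.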

Possible Solution

Update the kernel. Hook uses 5.10.57, which is many releases behind the upstream 5.10 LTS branch (currently at 5.10.78) and is likely the cause here. Ideally, hook should move to mainline releases so that support for new hardware is picked up as soon as it lands upstream.
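To put a number on "many releases behind", the gap between two point releases on the same branch can be computed directly. This is a small illustrative sketch; the helper names are hypothetical and it assumes plain `major.minor.patch` version strings.

```python
def parse_kernel_version(version):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def point_releases_behind(current, latest):
    """Return how many point releases `current` trails `latest` on one branch."""
    cur = parse_kernel_version(current)
    lat = parse_kernel_version(latest)
    if cur[:2] != lat[:2]:
        raise ValueError("versions are on different stable branches")
    return lat[2] - cur[2]


print(point_releases_behind("5.10.57", "5.10.78"))  # prints 21
```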

Steps to Reproduce (for bugs)

  1. Obtain a system with an Intel I225 interface. We tested the Minisforum HX90. Configure it so PXE is enabled.
  2. Set up the tinkerbell services and connect the device with the I225 adapter to a network that can reach them.
  3. Create a hardware profile and workflow for the device.
  4. Power on the device.
  5. The system will chainload iPXE from PXE, and hook will load via iPXE.
  6. Hook will begin to boot but eventually stall: no module is loaded for the I225 NIC, so there is no network connectivity.

Context

We intend to use tinkerbell to deploy many client devices that unfortunately have only a single I225-LM NIC. We can sidestep this issue by not using hook and doing automated OS installs instead, but that will be slower than using hook to deploy disk images. Ideally, hook should function on modern hardware.

To be sure, both Fedora 34 and Debian 11 were also tested on the system with an I225-LM NIC, and both detected the NIC and loaded the igc module. Whatever prevents hook from supporting this device seems to have been fixed in later LTS kernels and in mainline.

Questions

Is there a specific reason hook has been held back to such a dated LTS kernel release? This definitely is going to hamper support of new hardware.

I noticed some patches so I could see those requiring some work to validate against or port to a newer kernel. I could see time being a constraint here. Maintaining a kernel build is certainly not a zero time commitment.

Your Environment

cc: @jkl92 @storrgie

thebsdbox commented 3 years ago

It looks like the module is actually included; this is an LTS kernel that has most general drivers included. We can look at compiling the module directly into the kernel. 5.10.x has just been superseded by 5.15.x as the LTS branch.
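Whether a driver is built as a loadable module or compiled into the kernel is recorded in the kernel `.config`. The sketch below reads that state for a given symbol. `CONFIG_IGC` is the upstream Kconfig symbol for the igc driver; the function name `kconfig_state` and the sample config text are illustrative only and are not copied from hook's actual configuration.

```python
def kconfig_state(config_text, symbol):
    """Return the value of a Kconfig symbol ('y', 'm', ...) or 'n' if unset."""
    name = symbol if symbol.startswith("CONFIG_") else "CONFIG_" + symbol
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "# " + name + " is not set":
            return "n"
        # Match "CONFIG_IGC=" exactly so CONFIG_IGC_FOO is not picked up.
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return "n"


# Illustrative fragment, not hook's real config.
sample = """\
CONFIG_IGB=y
CONFIG_IGC=m
# CONFIG_IXGBE is not set
"""
print(kconfig_state(sample, "IGC"))  # prints m
```

A value of `m` means the driver exists only as a separate `igc.ko` file that must be present in the booted filesystem and loaded; `y` means it is part of the kernel image itself.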

jmpolom commented 3 years ago

> It looks like the module is actually included, this is an LTS kernel that has most general drivers included.

I did notice that it was selected as a module in the current 5.10 config. Are all drivers that are built as modules in the kernel config included in the initrd for hook? Or is it like building an initrd for Fedora or Debian, where modules that need to be in the initrd must be specified explicitly?

I haven't decompressed the initrd yet to have a look, but I'm now thinking it would be worth doing.

My gut feeling, though, is that the issue is caused by upstream changes that either landed after 5.10.57 or had not yet been backported to .57. I read in a number of places that certain variants of the I225-LM have different firmware that caused problems for drivers detecting the devices. I think updating kernel versions is the prudent thing to do here, especially given that 5.15 is the new LTS.
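A quick way to look is to list the file names in the initrd and search for the module. The following is a rough sketch under a loud assumption: it treats the initrd as a single gzip-compressed cpio archive in "newc" format, which is not confirmed anywhere in this thread for hook. The function names are hypothetical.

```python
import gzip

HEADER_LEN = 110  # 6-byte magic plus 13 eight-digit hex fields


def _align4(n):
    """Round n up to the next multiple of 4 (newc pads names and data)."""
    return (n + 3) & ~3


def list_newc_names(data):
    """Yield entry names from one uncompressed newc cpio archive."""
    offset = 0
    while offset + HEADER_LEN <= len(data):
        header = data[offset:offset + HEADER_LEN]
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError("not a newc cpio header at offset %d" % offset)
        filesize = int(header[54:62], 16)
        namesize = int(header[94:102], 16)  # includes the trailing NUL
        name_start = offset + HEADER_LEN
        raw_name = data[name_start:name_start + namesize - 1]
        name = raw_name.decode("utf-8", "replace")
        if name == "TRAILER!!!":
            return
        yield name
        data_start = _align4(name_start + namesize)
        offset = _align4(data_start + filesize)


def find_module(initrd_path, module="igc"):
    """Return archive paths whose file name looks like <module>.ko[.gz|.xz|.zst]."""
    with gzip.open(initrd_path, "rb") as fh:
        data = fh.read()
    wanted = {module + ext for ext in (".ko", ".ko.gz", ".ko.xz", ".ko.zst")}
    return [n for n in list_newc_names(data) if n.rsplit("/", 1)[-1] in wanted]
```

An empty list from `find_module` would suggest the module file is not in the archive at all, while a non-empty list would point towards a detection or driver problem rather than a packaging one.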

Are there plans to jump onto 5.15 for hook? Is there any particular complexity to bumping kernel versions in hook or is it mostly a matter of creating a config compatible with the new version?

tstromberg commented 2 years ago

AFAIK, no one has made an explicit plan to upgrade to 5.15, but in general we want to keep up to date with things. @thebsdbox - any thoughts on it?

I do note that 5.15 has an EOL of Oct 2023 - versus 5.10 having an EOL of Dec 2026.

jmpolom commented 2 years ago

An EOL of Oct 2023 seems like it should be sufficient? I feel kernel versions should be upgraded every 1-2 years or so to ensure hook stays compatible with the latest hardware.

If there isn't a plan to update to 5.15, were there any plans to bump point releases on the 5.10 kernel line? Right now 5.10.79 is the latest in that series which is a fair bit more up to date than where things are now.

jmpolom commented 2 years ago

@tstromberg is there anything we can do to move this along? I don't think we've tested bumping the kernel version to the latest 5.10.83 kernel but if that would be helpful please let me know.

Raj-Dharwadkar commented 2 years ago

@thebsdbox @tstromberg Is there anything that can be done here to proceed?

jmpolom commented 2 years ago

I can confirm that updating the kernel to 5.10.85 enables operation of the Intel I225-LM NIC and tinkerbell workflows can succeed using hook. PR sent.

In the interest of getting a fix out quickly, the hook kernel should be updated to the latest 5.10.x kernel to close this issue. Long term, we need to figure out a path forward to move to 5.15 which is the latest LTS kernel and also discuss how to keep up with upstream kernel releases.

GurubasavarajuMN1 commented 1 year ago

This issue is not yet fixed. I still get this issue with 5.10.85.

jmpolom commented 1 year ago

The tinkerbell project needs to do a better job of keeping the kernels in hook up to date. The kernel I submitted a PR for was recent 2 years ago but is now quite outdated, even within the 5.10 LTS kernel series. My organization is no longer looking to use the tinkerbell tooling, so I don't have a need to work on this any longer.

There's enough info in the associated PR to figure out how to update the kernel to the latest point release for 5.10. You might consider repeating the exercise I undertook here as a quick fix. Long term hook needs to move on to a more recent LTS kernel. There have since been 2 new LTS kernel lines.