timberland-sig / dracut

dracut the event driven initramfs infrastructure
https://github.com/dracutdevs/dracut/wiki
GNU General Public License v2.0

fix(nvmf): set netroot=nbft #10

Closed mwilck closed 1 year ago

mwilck commented 1 year ago

The logic added in 9b9dd99 ("35network-legacy: only skip waiting for interfaces if netroot is set") will cause all NBFT interfaces to be waited for unless the "netroot" shell variable is set. Avoid this by setting "netroot=nbft": this will cause the boot to proceed even if NBFT interfaces are missing, as long as the initrd root file system has been found.

This requires installing a netroot handler /sbin/nbftroot, which will be called by the networking scripts via /sbin/netroot when the interface has been brought up. Create a simple nbftroot script that just calls nvmf-autoconnect.sh, as sketched below. With this installed, we can skip calling nvmf-autoconnect.sh from the "online" initqueue hook.
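For illustration, a minimal sketch of such a handler, assuming dracut's usual netroot handler calling convention (handler <netif> <netroot> <NEWROOT>); the actual script may differ in details:

```sh
#!/bin/sh
# Minimal sketch of /sbin/nbftroot (illustrative, not necessarily the
# verbatim PR contents). /sbin/netroot invokes the handler once an
# interface has been brought up; $1 is the interface, $2 the netroot=
# value, $3 the new root mount point.
[ "$#" = 3 ] || exit 1

# Delegate all NVMe-oF connection setup to the nvmf module's existing
# script, which previously ran from the "online" initqueue hook.
exec /sbin/nvmf-autoconnect.sh
```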

Fixes #9, but only for the network-legacy networking backend.

I think that with the network-manager backend, the issue doesn't exist in the first place.

mwilck commented 1 year ago

As discussed on the last Timberland meeting, I double-checked the network-manager backend, too, and updated the PR description.

Elaborating some more, NM doesn't use finished initqueue scripts for individual interfaces at all. Rather, it uses nm-wait-online-initrd.service, which calls nm-online -s -q -t 3600. I don't understand the semantics of this tool exactly, but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires". Reporting of an "active connection" depends on the autoconnect and ipv4.may-fail and ipv6.may-fail settings (and perhaps more, again I don't fully understand it) of the configured connections [^1], but unless I am mistaken, NM will signal an "active connection" (and thus, nm-online will return success) as soon as one network connection becomes active [^2].

Therefore I think the "problem" that inactive interfaces will be waited for in the "NVMe/TCP multipath" case does not exist with NM. The second problem described in #9 (second interface not up after boot) might very well exist, too.

@johnmeneghini, @tbzatek: could you discuss this with NM experts for confirmation?

The netroot parameter is used by NM, and thus I think this PR won't cause a regression.

[^1]: The connections and their settings are generated by the nm-initrd-generator tool.

[^2]: FIXME: does routing play a role here? Would NM look for a route to the public internet, like the infamous "connectivity check" known from the desktop?

thom311 commented 1 year ago

I am not familiar with this topic, so I cannot give a qualified review.

Only a comment about NetworkManager...

> but the man page says "nm-online waits until NetworkManager reports an active connection, or specified timeout expires".

This quote from man nm-online is mainly about how the tool behaves when called without --wait-for-startup, which isn't relevant here. The nm-online tool is almost not useful on its own (the manual even says that). The relevant part is that it's called as an implementation detail by NetworkManager-wait-online.service (in the real root) and nm-wait-online-initrd.service (in the initrd).

man NetworkManager-wait-online.service (here) better explains how this is supposed to work.

> NM will signal an "active connection" (and thus, nm-online will return success) as soon as one network connection becomes active

NetworkManager-wait-online.service (and nm-wait-online-initrd.service and nm-online -s) will wait until NetworkManager indicates that it is done configuring the network. You can affect that via various means (listed in the manual page), but among others, it will wait until all interfaces that are supposed to be configured, are configured. That is, as long as you see devices in "activating"/"connecting" state in nmcli device, NetworkManager is not yet done configuring the network and the tools still wait for online.

bengal commented 1 year ago

The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service. In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.
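For reference, this ordering boils down to something like the following (a schematic sketch, not the verbatim unit file shipped by NetworkManager):

```ini
# nm-wait-online-initrd.service (schematic sketch)
[Unit]
DefaultDependencies=no
# Hold back the initqueue until NM reports it is done configuring:
Before=dracut-initqueue.service

[Service]
Type=oneshot
# -s: wait for NM startup to complete, -q: quiet, -t: timeout in seconds
ExecStart=/usr/bin/nm-online -s -q -t 3600
```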

I'm not sure how the logic implemented in 9b9dd9993e645f97c176f1db707d41062135cf8e is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

mwilck commented 1 year ago

@thom311, @bengal, thanks for your comments.

> The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

> In this way, the dracut initqueue (which runs nm-run.sh, and basically executes the online and netroot hooks) starts only after all interfaces that need configuration are activated or failed to activate. The interfaces that need configuration are the ones for which nm-initrd-generator created a profile from the command line.

For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?
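For illustration (all values invented), the flow would be something like:

```sh
# Illustration with invented values: the nvmf module emits a static
# ip= argument for an NBFT HFI record, e.g.
#   ip=192.0.2.10::192.0.2.1:255.255.255.0::eth0:none
# and in the initrd, nm-initrd-generator turns the kernel command line
# into NM connection profiles before NetworkManager starts:
/usr/libexec/nm-initrd-generator \
    -c /run/NetworkManager/system-connections \
    -- ip=192.0.2.10::192.0.2.1:255.255.255.0::eth0:none
```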

> I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.

I guess someone needs to just test this. @johnmeneghini, can you do this with the rh-poc?

The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS[^fs] are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.
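(Schematically, the legacy module can do this because the initqueue main loop keeps polling its "finished" hooks and exits as soon as all of them succeed; dracut's wait_for_dev installs essentially a test like the illustrative hook below for the root device.)

```sh
#!/bin/sh
# Schematic initqueue/finished hook (illustrative only): report success
# once the root device is present, letting the main loop stop waiting
# for any remaining interfaces. MY-ROOT-UUID is a placeholder.
[ -e /dev/disk/by-uuid/MY-ROOT-UUID ]
```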

I have no idea if, and how, that could be achieved with the dracut networkmanager module.

May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

[^fs]: and other mandatory file systems

mwilck commented 1 year ago

Side note to @thom311: NM will also need support for NBFT-configured interfaces at run time (in the real root FS):

So far we have implemented this "feature set" in the SUSE tool "wicked". For wicked, I've written a shell-script plugin which reads the JSON-formatted HFI information from the NBFT and transforms it into XML that wicked understands. I suppose a similar approach would be possible for NM. NM has been on my todo list, but I haven't had time to actually work on it. I've also repeatedly mentioned in Timberland meetings that this is a necessary puzzle piece to make NVMe boot production-ready for NM-based systems. Some hints, or better yet, someone else looking into this with my support, would be much appreciated.
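To give a rough idea of what the plugin does (a toy sketch; the JSON field names, the input path, and the XML elements are placeholders, not wicked's actual schema):

```sh
#!/bin/sh
# Toy sketch of the JSON-to-XML transformation idea (placeholder field
# names and paths; wicked's real schema and the real HFI dump differ).
jq -r '.hfi[] |
    "<interface>" +
    "<name>\(.iface)</name>" +
    "<ipv4><address>\(.ipaddr)/\(.prefixlen)</address></ipv4>" +
    "</interface>"' /run/nbft-hfi.json
```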

bengal commented 1 year ago

> > The network-manager module works by starting NetworkManager as systemd service, and having a nm-wait-online-initrd service that orders itself Before=dracut-initqueue.service.

> So this differs from the way network-legacy works, where network interface activation / configuration is done as part of the initqueue processing. That won't make it easier for us, unfortunately.

Right.

> For NBFT boot, the nvmf module generates ip=... cmdline arguments which (to my understanding) are converted to NM profiles by nm-initrd-generator. Thus, IIUC NM would wait for each interface before even starting the initqueue. Right?

That's correct.

There is a dracut PR (https://github.com/dracutdevs/dracut/pull/2173) to change this a bit, and run the hooks as soon as each interface is activated; but that doesn't change the fact that the initqueue runs after all interfaces are activated.

> > I'm not sure how the logic implemented in 9b9dd99 is going to work with NM, because the synchronization mechanism used by NM (via nm-wait-online-initrd) doesn't have that shortcut.

> Yeah, it probably won't work this way. OTOH, you said NM waits until all interfaces are "activated or failed to activate". If an interface is unplugged, I suppose NM would wait for some time (probably connection.wait-device-timeout), and set the interface to "failed to activate" afterwards. Which would mean that the initqueue could proceed.

I'm not sure if by "unplugged" you mean with the cable unplugged (i.e. without carrier), or that the device is physically unplugged from the system (i.e. not present at all). In the first case there is a carrier-timeout of 10 seconds, in the second case the timeout for the device to appear is 60 seconds (only when neednet=1 or when the device is the bootdev). After the timeout expires, the initqueue proceeds.
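Both timeouts can be influenced from the kernel command line, e.g. (illustrative values; to my knowledge nm-initrd-generator understands the rd.net.timeout.* dracut options):

```sh
# Illustrative kernel command line additions (invented values): wait up
# to 30 s for carrier and 90 s for DHCP on the boot interface.
#
#   rd.neednet=1 rd.net.timeout.carrier=30 rd.net.timeout.dhcp=90 ip=eth0:dhcp
```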

> The really correct behavior (IMHO) would be to wait for connections and the root FS at the same time, and once all devices necessary to mount the root FS are detected, stop waiting for any other interfaces. This is basically how the legacy module behaves with this PR.

> I have no idea if, and how, that could be achieved with the dracut networkmanager module.

I guess that would require:

> May I ask whether you have discussed this issue in the context of iSCSI/iBFT multipath boot, and whether you have found a solution for that?

I am not aware of any previous discussion about this or similar issues.

mwilck commented 1 year ago

> I'm not sure if by "unplugged" you mean with the cable unplugged

I meant "no carrier", or "down" for whatever other reason (e.g. no IP address obtained from DHCP). No hardware hot-plug discussion here :-)

mwilck commented 1 year ago

> At the moment I don't know how to do that, but there is a way probably.

Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.

> I am not aware of any previous discussion about this or similar issues.

Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

thom311 commented 1 year ago

> it should understand that these interfaces should not be reconfigured or shut down, as they are necessary to access the root FS, however, it must take care of some things, such as DHCP lease renewal,

That is not different from other networking which is set up by NM in the initrd (iBFT). Interestingly, NetworkManager to this day doesn't support something like systemd-networkd's KeepConfiguration= setting (it seems the demand is not high enough for anybody to work on it). In any case, while useful/necessary, it would be orthogonal to an NBFT feature.

mwilck commented 1 year ago

> That is not different from other networking which is set up by NM in the initrd (iBFT)

Right, it is not. But I guess someone needs to code the plugin :-) I'll have a look at NM's iBFT code and see to what extent it can be reused for NBFT support.

thom311 commented 1 year ago

the iBFT code for NetworkManager is here.

mwilck commented 1 year ago

@bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

bengal commented 1 year ago

> Why did you make nm-wait-online-initrd a prerequisite for starting the initqueue in the first place? NM could be started in parallel with the initqueue and use some "finished" initqueue script to signal dracut that network setup is ready. So you must have had some strong reason not to do it that way, and we'd need to understand what it was to avoid regressions.

There might have been other reasons that I don't remember, but I think the main one was to leave the hook invocation in the initqueue, and to use only unit dependencies as a synchronization mechanism to ensure hooks are invoked only after the network is configured. In this way there is no need for custom scripts and everything works similarly to the real root, using the network-online target. This can be revisited if there are issues not solvable with the current approach.

> > I am not aware of any previous discussion about this or similar issues.

> Hm. Strange. iSCSI multipath boot would have exactly the same problem. We have found a solution with network-legacy only quite recently, too. Perhaps people just don't use this technology.

One problem in dracut is that there is no documentation or knowledge about supported use cases, and this makes it difficult to introduce new features or make changes. It would be great if every use case were covered by the test suite (see the test/ directory in the dracut tree). NetworkManager also tests different dracut scenarios in its integration test suite, and it tries to cover most of the known use cases.

bengal commented 1 year ago

> @bengal, @johnmeneghini: acceptance of this PR is currently blocking the upstream dracut PR for timberland. Can we agree to merge this into timberland_final branch now, acknowledging that it may be necessary to apply further changes to the dracut NM module?

This makes sense to me.

johnmeneghini commented 1 year ago

> The almost correct behavior in the multipath case would be to wait forever until at least one interface is up, and once this happens, stop waiting for any other interfaces. The problem with this is that if there are multiple interfaces, you don't know if it's just multipath, or if different devices are accessed via different network / NVMe connections. But I guess we can ignore that for the time being.

This is a policy decision. We can't wait forever. This looks like a hung system. It is better to fail to boot and let the user intervene. The NBFT table has a timeout. This can be used by the user to set the timeout policy. If the user wants to wait forever during boot, they can use this timeout to set the policy.

johnmeneghini commented 1 year ago

I think we are ready to move forward with the upstream dracut pull request. Please go ahead and merge this change and then move forward with the upstream pull request.

mwilck commented 1 year ago

> This is a policy decision. We can't wait forever. This looks like a hung system

dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

mwilck commented 1 year ago

Note: I squashed the changes from this PR into the top commit of the timberland_final branch. I also updated the commit message to reflect the changes made by this PR.

Hash before squash: ac66c00, after squash: f58e1d5

johnmeneghini commented 1 year ago

> > This is a policy decision. We can't wait forever. This looks like a hung system

> dracut's default is to wait forever for the root FS. You can question whether that makes sense, but I don't think we should use a different default.

I've been testing this and I see what you mean. I test things by toggling one or both of my nvme/tcp target port networks up and down on the target machine and then watching how the host reacts. When booting for the first time I see that UEFI will use the programmed timeout from NBFT. After timing out it returns to the Boot Menu. However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using the initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for. Then it connects and boots. From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path. I can bring the second path up and down and the host never sees it. It hangs trying to boot from the first path... forever.

The firmware appears to do the same thing. So it looks like we still have some path ordering issues in EDK2, and in dracut.

mwilck commented 1 year ago

> When booting for the first time I see that UEFI will use the programmed timeout from NBFT.

I think you mean the ConnectTimeout from the UEFI input file, but AFAIU that's only effective for the firmware; there is no corresponding field in the NBFT.

> After timing out it returns to the Boot Menu.

So this was with both interfaces down?

> However, when I run the same test using a host reboot it hangs forever. I assume this is because a warm reboot is using the initramfs and dracut is simply waiting forever, until I bring the IP link up on the nvme-tcp target port it's waiting for.

Hm, I can't quite follow. Are you talking about a host reset from the BIOS menu? If yes, do you see the grub menu / the kernel booting? I would assume that a host reset goes through the BIOS, and would behave just like the first-time boot. Again, is this with one or two devices down?

> From what I can see dracut will not try to use the alternate path in this situation. It always hangs on the first path

If it's hanging in dracut with one interface up and one down, you're observing Problem 1 from https://github.com/timberland-sig/dracut/issues/9. Which would indicate that there's indeed work to do for NM to make multipath boot work.