linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0
1.45k stars 653 forks source link

nbft: Always stick to HFI-defined host_traddr #2180

Open tbzatek opened 8 months ago

tbzatek commented 8 months ago

https://github.com/linux-nvme/nvme-cli/blob/dcdad6f5d70ffb2fa151f229db048180671eb1fe/nbft.c#L162-L190

I'm not sure this is 100% correct. The possibility of receiving different IP address between DHCP requests (UEFI vs. linux or simply different DHCP clients) is valid and should be covered.

However, consider the following scenario:

Setting proper host_traddr for connection attempts is crucial to indicate the source interface. Once omitted by the mentioned code snippet, connection might be established from a different interface, ignoring the desired HFI. In case of incorrect routing the second target may be silently reached from a different network interface than expected.

This needs properly parsed HFI indexes first, see https://github.com/linux-nvme/libnvme/pull/766

martin-belanger commented 8 months ago

Hi @tbzatek - Regarding the host_traddr parameter:

Setting proper host_traddr for connection attempts is crucial to indicate the source interface.

For TCP (and TCP only), host_traddr is not used to select the "interface" but to select the "source address" of the connection. One must use host_iface to select the interface. host_traddr is used when an interface has multiple addresses configured (a primary and 1 or more secondary) and we want to prevent the primary address from being used for the connection.

tbzatek commented 8 months ago

For TCP (and TCP only), host_traddr is not used to select the "interface" but to select the "source address" of the connection. One must use host_iface to select the interface. host_traddr is used when an interface has multiple addresses configured (a primary and 1 or more secondary) and we want to prevent the primary address from being used for the connection.

Ah, thanks, I'm missing the low-level insight really. Still, wouldn't the end result effectively be equal, i.e. the kernel would use a network interface the host_traddr belongs to? Setting host_iface in userspace might involve some extra lookup.

Anyway, I think we'll first need to solve https://github.com/timberland-sig/edk2/issues/29 and https://github.com/linux-nvme/libnvme/pull/766 before fixing the loose end here.

martin-belanger commented 8 months ago

Still, wouldn't the end result effectively be equal, i.e. the kernel would use a network interface the host_traddr belongs to?

Not quite. When host_iface is not specified, the kernel looks up the destination address (traddr) in the routing table to determine the best interface for the connection. After the interface has been determined, then the kernel will try to set the source address to host_traddr, if specified. Otherwise, it sets the source address to the primary address assigned to that interface. Note that if host_traddr is specified but is invalid (i.e. it does not belong to the selected interface), then the whole operation will fail.

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

mwilck commented 8 months ago

I don't we should completely remove the fallback code. Doing that would strongly increase the risk of boot failures, which is never a good thing.

However, I would suggest to walk the list of HFIs and use the fallback (without host_iface / host_traddr) only if connect attempts through all HFIs have failed. That would currently come down to falling back on every failed connection, because we haven't seen a correctly populated secondary HFI list so far.

This would only work if multipath was configured in the NBFT via secondary HFI association. In theory, the NVMe boot spec allows 3 ways to set up multipath:

  1. secondary HFI association,
  2. multiple SNSS entries referring to different HFIs via the primary HFI field (this is the most likely setup in practice these days),
  3. multiple NBFT tables listing the same SNSS with different HFIs.

My proposed suggestion would cover only case 1. Covering the other cases would be complicated because we'd need to collect all NBFT data and do "deduplication" on them somehow before attempting to connect.

tbzatek commented 8 months ago

When host_iface is not specified, the kernel looks up the destination address (traddr) in the routing table to determine the best interface for the connection. After the interface has been determined, then the kernel will try to set the source address to host_traddr, if specified. Otherwise, it sets the source address to the primary address assigned to that interface.

Thanks Martin, useful insight! In our case, the host_traddr should equal to the primary address of the network interface. At least for static IP address assignments.

In case of DHCP, the HFI IP address (as being assigned to the EFI DHCP client) may differ from what's actually assigned to the particular network interface in OS stage (due to DHCP client differences and different identifiers sent out). We don't want the kernel host driver to look up a different interface that is best for the traddr, we want to stick to the selected interface (as defined by HFI), only need to omit the specified host_traddr if it doesn't match what's actually available in the system. We want to let the connection fail rather than letting the kernel to select a foreign interface (e.g. due to messed-up routing) and keep user in a false sense of security (multipath).

At least that's my proposal, if it makes sense.

Note that if host_traddr is specified but is invalid (i.e. it does not belong to the selected interface), then the whole operation will fail.

Yes, this is what I'm currently seeing due to the mixed-up HFI indexes in the NBFT table, i.e. wrong host_traddr specified from a foreign HFI/network interface. Caused by the above mentioned bugs.

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

Nice, I think this might be the way to go.

tbzatek commented 8 months ago

I don't we should completely remove the fallback code. Doing that would strongly increase the risk of boot failures, which is never a good thing.

Agree, the reason for the fallback is still valid and needed.

However, I would suggest to walk the list of HFIs and use the fallback (without host_iface / host_traddr) only if connect attempts through all HFIs have failed.

Agree with moving the fallback at the end, after all regular attempts. I would still like to keep the host_iface constraint, see above. Subject to further discussion.

That would currently come down to falling back on every failed connection, because we haven't seen a correctly populated secondary HFI list so far.

Yes, it will currently fail until the firmware and the libnvme sides are fixed. I would suggest to postpone the proper fix to nvme-cli until everything else is fixed.

This would only work if multipath was configured in the NBFT via secondary HFI association. In theory, the NVMe boot spec allows 3 ways to set up multipath:

  1. secondary HFI association,
  2. multiple SNSS entries referring to different HFIs via the primary HFI field (this is the most likely setup in practice these days),
  3. multiple NBFT tables listing the same SNSS with different HFIs.

My proposed suggestion would cover only case 1. Covering the other cases would be complicated because we'd need to collect all NBFT data and do "deduplication" on them somehow before attempting to connect.

Right, I think what I have been talking about so far was for the case 2. We'll need to find a fix that suits both scenarios.

stuarthayes commented 8 months ago

Setting host_iface in userspace might involve some extra lookup.

Correct! If all you have is a source address and you want to use that to force the connection on a specific interface regardless of the routing tables, then you need to find out which interface owns that address and use it with host_iface (check nvme_iface_matching_addr() for an example).

Nice, I think this might be the way to go.

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

mwilck commented 8 months ago

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

Yes, we should do that. It matters only for multipath scenarios, and I believe that there will be only a few (corner) cases where this makes an actual difference, but if we want to be correct, we have to do it.

@martin-belanger has pointed out that libnvme already has the logic to do this kind of interface matching when opening connections, so I would assume that this already works. Or am I missing something?

mwilck commented 7 months ago

Additional remarks based on a recent SUSE bug report. My memory has deceived me, I thougt we had made more progress in the meantime.

The host_traddr field is basically useless in NBFT/DHCP scenarios. The only clean way to solve this would be an implementation of TP8027 on both the Firmware and OS sides, which I don't expect in the near future.

The current state is:

IOW, for DHCP configured interfaces at least, we might as well ignore the host_traddr field completely, as the IP address will almost certainly be different between FW and OS.

[^o61]: Note that option 61 predates RFC 4361. It was already part of RFC 2131 in 1997.

tbzatek commented 7 months ago

Maybe we use the other information in the HFI TCP transport info descriptor (MAC address and/or PCI address) to look up and set host_iface in discover_from_nbft()?

Agree, this looks like a clean way.

The host_traddr field is basically useless in NBFT/DHCP scenarios. The only clean way to solve this would be an implementation of TP8027 on both the Firmware and OS sides, which I don't expect in the near future.

Right, as the address might be absent in the system, resolving corresponding interface might fail. In my original case, I still had the address matched, just mixed up due to wrong HFI indexes.

IOW, for DHCP configured interfaces at least, we might as well ignore the host_traddr field completely, as the IP address will almost certainly be different between FW and OS.

So until TP8027 is ready for use, do we agree to go with resolving host_iface from HFI MAC in the meantime?

I still think this won't currently work with EDK2 due to mixed HFI indexes (and also https://github.com/timberland-sig/edk2/issues/29), but this appears to be correct with Dell firmware from what I've seen so far.

tbzatek commented 5 months ago

FYI, for the discovery code in #2315 I retained the original workaround for the time being. There are now three places with the fallback.

igaw commented 2 months ago

Good to close?

tbzatek commented 2 months ago

Good to close?

Nope, this has not been implemented yet.