nakato / nixos-bpir3-example

MIT License

enable hardware flow offloading #7

Closed ghostbuster91 closed 1 year ago

ghostbuster91 commented 1 year ago

Hi,

I was trying to get hardware flow offloading with nftables to work, but I was getting errors:

ruleset.conf:5:15-15: Error: Could not process rule: No such file or directory
    flowtable f {
              ^
ruleset.conf:13:28-42: Error: Could not process rule: No such file or directory
  ip protocol { tcp, udp } flow offload @f
                           ^^^^^^^^^^^^^^^

Frank, from the Banana Pi forum, suggested that it might be that CONFIG_NF_FLOW_TABLE is not enabled. Looking at the config file I see that it is enabled, though CONFIG_NF_FLOW_TABLE_INET is commented out.

Was there any reason to disable it? Can I simply uncomment it and assume that flow offloading should start to work?

nakato commented 1 year ago

Was there any reason to disable it?

It's disabled by default.

Can I simply uncomment it and assume that flow offloading should start to work?

It can be configured as a module. To build it in you'd need to change a lot more of the config.


-# CONFIG_NF_FLOW_TABLE_INET is not set
+CONFIG_NF_FLOW_TABLE_INET=m
nakato commented 1 year ago

It'll take some time to build, but I've updated packages.aarch64-linux.linuxPackages_bpir3 to use structuredExtraConfig, which means it has everything the NixOS distro kernel has, minus DRM (video), sound, and IB; this is what I'm using on my devices now that I've placed them in service. When I was working on RISC-V, DRM took a ton of disk space during the build, so I'm avoiding it here given the constrained memory and disk throughput of the build hosts.

Technically you don't need to use a pkg for it; you can set config.boot.kernelPatches instead.

kernelPatches = [
  {
    name = "Patch";
    patch = ./some.patch;
  }
  {
    name = "Extra config not related to a patch";
    patch = null;
    structuredExtraConfig = with lib.kernel; {
      PCIE_MEDIATEK = yes;
      ...
    };
  }
];

https://github.com/nakato/nixos-bpir3-example/blob/4210480bdebbf3a7953e22d5d9f183f47b725bff/pkgs/linux.nix#L39-L96

packages.aarch64-linux.linuxPackages_bpir3_minimal contains the prior configfile based derivation.

ghostbuster91 commented 1 year ago

It can be configured as a module. To build it in you'd need to change a lot more of the config.

Sorry for the delay; I didn't understand that at first and was thinking about what to reply. After your last answer it makes more sense :) I didn't even know that we were using some kind of minimal version of the kernel.

it has everything the NixOS distro kernel has, minus DRM (video), sound, and IB; this is what I'm now using on my devices now that I've placed them in-service.

Perfect, video, sound and IB are redundant in my use-case.

It took almost 9 hours on bpir3 to build it :D I will try to test the hw offloading soon and then I will close the issue. Thanks for the great work!

By the way, I would really like to be more helpful beyond just being a simple issue spammer. Could you recommend some resources that would help me gain a better understanding of these things? I was thinking about Linux From Scratch, but perhaps there is something better?

ghostbuster91 commented 1 year ago

I tried running the bpir3 with the full kernel but it failed to mount the NVMe disk. Looking at the code, the PCI patch should be applied. Any idea what went wrong here? [Screenshot from 2023-07-10 21-03-38]

Here is the change: https://github.com/ghostbuster91/nixos-router/commit/a4843ff78be2458401322866f713655280440974

nakato commented 1 year ago

For NVMe during boot, try the following.

boot.initrd.availableKernelModules = [ "nvme" ];

I'm kind of surprised that's not in there by default.

That should work, but if it doesn't automatically load it as it should, then boot.initrd.kernelModules instead will explicitly make it load the kernel module in the initramfs.
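As a sketch, the two options would sit in your configuration like this (standard NixOS options; you'd normally only need one of the two lines):

```nix
{
  # Ships the nvme module in the initramfs so it can be loaded on demand.
  boot.initrd.availableKernelModules = [ "nvme" ];

  # Or, if auto-loading doesn't kick in, force-load it in stage 1:
  # boot.initrd.kernelModules = [ "nvme" ];
}
```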

If that fails, boot.kernelParams = [ "boot.shell_on_fail" ]; will tell the initramfs to drop you to a shell (with very few tools) on failure. From there you can use ls and cat to poke around /sys/bus/pci/devices/ to see whether the PCIe device is missing and which modules are loaded, but I'm fairly certain it's the above kernel module.


I'm rather surprised that the system attempts to mount /var/log in stage1; I certainly didn't expect that, especially on a systemd-based system.

Turns out this is the list of on-boot paths that nix uses to decide what to mount in stage1.


By the way, I would really like to be more helpful beyond just being a simple issue spammer.

I've learned something with every issue here, so it's been good; wouldn't call it spam. Seeing your repo has made me realise that a lot of the configuration that is in flake.nix here would be better placed in an importable module so it wouldn't need to be copy-pasted. Plus, after doing that it should be in pretty good shape to contribute to nixos-hardware.

Could you recommend some resources that would help me gain a better understanding of these things? I was thinking about Linux From Scratch, but perhaps there is something better?

I did LFS a long time ago; it's pretty good for getting a feel for how everything hangs together, and for how the different build tooling works. It also follows the FHS, which NixOS doesn't. It looks like it hasn't adopted systemd either, which is good, especially for learning.

If you're only interested in configuring a kernel from scratch and in how initramfs images are used, then I'd suggest grabbing Arch or Ubuntu and building a kernel from scratch there: first one that boots the system successfully without an initramfs, then a rebuild with a few of the required drivers as modules, building an initramfs by hand to load those and pivot_root to the real system. One of the two Gentoo wiki pages on the subject should be a good resource for this: "Gentoo Wiki - Custom Initramfs" and "Gentoo Wiki - Initramfs: make your own". From skimming, I'd probably focus on the "Custom Initramfs" one.

ghostbuster91 commented 1 year ago

I finally found some time to poke around the bpir again. Regarding the NVMe boot problem, specifying:

boot.initrd.availableKernelModules = [ "nvme" ];

has fixed the issue :)


so it wouldn't need to be copy-pasted

Tbh I am not sure I understand what you mean. It was copy-pasted in the beginning, but then I changed it to import your flake as an input, and nowadays there is not much of the low-level stuff there. My guess is that you looked at https://github.com/ghostbuster91/nixos-bpir3-example, while I have moved to a new repository, as working on a fork was less than convenient: https://github.com/ghostbuster91/nixos-router


Regarding "Linux from scratch/gentoo/initramfs"

thanks this is truly invaluable info :)


Last but not least, getting back to the main issue with flow offloading: I have switched to the full kernel, and I can see that the list of loaded modules is now bigger:

$ lsmod                                                                                                                                                                                                                                               
Module                  Size  Used by
nft_masq               12288  1
nft_ct                 24576  2
nft_chain_nat          12288  1
nf_nat                 53248  2 nft_masq,nft_chain_nat
nf_conntrack          167936  3 nf_nat,nft_ct,nft_masq
nf_defrag_ipv6         24576  1 nf_conntrack
nf_defrag_ipv4         12288  1 nf_conntrack
nf_tables             249856  39 nft_ct,nft_masq,nft_chain_nat
libcrc32c              12288  3 nf_conntrack,nf_nat,nf_tables
nfnetlink              20480  1 nf_tables
crypto_safexcel       155648  0
md5                    12288  1 crypto_safexcel
libdes                 20480  1 crypto_safexcel
authenc                12288  1 crypto_safexcel
crct10dif_ce           12288  1
polyval_ce             12288  0
polyval_generic        12288  1 polyval_ce
sm4                    12288  0
cmdlinepart            12288  0
spinand                61440  0
sfp                    36864  0
nls_iso8859_1          12288  1
nls_cp437              16384  1
mdio_i2c               16384  1 sfp
i2c_gpio               16384  2
uio_pdrv_genirq        16384  0
uio                    20480  1 uio_pdrv_genirq
sch_fq_codel           16384  17
tap                    28672  0
macvlan                28672  0
fuse                  143360  1
nvme                   49152  4
nvme_core             139264  6 nvme
mt7915e               167936  0
mt76_connac_lib        69632  1 mt7915e
mt76                  106496  2 mt7915e,mt76_connac_lib
mac80211              942080  3 mt76,mt7915e,mt76_connac_lib
libarc4                12288  1 mac80211
cfg80211              958464  4 mt76,mt7915e,mac80211,mt76_connac_lib
rfkill                 32768  4 cfg80211

There is no nf_flow_table_inet but after calling sudo modprobe nf_flow_table_inet it appears on the list - so far so good.

However, while trying to configure nftables I am getting the same error as at the beginning. I extracted the running kernel configuration using cat /proc/config.gz | gunzip > running.config and verified that the following options are enabled:

CONFIG_NF_FLOW_TABLE_INET=m
CONFIG_NF_FLOW_TABLE=m
CONFIG_NFT_FLOW_OFFLOAD=m

Full configuration https://pastebin.com/aLdVj4yq

CONFIG_NF_FLOW_TABLE_INET is described as "netfilter flow table mixed IPv4/IPv6 module", so I would assume that it loads the IPv4- and IPv6-relevant modules.

But maybe it does not. Do you think that we should also enable

CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_FLOW_TABLE_IPV6=m

?

Update: I realized that there are three nft families (ipv4, ipv6, and inet for mixed), so these config options correspond directly to them. I modified my example to configure offloading for the inet family instead of ip, as that is the one already enabled in the kernel, but the results were the same.

nakato commented 1 year ago

But maybe it does not. Do you think that we should also enable

CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_FLOW_TABLE_IPV6=m

These don't exist in the kernel anymore; it looks like they were removed sometime around v5.17.


However, while trying to configure nftables I am getting the same error as at the beginning.

I don't see why this kernel build would do that; it should work without needing to modprobe anything by hand, as that should be taken care of automatically.
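If it ever did need forcing, a NixOS option along these lines would load the module at boot (untested sketch; normally nft triggers the autoload itself when the flowtable rule is installed):

```nix
{
  # Unconditionally load the flowtable module at boot.
  boot.kernelModules = [ "nf_flow_table_inet" ];
}
```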

I was able to load the following on my BPiR3, can you try it on yours?

table inet x {

    flowtable f {
        hook ingress priority 0; devices = { lo };
    }

    chain forward {
        type filter hook forward priority 0; policy drop;

        # offload established connections
        ip protocol { tcp, udp } flow offload @f
        ip6 nexthdr { tcp, udp } flow offload @f
        counter packets 0 bytes 0

        # established/related connections
        ct state established,related counter accept

        # allow initial connection
        ip protocol { tcp, udp } accept
        ip6 nexthdr { tcp, udp } accept
    }
}

In your flowtable, in the line that specifies devices = { }, what devices are listed? You'll get the same No such file or directory error if an interface listed there doesn't exist at the moment the ruleset is loaded.

ghostbuster91 commented 1 year ago

Indeed, I got confused by this No such file or directory error. I had br-lan in there, which I knew wouldn't work, but I was expecting a silent failure instead. If I use the snippet you provided with only lo, everything seems to work fine. However, as soon as I try to include anything else besides lo, the initial error appears again.

Can you try this snippet with lan0 or eth0?

nakato commented 1 year ago

Can you try this snippet with lan0 or eth0?

        hook ingress priority 0; devices = { lan0 };
        hook ingress priority 0; devices = { lo, lan0 };
        hook ingress priority 0; devices = { eth0 };

All of the above loaded without error.

With regard to the last one, I don't know if it's valid to use a flowtable on eth0. I think that interface would have the DSA tags on it, but that's just a guess.

Does the following work?

nft --check 'add table inet x; add flowtable inet x f { hook ingress priority 0; devices = { "eth1", "lan0", "lan1", "lan2", "lan3", "lan4" }; flags offload; }'
ghostbuster91 commented 1 year ago
nft --check 'add table inet x; add flowtable inet x f { hook ingress priority 0; devices = { "eth1", "lan0", "lan1", "lan2", "lan3", "lan4" }; flags offload; }'

Yes, it works.

I am using networkd to manage my network configuration. Maybe we are running into https://github.com/NixOS/nixpkgs/issues/141802 :thinking: I will try this later today.

nakato commented 1 year ago

All those interfaces should be available by the time we reach sysinit.target, after which nftables.service would be free to start up. In fact, as the modules are built in, those should all exist before we hit init in the initramfs. I haven't tried configuring the flowtable on-boot though. Are you also trying to include virtual interfaces that would be configured by networkd?

I notice that nixpkgs#141802 poses the potential solution of moving the firewall later in boot, but that's not an acceptable solution, at least not distro-wide. After=network-pre.target sets no explicit upper bound, so it would be completely valid for the firewall to be the last unit started, after every network-using service, leaving the system firewall unconfigured as network services begin listening; though it need not be that egregious to be exploitable. For example, if I could find a bug that crashes a server configured this way, I could crash the device ad nauseam while I port-scan and attempt to exploit it during the small window in which it has no valid firewall configured. The same problem applies to Before=network.target and Before=network-online.target, just with smaller windows and possibly fewer targets. Outbound traffic could be an issue too, depending on a user's configuration.

The best way I can think of handling the flowtable would be to setup a systemd unit so it is started after the initial firewall is configured, and starts/stops/reloads whenever nftables.service does, while also being ordered after whatever virtual interfaces you need for it to be configured. There should be a systemd.device you can order on, like sys-subsystem-net-devices-lan0.device, or maybe something more networkd specific.

This is untested, but I think the systemd.service would need to look something like this. You'd probably also want to bind it to the device so it stops/starts when the device dependency goes away/re-appears.

[Unit]
Before=network-online.target
After=nftables.service OtherService.service
Requires=OtherService.service
BindsTo=nftables.service
ReloadPropagatedFrom=nftables.service

[Service]
ExecReload=nft -f <rules>
# Reload probably needs to be smarter as the table already exists, and you probably want to replace what is in it.
# If reload isn't specified, does reload stop and start?
ExecStart=nft -f <rules>
ExecStop=-nft delete table inet hwflow
# Allow ExecStop to fail as the rule might have been purged by a `flush ruleset`

With the rules being something like this.

table inet hwflow {

    flowtable ft {
        hook ingress priority 0; devices = { dev0, dev1, ... };
        # This is the only flowtable, so shouldn't need to do anything with priority
    }

    chain forward {
        type filter hook forward priority +1; policy accept;
        # I'm assuming that the default forward chain will have "type filter hook forward priority 0; policy drop;"
        # so any packets that make it to this lower-priority chain will be explicitly allowed traffic, but you should
        # validate this assumption.

        # offload established connections
        ip protocol { tcp, udp } flow offload @ft
        ip6 nexthdr { tcp, udp } flow offload @ft
    }
}

When nftables.service reloads/starts/stops it includes a flush ruleset, so these rules are purged, which is why this needs to be re-run when nftables.service reloads. The stop command of the flowtable service only deletes the table containing the flowtable, so the remainder of the rules would remain configured.
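For completeness, a rough NixOS translation of the unit sketch above (untested; the service name and the /etc/nftables-flowtable.conf path are placeholders):

```nix
{ pkgs, ... }:
{
  systemd.services.nftables-flowtable = {
    description = "nftables hardware flowtable";
    before = [ "network-online.target" ];
    after = [ "nftables.service" ];
    bindsTo = [ "nftables.service" ];
    wantedBy = [ "multi-user.target" ];
    unitConfig.ReloadPropagatedFrom = "nftables.service";
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      ExecStart = "${pkgs.nftables}/bin/nft -f /etc/nftables-flowtable.conf";
      ExecReload = "${pkgs.nftables}/bin/nft -f /etc/nftables-flowtable.conf";
      # Leading "-" lets stop succeed even if a flush already removed the table.
      ExecStop = "-${pkgs.nftables}/bin/nft delete table inet hwflow";
    };
  };
}
```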

ghostbuster91 commented 1 year ago

Are you also trying to include virtual interfaces that would be configured by networkd?

No, I was only trying to make it work for lan(DSA) interfaces.

I notice that nixpkgs#141802 poses the potential solution of moving the firewall later in boot, but that's not an acceptable solution, at least not distro-wide. After=network-pre.target sets no explicit upper bound, so it would be completely valid for the firewall to be the last unit started, after every network-using service, leaving the system firewall unconfigured as network services begin listening; though it need not be that egregious to be exploitable. For example, if I could find a bug that crashes a server configured this way, I could crash the device ad nauseam while I port-scan and attempt to exploit it during the small window in which it has no valid firewall configured. The same problem applies to Before=network.target and Before=network-online.target, just with smaller windows and possibly fewer targets. Outbound traffic could be an issue too, depending on a user's configuration.

You are right. I can't recall where but I have seen the same idea posted somewhere - which basically boils down to having two firewalls; one for the initialization, and the second one that will run after everything is configured and up.

I reached out to people on the nixos-networking Matrix channel with that question, and they told me that the reason I am seeing that error is that the ruleset check is performed in a sandboxed environment (a network namespace) where nftables cannot access the real interfaces. Disabling the rule check should fix the problem (networking.nftables.checkRuleset = false).
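In NixOS terms that looks roughly like this (sketch, assuming the ruleset lives next to the config in ruleset.conf):

```nix
{
  networking.nftables = {
    enable = true;
    # The build-time check runs in a network namespace without the real
    # interfaces, so rules that reference lan0/wan/br-lan would fail it.
    checkRuleset = false;
    ruleset = builtins.readFile ./ruleset.conf;
  };
}
```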

I did that, and the rule was applied. nft list ruleset reports:

table inet filter {
        flowtable f {
                hook ingress priority filter
                devices = { br-lan, lan0, lan1, lan2, lan3, lo, wan }
        }

        chain input {
                type filter hook input priority filter; policy drop;
                ip protocol { tcp, udp } flow add @f
                iifname "br-lan" accept comment "Allow local network to access the router"
                iifname "wan" ct state { established, related } accept comment "Allow established traffic"
                iifname "wan" icmp type { destination-unreachable, echo-request, time-exceeded } counter packets 0 bytes 0 accept comment "Allow select ICMP"
                iifname "wan" counter packets 11 bytes 1719 drop comment "Drop all other unsolicited traffic from wan"
                iifname "lo" accept comment "Accept everything from loopback interface"
        }

        chain forward {
                type filter hook forward priority filter; policy drop;
                iifname "br-lan" oifname "wan" accept comment "Allow trusted LAN to WAN"
                iifname "wan" oifname "br-lan" ct state established,related accept comment "Allow established back to LANs"
        }
}
table ip nat {
        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
                oifname "wan" masquerade
        }
}

ethtool seems to confirm this as well:

$ ethtool -k lan0 | grep offload
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

Now I need to figure out how/if that can be used in conjunction with a bridge interface, but that is another story, not related to this issue.

If you are happy with this result feel free to close this issue, otherwise we can continue digging further :)

nakato commented 1 year ago

the reason why I am seeing that error is that the rule check is performed in the sandboxed environment

Oh, I thought the error was occurring at runtime, not during the derivation build. That explains a lot.

ethtool seems to confirm this as well:

...
hw-tc-offload: on
...

In the case of hw-tc-offload it is only reporting that the hardware supports it, not that it is in use.

Without flags offload; defined in the flowtable, it will be using the software fastpath. See nf_flowtable.rst - Hardware offload.
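So to actually request the hardware path, the flowtable needs the flag, roughly like this (sketch; adjust the device list to your setup):

```
table inet filter {
    flowtable f {
        hook ingress priority filter
        devices = { lan0, lan1, lan2, lan3, wan }
        flags offload
    }
}
```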

If you are happy with this result feel free to close this issue, otherwise we can continue digging further :)

Yea, I'm happy to close this.

ghostbuster91 commented 1 year ago

Hi, just an FYI: I was able to figure out how to combine that with a bridge interface. Basically, the flowtable is opaque from the bridge's point of view, so you just add the member interfaces to the flowtable definition. Full example here: https://github.com/ghostbuster91/nixos-router/pull/30/files#diff-97507722ffda08c8542d51270c3c45765f50a284bf85e322027140afc7f4293fR30