isovalent / ebpf-docs

An effort to comprehensively document eBPF
https://ebpf-docs.dylanreimerink.nl/
BSD 2-Clause "Simplified" License

Add some detail around MTU maximums to XDP program type page #17

Closed dylandreimerink closed 3 months ago

dylandreimerink commented 5 months ago

For normal XDP programs, the expectation is that the full packet is available between ctx->data and ctx->data_end as linear memory. At the level where XDP runs, physically contiguous memory pages cannot be guaranteed, so the packet plus its metadata has to fit in a single memory page (typically 4 KiB). After subtracting things like headroom and skb metadata, the actual maximum packet size is smaller than a page. Drivers therefore tend to block XDP programs from being attached if the MTU is too high. To new users this looks like an arbitrary limitation, and it is also poorly documented.

So it would be nice if we could dedicate a section of the XDP program type page to explaining this limit, how and why the numbers differ between drivers, and perhaps other factors at play, such as the impact of XDP frags and other driver-specific settings.

dylandreimerink commented 5 months ago

I had some time to go over the code for different drivers and put together an initial overview by copying the relevant max MTU calculation code and running it to get some initial numbers:

| driver          | Max MTU normal   | Max MTU frags |
| --------------- | ---------------- | ------------- |
| Veth            | 3520[1]          | 73152[1]      |
| Tun             | 1500             | x             |
| Virtio          | 3502             | :infinity:    |
| xen-netfront    | 3840             | x             |
| Bond            | [2]              | [2]           |
| ENA             | 3498             | x             |
| AQ              | 2048             | :infinity:    |
| BNXT            | 3500             | :infinity:    |
| Cavium Thunder  | 1508             | x             |
| Engleder        | :infinity: :sus: | :infinity:    |
| Freescale FEC   | :infinity: :sus: | :infinity:    |
| Freescale DPAA  | 3578             | x             |
| Freescale DPAA2 | [3]              | x             |
| Freescale ENETC | :infinity: :sus: | :infinity:    |
| FunEth          | 3566             | x             |
| GVE             | 2032             | :infinity:    |
| I40E            | 3046[4]          | 9702[4]       |
| ICE             | 3046[5]          | :infinity:    |
| IGB             | 3046[6]          | x             |
| IGC             | 1500             | x             |
| IXGBE           | 3050[7]          | x             |
| IXGBEVF         | 3046[6]          | x             |
| Marvell mvneta  | 3424             | :infinity:    |
| Marvell MVPP2   | 3552             | :infinity:    |
| Marvell OTX2    | 1508             | x             |
| MediaTek        | 1508             | x             |
| Mlx4            | 3498             | x             |
| Mlx5            | 3496             | :infinity:    |
| Lan966x         | :infinity: :sus: | :infinity:    |
| Mana            | 3506             | x             |
| Netronome       | [3]              | [3]           |
| qede            | :infinity: :sus: | :infinity:    |
| SFC             | 3490[8]          | x             |
| Stmicro         | 1500             | x             |
| ti              | :infinity: :sus: | :infinity:    |
| hyperv          | 3520             | x             |
| vmxnet3         | 3492             | x             |

[1]: On x86, ARM64, and PowerPC. Else 3518 / 73150
[2]: Depends on slave devices
[3]: Depends on NIC firmware
[4]: Unless running in legacy RX mode, then 2006
[5]: Unless running in legacy RX mode, then 1638
[6]: If using large rings, else if "build skb" 1508, else 2022
[7]: If using 3K buffer, else if "build skb" 1514, else 2026
[8]: If Falcon/Siena, else if A10 3530, A100 3522

Note that these numbers are based on struct sizes in my particular kernel and constants from Linux master, so they can shift between kernel versions. I am going to include them in the docs initially, with a warning. After that, I want to improve this by converting the MTU calculation code for each driver into something we can more easily run against variables such as different page sizes, struct sizes for the LTS kernels, etc. That should produce a more accurate matrix of max MTUs. Then we need to figure out some way to include that in the docs in a readable way.