What to do about MTU and Fragmentation

asedeno commented 7 months ago

Standard 802.3 Ethernet frames can be up to 1518 octets in length, 1522 octets when tagged (802.1Q VLANs), and 1526 octets when double-tagged (802.1ad Stacked VLANs).

This includes up to 1500 octets of payload, and the L2 framing itself:

- 6 octets       - Destination MAC
- 6 octets       - Source MAC
- (4 octets)     - VLAN Tag (802.1Q)
- (4 octets)     - VLAN Stacking (802.1ad, or QinQ)
- 2 octets       - Ethertype or Length
- 46-1500 octets - Payload
- 4 octets       - Frame Check Sequence (CRC)

The CRC can be dropped and recomputed. The rest of the framing must be transmitted in one form or another. Assuming a maximal payload plus the required 14 octets of header, and no VLAN tags, we may need to carry up to 1514 octets per frame.
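The arithmetic above can be checked with a short sketch. The constants come from the field list; the function name and parameters are just illustrative.

```python
# Field sizes in octets, taken from the 802.3 field list above.
MAC_ADDR = 6
VLAN_TAG = 4
ETHERTYPE = 2
MAX_PAYLOAD = 1500
FCS = 4


def frame_octets(vlan_tags: int = 0, payload: int = MAX_PAYLOAD,
                 include_fcs: bool = True) -> int:
    """Total frame length for a given number of VLAN tags."""
    header = 2 * MAC_ADDR + vlan_tags * VLAN_TAG + ETHERTYPE
    return header + payload + (FCS if include_fcs else 0)


print(frame_octets())                   # untagged, with FCS
print(frame_octets(vlan_tags=1))        # 802.1Q
print(frame_octets(vlan_tags=2))        # 802.1ad
print(frame_octets(include_fcs=False))  # what must actually be carried
```

The last call is the 1514-octet case: an untagged frame with the FCS dropped.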

When using this draft over HTTP/1.1 or HTTP/2, the underlying transport is TCP, which provides a reliable and ordered stream of octets. Ethernet frames are transmitted using HTTP Datagrams and the Capsule Protocol defined in RFC 9297, and will naturally be fragmented as needed.

When using this draft over HTTP/3, things get interesting. Ideally, Ethernet frames would be transmitted over HTTP Datagrams using QUIC Datagrams (RFC 9221). Such Datagrams are required to fit inside a single QUIC packet, and are therefore limited by the underlying network MTU and further limited by the overhead of H3/QUIC/UDP/IP.
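To get a feel for the numbers, here is a rough sketch of how much room is left for an Ethernet frame inside one QUIC Datagram. The per-layer sizes are assumptions for a typical case (8-octet connection ID, 4-octet packet number, 16-octet AEAD tag, one-octet Quarter Stream ID and Context ID varints, DATAGRAM frame without a length field, nothing else coalesced into the packet); real connections will differ.

```python
def max_datagram_payload(path_mtu: int, ipv6: bool = True,
                         dcid_len: int = 8, pn_len: int = 4,
                         quarter_stream_id_len: int = 1,
                         context_id_len: int = 1) -> int:
    """Estimate octets available for an Ethernet frame in one QUIC packet.

    All overhead figures are assumptions for illustration, not values
    taken from the draft.
    """
    ip_header = 40 if ipv6 else 20
    udp_header = 8
    quic_short_header = 1 + dcid_len + pn_len  # flags + DCID + packet number
    aead_tag = 16
    datagram_frame_type = 1                    # DATAGRAM frame, no length field
    overhead = (ip_header + udp_header + quic_short_header + aead_tag
                + datagram_frame_type + quarter_stream_id_len + context_id_len)
    return path_mtu - overhead


print(max_datagram_payload(1500))              # IPv6 underlay
print(max_datagram_payload(1500, ipv6=False))  # IPv4 underlay
```

Under these assumptions a 1500-octet underlay leaves well under 1514 octets, so a full-sized frame does not fit in a single Datagram.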

Trivially, if an Ethernet frame is too large to be transmitted this way, we could fall back to Capsules over the connection's HTTP stream. However, this comes with the cost of Head of Line blocking for all frames so transmitted in addition to the overhead of fragmentation.
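A minimal sketch of that fallback decision, assuming the sender already knows the largest Datagram payload the connection currently accepts (the function name and return values are hypothetical):

```python
def choose_transport(frame_len: int, max_datagram_payload: int) -> str:
    """Pick how to send one Ethernet frame over an HTTP/3 connection.

    Frames that fit go in an unreliable QUIC Datagram; anything larger
    falls back to a DATAGRAM capsule on the request stream, which is
    reliable but subject to head-of-line blocking.
    """
    if frame_len <= max_datagram_payload:
        return "datagram"
    return "capsule"


print(choose_transport(600, 1420))
print(choose_transport(1514, 1420))
```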

I think we should recommend some form of MTU negotiation such that Ethernet frames being proxied are guaranteed to fit within the MTUs of both attached networks and the maximum size H3 Datagram for the connection between them.
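As a strawman, the negotiated value could simply be the minimum of the three limits. This is a sketch only: it assumes each side knows its attached network's MTU and the current maximum Datagram payload, and that the 14-octet untagged L2 header is carried with every frame. The names are hypothetical.

```python
L2_HEADER = 14  # destination MAC + source MAC + Ethertype, no VLAN tags


def proxied_mtu(local_mtu: int, remote_mtu: int,
                max_datagram_payload: int) -> int:
    """Largest Ethernet payload that fits both networks and one Datagram."""
    return min(local_mtu, remote_mtu, max_datagram_payload - L2_HEADER)


# A standard segment bridged to a jumbo-frame segment over a tunnel
# whose Datagrams can carry at most 1420 octets.
print(proxied_mtu(1500, 9000, 1420))
```

In that example the tunnel is the bottleneck, not either attached network.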

An added complication is that the path may not be fixed, and so the path MTU may vary over time, and we'll need to think about how to handle that too.

If an operator is in full control of the network and both Ethernet proxying endpoints, they should adjust the underlying MTU to support whatever proxied MTU they require.

When the underlying network MTU is not adjustable and the operator requires a larger MTU than would be supported by the underlying network, fragmentation is necessitated.

What knobs do we need to provide here?
What should we recommend with regard to how to tune those knobs?
What warning labels should we attach to them?

Let's discuss.

DavidSchinazi commented 7 months ago

The MTU implications of connect-ethernet are very similar to those of connect-ip; the only difference is that the amount of overhead is slightly larger. I'd suggest reusing the solutions from RFC 9484. My assumption is that Ethernet doesn't require an MTU of 1500 to work properly, is that incorrect?

asedeno commented 7 months ago

You're right, it doesn't. However, this was a concern brought up at IETF 118, and so I want discussion around it.

achernya commented 7 months ago

@DavidSchinazi

> My assumption is that Ethernet doesn't require an MTU of 1500 to work properly, is that incorrect?

IIRC the Ethernet MTU of every device in a broadcast domain has to match, otherwise you will get blackholing. I believe we can't rely on PMTUD within the broadcast domain, because the Ethernet MAC will drop oversized frames, so the host networking stack never gets the opportunity to send ICMP packet-too-big messages.

As a result, I think if we want to support a full-sized Ethernet MTU we need to either support fragmentation (over unreliable delivery) or use reliable streams.

DavidSchinazi commented 7 months ago

Can you elaborate on why the MTU has to match? PMTUD still works fine on the Internet even in the absence of ICMP packet-too-big.

achernya commented 7 months ago

AIUI an Ethernet frame that is too large for the device's MTU is dropped at the MAC layer, before it even hits any OS or user code. This means that while protocols that do PMTUD can discover and work around this black hole, it won't work for any protocols that expect the local broadcast domain to be "well-behaved". I do suspect that a surprisingly large number of protocols for which we'd want an L2-VPN-style connection fall into this category.

alvestrand commented 6 months ago

Don't we have experience with connecting jumbo-frame Ethernets to non-jumbo-frame Ethernets? Shouldn't we emulate that behavior (whatever it is)?

ietf-wg-masque / draft-ietf-masque-connect-ethernet

What to do about MTU and Fragmentation #1