Want smarter use of existing structure in `Packet::emit_packet_headers`

To process a packet, we currently read each layer and then advance the r/wptr fields of the mblk_t containing the packet and its headers. This leaves the original headers in place. We then determine how many bytes the complete header stack would require (Option<encap{L2,L3,L4}> + Ulp {L2, L3, L4}), before using the existing space or allocating a new mblk to be placed at the front of the chain and copying the entire (post-transform) header set back into place.

However, many of our transforms when packets are Modified do not alter the entire header stack, or even alter the structure of ULP packet headers. While needing to add or remove the Geneve header group is universal, VPC-internal traffic would actually require no changes to the ULP, and would in effect only modify L2 addrs of transformed traffic. VPC-external packets typically only require that we change L3 src/dest addrs or L4 ports (due to NAT/SNAT) in addition to this (and, accordingly, the checksum).

It would likely be more clever to:

Determine whether any parts of the output headers have structurally changed. I.e., the addition of new layers, or extension fields. If the structure of a header group has not changed, then we can update the modified fields – and only the modified fields – in place.
Allow the encap packet headers and ULP packet headers to exist within different segments. This will allow us to only allocate and write the Geneve header stack from scratch, and apply the above optimisation to ULPs on all encap'd traffic.

This might in theory be extended to Hairpin packets under limited circumstances, but I believe that through-traffic is the most important occurrance to capture.

I did some poking around in ubench to see what a reasonable upper limit in time savings is here before embarking on a redesign of PacketMeta, based on a very rough strawman:

Changes to `userland.rs`

```rust |(mut pkt, dir)| { let _ = black_box(|| { // I.e., btreemap lookup, lock acquire. port.port.pseudo_process(dir, &mut pkt, ActionMeta::new()); match dir { In => { // Write fields (ETH dst, IP DST) // Recompute Cksum. // Chop off encap. let inner_len = pkt.meta().inner.hdr_len(); let mut ipsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into(); let mut tcpsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into(); let old_addr = pkt.meta().inner_ip4().unwrap().dst; pkt.segs[0].expand_start(inner_len); let mut wtr = pkt.segs[0].get_writer(); let bytes = wtr.slice_mut(inner_len).unwrap(); // ETH [0..14] bytes[0..6].copy_from_slice(&mac); // IP [14..34] bytes[30..34].copy_from_slice(&int_ip); ipsum.sub_bytes(&old_addr); ipsum.add_bytes(&int_ip); bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes()); tcpsum.sub_bytes(&old_addr); tcpsum.add_bytes(&int_ip); // TCP [34..54] bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes()); }, Out => { // Write fields (ETH src, IP SRC) // Recompute Cksum. // Push precomputed encap. let inner_len = pkt.meta().inner.hdr_len(); let mut ipsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into(); let mut tcpsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into(); let old_addr = pkt.meta().inner_ip4().unwrap().src; pkt.segs[0].expand_start(inner_len); let mut wtr = pkt.segs[0].get_writer(); let bytes = wtr.slice_mut(inner_len).unwrap(); // ETH [0..14] bytes[6..12].copy_from_slice(&mac); // IP [14..34] bytes[26..30].copy_from_slice(&ext_ip); ipsum.sub_bytes(&old_addr); ipsum.add_bytes(&ext_ip); bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes()); tcpsum.sub_bytes(&old_addr); tcpsum.add_bytes(&ext_ip); // TCP [34..54] bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes()); let mut seg = unsafe { let mp = allocb(encap_len); PacketSeg::wrap_mblk(mp) }; // NOTE: encap_dummy_bytes is a prebuilt vector. seg.expand_end(encap_len); let mut wtr = seg.get_writer(); let wrt = wtr.slice_mut(encap_len).unwrap(); wrt.copy_from_slice(&encap_dummy_bytes); seg.link(&pkt.segs[0]); pkt.segs.insert(0, seg); }, }; }); }, ``` (There are a bunch of extra `pub` modifiers added to existing elements to make this run.)

TL;DR: it captures a reasonable subset of a VPC-external traffic flow: we grab a lock on the port, lookup the flowkey in the btreemap, source/dest IP is rewritten, source/dest MAC is rewritten, checksums are updated and rewritten, and in the outbound case we memcpy a pre-prepared packet header stack of length ETH + v6 + UDP + Geneve + GeneveExt.

Roughly, perf looks like (ignoring the vague complaints about regression):

#
# before
#
Benchmarking process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B: Collecting 100 samples in estimated 5.0084 s
process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B
                        time:   [293.83 ns 295.23 ns 296.70 ns]
                        change: [+1.0641% +1.7977% +2.4187%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

process/ULP-FastPath/wallclock/V4-Tcp-IN-1400B
                        time:   [216.11 ns 217.07 ns 218.10 ns]
                        change: [+1.1622% +1.6947% +2.2003%] (p = 0.00 < 0.05)
                        Performance has regressed.
#
# after
#
Benchmarking strawman/V4-Tcp-OUT-1400B(hypothetical): Collecting 100 samples in estimated 5.0071 s (2.2M it
strawman/V4-Tcp-OUT-1400B(hypothetical)
                        time:   [73.838 ns 74.196 ns 74.560 ns]
                        change: [-1.2417% -0.5123% +0.2084%] (p = 0.18 > 0.05)
                        No change in performance detected.

strawman/V4-Tcp-IN-1400B(hypothetical)
                        time:   [72.621 ns 73.367 ns 74.216 ns]
                        change: [-0.9970% +0.0679% +1.1228%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So in the best case for VPC-external traffic, Port::process could get around a ~3X speedup on inbound packets and ~4X speedup on outbound packet processing -- assuming we can build up the transforms for UFT hit packets at least.

While it's worth knowing that there's a big gap here, the effort needed to enable this (i.e., lazy reads + dirty write tracking, knowledge of structural changes) is probably pretty substantial relative to the gain. For instance, ballpark from flamegraphs indicates:

Port::process -- ~30% (pre-#475)
mac_tx -- ~27%
next_hop -- ~21%
Packet::wrap_mblk_and_parse -- ~13%

So I guess solving this will give us a ≤1.3X speedup, whereas doing some route caching can probably get us a similar speedup with less effort. I'll puzzle this out for a while.

oxidecomputer / opte

Want smarter use of existing structure in `Packet::emit_packet_headers` #481