microsoft / net-offloads

Specs for new networking hardware offloads.
MIT License

QUIC offload: Standardizing User-Kernel interface #56

Open rawsocket opened 1 year ago

rawsocket commented 1 year ago

There are a few points worth mentioning for the user-kernel API. Depending on the use case, the destination L3/L4 and connection ID might not be sufficient. There are cloud use cases that use ephemeral destination ports and a zero-length connection ID while sharing the same transmitting socket on the server side. That particular use case would require the source connection ID to be part of the hash/key lookup value. Most of these complications in matching an encryption key to a flow come from such use cases and from QUIC being UDP based and not always using connected sockets.
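
For illustration, the transmit-side lookup key in that shared-socket case might look roughly like the sketch below; the struct and field names are hypothetical, not a proposed interface.

```c
#include <stdint.h>
#include <netinet/in.h>

/* Hypothetical Tx key-lookup tuple.  With ephemeral destination ports and a
 * zero-length destination CID on a shared server socket, the destination
 * L3/L4 alone is ambiguous, so the source connection ID is part of the key. */
struct quic_offload_key_id {
    struct in6_addr dst_addr;   /* destination IP (v4-mapped for IPv4) */
    uint16_t        dst_port;   /* destination UDP port */
    uint8_t         dcid_len;   /* may be 0 for zero-length CIDs */
    uint8_t         dcid[20];   /* destination connection ID bytes */
    uint8_t         scid_len;   /* source CID disambiguates the shared socket */
    uint8_t         scid[20];
};
```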

The Linux approach of reducing the transmit-side lookup to zero, with the hardware matching the flow by hash, might itself become a bottleneck. It would also increase memory pressure, since more information about the flow must be held than is needed to perform the lookup, and it adds pressure to cache some flow data while doing DRAM lookups for the rest. Having an index allocated for each key, at least on Tx, could be a benefit.
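
One way to express that is a provisioning call that returns a driver-allocated index, which the transmit path then presents instead of repeating the hash lookup. This is purely a sketch; `quic_offload_add_key` and the handle layout are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

struct quic_offload_key_id;   /* lookup tuple from the sketch above */

/* Hypothetical handle: a small index into the NIC key table, handed back
 * by the kernel/driver when the key is installed. */
struct quic_offload_handle {
    uint32_t key_index;
};

/* Hypothetical control-plane call: install traffic-key material for a flow
 * and receive its index.  Returns 0 on success, negative errno on failure. */
int quic_offload_add_key(int sockfd,
                         const struct quic_offload_key_id *flow,
                         const uint8_t *key, size_t key_len,
                         struct quic_offload_handle *out);
```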

This brings up another discussion about making such a change: setsockopt/getsockopt are not good candidates for a bidirectional conversation between user space and the kernel. Historically, their buffers were used to silently carry information back to the user application in the provided buffer, but with the increased use of BPF this became a no-go; the original buffer is no longer delivered to the ULP as it was in earlier versions of the Linux kernel. A more up-to-date approach using Netlink sockets may be needed for this.
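
For reference, a minimal sketch of opening a generic netlink socket, which allows a proper request/response exchange rather than smuggling data back through a getsockopt buffer. The QUIC-offload netlink family itself is hypothetical here; only the socket setup uses existing APIs.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Minimal sketch: open a generic netlink socket for a bidirectional
 * control-plane conversation.  Unlike setsockopt/getsockopt, every request
 * gets an explicit, structured reply, and the kernel can also push
 * asynchronous notifications via multicast groups. */
static int open_offload_ctrl_socket(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
    if (fd < 0)
        return -1;

    struct sockaddr_nl local = { .nl_family = AF_NETLINK };
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }

    /* A real client would now resolve the (hypothetical) QUIC-offload
     * generic-netlink family ID via the nlctrl CTRL_CMD_GETFAMILY request,
     * then exchange family-specific commands with sendmsg()/recvmsg(). */
    return fd;
}
```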

The receive path, however, must do a lookup; that is unavoidable. And here again, a triad would not be enough for the lookup; source L3/L4 information would also be needed to correctly map the key in the hardware.
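
Sketched the same way, a receive-side match entry would therefore carry the remote L3/L4 as well (again, names are illustrative only):

```c
#include <stdint.h>
#include <netinet/in.h>

/* Hypothetical Rx match entry: the full 4-tuple plus the destination CID
 * carried in the short-header packet. */
struct quic_offload_rx_match {
    struct in6_addr src_addr;   /* remote (source) IP */
    uint16_t        src_port;   /* remote (source) UDP port */
    struct in6_addr dst_addr;   /* local IP */
    uint16_t        dst_port;   /* local UDP port */
    uint8_t         dcid_len;
    uint8_t         dcid[20];
};
```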

Weighing the pros and cons of a kernel-only QUIC encryption offload showed that hardware support is required to make it successful and justified.

The level of hardware support is yet to be seen. It might cover as little as encryption and decryption, or it might handle key rotation and next packet ID tracking internally in the hardware. Validating and weighing all these options shows that the QUIC control plane should stay within the QUIC library in user space, with the hardware doing the bare minimum needed to deliver improvements in the two main areas: memory bandwidth and CPU usage.

Having said this, it would also be necessary to define the data-path transmission, i.e. which attributes to send down to the hardware: the connection ID length is a must-have, and the next packet ID might also be needed in case the hardware has to fall back from crypto offload to software due to reconfiguration or some other inability to operate. Later resynchronization would be necessary if the sequence is maintained separately by the hardware and the software.
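
As a rough illustration, the per-send metadata this implies could be as small as the following (hypothetical names; it could travel as ancillary data on each send):

```c
#include <stdint.h>

/* Hypothetical per-send offload metadata.  The connection ID length lets the
 * hardware locate the packet number field in the short header; the next
 * packet ID allows resynchronization if the device falls back to software
 * crypto while the counter is kept in two places. */
struct quic_offload_tx_meta {
    uint32_t key_index;       /* index returned when the key was installed */
    uint8_t  cid_len;         /* DCID length in the short header */
    uint64_t next_packet_id;  /* next packet ID, for resync on fallback */
};
```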

The packet layout itself would require calibration of the user-space code: preallocating space for the QUIC header and filling it with values, preallocating space for the UDP and IPv6 headers to minimize copies, using MSG_ZEROCOPY on the path, adapting the same logic for use cases where a bypass engine acts as the transmission intermediary instead of the kernel, and so on. The GSO fragment size must also be considered in all these cases, since it is requested by the user-space application.
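
On Linux today this roughly maps to UDP GSO (UDP_SEGMENT) combined with MSG_ZEROCOPY, with user space laying out the QUIC, UDP and IP headers in the buffer before handing it over. A minimal sketch, with error handling and zero-copy completion handling (the socket error queue) omitted:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* UDP_SEGMENT */
#include <string.h>
#include <stdint.h>

/* Sketch: send a batch of equally sized QUIC packets, already laid out
 * back-to-back with their headers filled in by user space, as one GSO
 * buffer with zero-copy.  The application-chosen GSO segment size tells
 * the kernel/hardware how to split the buffer. */
static ssize_t send_quic_gso(int fd, const struct sockaddr_in6 *peer,
                             void *pkts, size_t total_len, uint16_t gso_size)
{
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    struct iovec iov = { .iov_base = pkts, .iov_len = total_len };
    char ctrl[CMSG_SPACE(sizeof(uint16_t))] = { 0 };
    struct msghdr msg = {
        .msg_name = (void *)peer, .msg_namelen = sizeof(*peer),
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = IPPROTO_UDP;            /* == SOL_UDP */
    cm->cmsg_type  = UDP_SEGMENT;            /* per-packet segment size */
    cm->cmsg_len   = CMSG_LEN(sizeof(uint16_t));
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

    /* MSG_ZEROCOPY completions arrive later on the socket error queue. */
    return sendmsg(fd, &msg, MSG_ZEROCOPY);
}
```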

Also, a socket might not be the only mechanism in non-bypass operation. io_uring is another method of delivering data, which applications could use together with custom kernel support if needed.
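
For example, with liburing the same prepared msghdr can be queued through a ring rather than issuing one sendmsg() syscall per packet; a minimal sketch (link with -luring):

```c
#include <liburing.h>
#include <sys/socket.h>

/* Sketch: queue a prepared msghdr (e.g. the GSO/zero-copy layout above)
 * through io_uring instead of a per-call sendmsg() syscall.  Assumes the
 * ring was set up earlier with io_uring_queue_init(). */
static int queue_quic_send(struct io_uring *ring, int fd, struct msghdr *msg)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;                      /* submission queue is full */

    io_uring_prep_sendmsg(sqe, fd, msg, MSG_ZEROCOPY);
    return io_uring_submit(ring);       /* completions arrive on the CQ */
}
```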

nibanks commented 1 year ago

Note - In the future, please separate out different topics into separate GitHub issues

> There are a few points worth mentioning for the user-kernel API. Depending on the use case, the destination L3/L4 and connection ID might not be sufficient. There are cloud use cases that use ephemeral destination ports and a zero-length connection ID while sharing the same transmitting socket on the server side. That particular use case would require the source connection ID to be part of the hash/key lookup value. Most of these complications in matching an encryption key to a flow come from such use cases and from QUIC being UDP based and not always using connected sockets.

> The Linux approach of reducing the transmit-side lookup to zero, with the hardware matching the flow by hash, might itself become a bottleneck. It would also increase memory pressure, since more information about the flow must be held than is needed to perform the lookup, and it adds pressure to cache some flow data while doing DRAM lookups for the rest. Having an index allocated for each key, at least on Tx, could be a benefit.

I assume having the index of the previously offloaded key would solve all the problems above, right? Though this does bring up an issue of access control: you have to make sure one app doesn't try to use the keys of another. Having a simple index might make it easier to steal/use another app's key.
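
One simple illustration of such a check: scope each installed key to the socket that provisioned it, so an index presented on any other socket is rejected. The structures below are made up purely to show the idea.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative kernel-side view: each installed key records its owner, and a
 * Tx request presenting an index is only honored if it arrives on the same
 * socket that provisioned the key. */
struct offloaded_key_slot {
    const void *owner_sock;   /* socket that installed the key */
    uint8_t     key[32];      /* traffic key material */
    bool        in_use;
};

static bool key_index_allowed(const struct offloaded_key_slot *table,
                              uint32_t table_size, uint32_t index,
                              const void *requesting_sock)
{
    return index < table_size &&
           table[index].in_use &&
           table[index].owner_sock == requesting_sock;
}
```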

> This brings up another discussion about making such a change: setsockopt/getsockopt are not good candidates for a bidirectional conversation between user space and the kernel. Historically, their buffers were used to silently carry information back to the user application in the provided buffer, but with the increased use of BPF this became a no-go; the original buffer is no longer delivered to the ULP as it was in earlier versions of the Linux kernel. A more up-to-date approach using Netlink sockets may be needed for this.

I don't know anything about "Netlink sockets". Would you mind elaborating or providing a good pointer to better understand? How would this work cross platform?

> The level of hardware support is yet to be seen. It might cover as little as encryption and decryption, or it might handle key rotation and next packet ID tracking internally in the hardware. Validating and weighing all these options shows that the QUIC control plane should stay within the QUIC library in user space, with the hardware doing the bare minimum needed to deliver improvements in the two main areas: memory bandwidth and CPU usage.

Agreed. The actual QUIC stack would own key rotation and any complicated logic of what and when to provision to the HW. The HW just uses what it has to do the CPU-intensive work.

> Having said this, it would also be necessary to define the data-path transmission, i.e. which attributes to send down to the hardware: the connection ID length is a must-have, and the next packet ID might also be needed in case the hardware has to fall back from crypto offload to software due to reconfiguration or some other inability to operate. Later resynchronization would be necessary if the sequence is maintained separately by the hardware and the software.

Currently, there is discussion about artificially requiring the same connection ID length globally, to reduce complexity on the HW. If we do this, then the length doesn't have to be passed down on the send path.

> The packet layout itself would require calibration of the user-space code: preallocating space for the QUIC header and filling it with values, preallocating space for the UDP and IPv6 headers to minimize copies, using MSG_ZEROCOPY on the path, adapting the same logic for use cases where a bypass engine acts as the transmission intermediary instead of the kernel, and so on. The GSO fragment size must also be considered in all these cases, since it is requested by the user-space application.

Needing to preallocate space for the UDP and IPv6 headers is slightly orthogonal to QEO, IMHO. It could be helpful even without QEO.