This issue collects requirements from the talk "Performance study of kernel TLS handshakes".

This effort extends the existing kTLS mechanism with TLS handshakes. In the proposed design, the kernel can handle the whole TLS connection, including the handshake, alerts, and data transmission and reception.

Besides 40-80% better performance, kernel TLS makes it possible to manage security-sensitive private keys in a separate privileged process and to work with TLS connections in worker processes that have direct access to neither the private keys nor the session keys.

This is a server-side only acceleration for TLS 1.3 (#1031) handshakes. More and more people advocate moving from TLS 1.2 to 1.3, so it makes no sense to bring the legacy protocol into a new kernel feature.

The socket API

Private key and certificate loading is described in https://github.com/tempesta-tech/tempesta/issues/1332 and is done by the privileged process responsible for managing the security-sensitive data. A worker process, which services remote user data, works with the socket API using the keys' IDs and has access to neither the private key nor the session keys. Depending on the implementation, both roles can be served by the same process, but separating them is preferable.

First, a normal socket (TCP for the current HTTP versions 1 and 2, or UDP for QUIC) is created and bound to a listening port:

sd = socket(AF_INET6, SOCK_STREAM, 0);
bind(sd, ...);

Next, a TLS context is created for the keyring my_sni_kr (see #1332) and setsockopt() is called to set up the TLS context for the listening socket:

struct tls12_crypto_info_aes_gcm_128 ci = {
    .versions = TLS_1_2_VERSION | TLS_1_3_VERSION,
    .tls_12_cipher_suites = ECDHE_ECDSA_AES128_GCM_SHA256 | ECDHE_ECDSA_AES256_GCM_SHA384,
    .tls_13_ciphersuites = TLS_AES_128_GCM_SHA256 | TLS_AES_256_GCM_SHA384,
    .ecurves = secp256r1 | x25519,
    .tls_keyring = my_sni_kr,
};
setsockopt(sd, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
setsockopt(sd, SOL_TLS, TLS_HS, &ci, sizeof(ci));

TBD: Note that the protocol versions, cipher suites, and elliptic curves are defined as flags, which leaves some flexibility for further development. However, this does not allow different settings to be defined for each SNI on the same listening socket. If that level of flexibility is required, then a netlink interface should be considered. There are cases when several pairs of private keys and certificates must be loaded for one SNI. Technically, in this case we can pass an array of keyrings into setsockopt(), or a user can make subsequent setsockopt() calls to pass a set of TLS settings descriptors.
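As a rough illustration of the flag-based configuration, the sketch below shows how a bitmask of enabled cipher suites could be matched against the suites offered by a client. The TLS13_F_* flag names, their bit values, and the suite_to_flag() and select_suite() helpers are hypothetical and not part of any existing kernel header; only the code points 0x1301 and 0x1302 come from RFC 8446.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values: one bit per supported TLS 1.3 cipher suite. */
enum {
    TLS13_F_AES_128_GCM_SHA256 = 1u << 0,
    TLS13_F_AES_256_GCM_SHA384 = 1u << 1,
};

/* Map an IANA cipher suite code point (RFC 8446) to its hypothetical flag. */
static uint32_t suite_to_flag(uint16_t iana_id)
{
    switch (iana_id) {
    case 0x1301: return TLS13_F_AES_128_GCM_SHA256;
    case 0x1302: return TLS13_F_AES_256_GCM_SHA384;
    default:     return 0; /* suite unknown to this sketch */
    }
}

/*
 * Return the first client-offered suite that is enabled in the
 * socket's flag mask, or 0 if there is no overlap.
 */
static uint16_t select_suite(uint32_t enabled, const uint16_t *offered, size_t n)
{
    assert(offered != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (suite_to_flag(offered[i]) & enabled)
            return offered[i];
    }
    return 0;
}
```

This sketch picks the suite in the client's order of preference; a real implementation might apply the server's order instead. Since the mask is a single value per socket, it also shows why per-SNI settings do not fit this interface.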

The TLS handshake happens on a listening socket. If the kernel implementation lacks some logic necessary to finish the handshake, but there is no error on the TLS layer, then new_sd, the descriptor returned by accept(), refers to a TCP socket with an established connection, and the user space caller must finish the handshake on its own.

listen(sd, ...);
int new_sd = accept(sd, ...);

Now the socket is ready to read and send data, with decryption and encryption done in the kernel:

read(new_sd, decrypted_data, ...);

The server can close the socket using the normal close(2) or shutdown(2) system calls. With the SHUT_RD mode of shutdown(2), any read operation from user space ends with an error, but the socket remains in reading mode so that TLS session closing works properly (at least to process TLS alerts).

Network processing

TLS handshakes must be done in softirq context, just like TCP. To speed up cryptographic computations, if the TLS handshakes configuration option is enabled, softirq must acquire the FPU context when it gets CPU time and release it on exit. This way TLS handshakes can be processed in batches, eliminating extra FPU context saves and restores.
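The effect of batching can be illustrated with a small user-space model. In the kernel, the corresponding x86 primitives are kernel_fpu_begin() and kernel_fpu_end(); the fpu_begin(), fpu_end(), and handshake_step() functions below are hypothetical stand-ins that only count calls and do no real work.

```c
#include <assert.h>

/* Counts simulated FPU context save/restore pairs. */
static unsigned int fpu_switches;

/* Hypothetical stand-ins for kernel_fpu_begin()/kernel_fpu_end(). */
static void fpu_begin(void) { fpu_switches++; }
static void fpu_end(void)   { }

/* Placeholder for one unit of SIMD-accelerated handshake work. */
static void handshake_step(int conn) { (void)conn; }

/* Naive scheme: acquire and release the FPU around every handshake. */
static unsigned int process_per_handshake(int nconn)
{
    assert(nconn >= 0);
    fpu_switches = 0;
    for (int i = 0; i < nconn; i++) {
        fpu_begin();
        handshake_step(i);
        fpu_end();
    }
    return fpu_switches;
}

/* Batched scheme: acquire once per softirq run, release on exit. */
static unsigned int process_batched(int nconn)
{
    assert(nconn >= 0);
    fpu_switches = 0;
    fpu_begin();
    for (int i = 0; i < nconn; i++)
        handshake_step(i);
    fpu_end();
    return fpu_switches;
}
```

For a batch of 64 handshakes, the model counts 64 save/restore pairs in the naive scheme and a single pair in the batched one.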

Processing context

While the current kTLS works in process context, TLS handshakes need a separate context to handle non-blocking sockets efficiently and correctly. Softirq provides such a context and the best performance. However, it seems that the heavy computations might lead to larger network latencies, see #1434. Work queues provide lower performance. Alternatively, dedicated per-CPU kernel threads can be used for handshakes.

Non-blocking sockets

Polling system calls (epoll(2), poll(2), select(2)) must report a TLS socket as ready only when the TLS handshake is completed.
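From the caller's side, nothing changes in how readiness is checked. The sketch below uses only the standard poll(2) interface on a local socket pair to show the "not ready, then ready" transition; under the proposed design, the same transition on a TLS socket would happen once the handshake is completed rather than when the first TCP data arrives. The wait_readable() and readiness_demo() helpers are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Return 1 if fd is readable within timeout_ms, 0 on timeout,
 * -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(fd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Demo on a local socket pair: the descriptor is reported as ready
 * only after the peer has written something. Returns 0 on success.
 */
static int readiness_demo(int *before, int *after)
{
    int sv[2];
    int rc = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    *before = wait_readable(sv[0], 0);      /* nothing written yet */
    if (write(sv[1], "x", 1) != 1)
        rc = -1;
    else
        *after = wait_readable(sv[0], 1000); /* data is pending now */

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```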

It seems we can use io_uring for the asynchronous socket operations. The design is TBD, but it is required since the mechanism reduces the number of system calls.

Fallback to a user space implementation

The kernel does not have to include all TLS features, but must accelerate the most used and most performance-critical logic. While parsing the ClientHello message, the implementation must either proceed with the handshake state machine, if it is able to complete the handshake, or return a socket with an established TCP connection from the accept() system call.

As few system calls as possible should be made on the fast path, so a new protocol family PF_TLS must be introduced to let user space check the status of the socket returned by accept(2):

struct sockaddr_in sa;
socklen_t sa_len = sizeof(sa); /* accept(2) takes a pointer to the address length */
int sd = accept(listen_fd, (struct sockaddr *)&sa, &sa_len);
if (sa.sin_family == PF_TLS) {
    /* we're good and can start to do our I/O operations on the socket */
} else {
    /*
     * The kernel doesn't support some TLS features, do the handshake on our own.
     * Call normal TLS library routines to accept a new TLS connection.
     */
}

If the kernel fails to complete the TLS handshake, then user space must be able to read the ClientHello message from sd and continue with the handshake. It's guaranteed that the first read operation won't block.
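On the fallback path, user space may want to check that the bytes waiting on sd really start with a ClientHello before handing the socket to a TLS library. The check below is a minimal sketch that follows the record and handshake header layout from RFC 8446; the macro names and the looks_like_client_hello() helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_REC_HANDSHAKE   22  /* ContentType handshake, RFC 8446 */
#define TLS_HS_CLIENT_HELLO 1   /* HandshakeType client_hello */
#define TLS_REC_HDR_LEN     5   /* type(1) + version(2) + length(2) */

/*
 * Return 1 if buf starts with a TLS record carrying a ClientHello,
 * 0 if it does not, and -1 if more bytes are needed to decide.
 */
static int looks_like_client_hello(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < TLS_REC_HDR_LEN + 1)
        return -1;
    if (buf[0] != TLS_REC_HANDSHAKE)
        return 0;
    if (buf[1] != 3)    /* major version byte of the record layer */
        return 0;
    /* The first byte after the record header is the handshake type. */
    return buf[TLS_REC_HDR_LEN] == TLS_HS_CLIENT_HELLO;
}
```

The buffer could be filled with recv(2) and the MSG_PEEK flag, so that the data stays in the socket for the TLS library to consume.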

Cryptography

Considerations that should probably be implemented as separate tasks for later milestones:

RSA 2048 and 4096, NIST p256, and x25519 must be supported in ECDSA and ECDHE.
Tempesta TLS focuses on server-side workloads and has specific hardware requirements. However, the Linux kernel version must work on a wider set of hardware, so we cannot rely on RDRAND and need to adjust the ECC algorithms to run in constant time without relying on fast randomization.
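As an example of the kind of constant-time building block this implies, the sketch below shows a conditional swap of two multi-limb values, the operation used by the Montgomery ladder in RFC 7748. It is an illustration only, not code from Tempesta TLS; the ct_cswap() name and the 64-bit limb layout are assumptions, and a production version would also need the generated machine code to be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Swap a and b (n 64-bit limbs each) if the low bit of swap is 1 and
 * leave them unchanged if it is 0. The loop performs the same
 * operations and memory accesses in both cases, so it has no branch
 * that depends on the secret bit.
 */
static void ct_cswap(uint64_t *a, uint64_t *b, size_t n, uint64_t swap)
{
    assert(a != NULL && b != NULL);
    uint64_t mask = (uint64_t)0 - (swap & 1); /* 0 or all ones */

    for (size_t i = 0; i < n; i++) {
        uint64_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```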

Usage PoC

It makes sense to test the implementation with some PoC, such as NFSD patched to use the kernel extension, and to make https://github.com/oracle/ktls-utils only load the certificate and key and handle, as a fallback, the handshakes that cannot be established by the kernel implementation.

Testing

The list of the tests is TBD.

[ ] TLS connection deadlocks from the IETF discussion TLS 1.3 and TCP interactions

Linux kernel TLS #1433