ruma / lb

Ruma wrapper for low-bandwidth matrix
MIT License

Proper DTLS support #7

Open ShadowJonathan opened 2 years ago

ShadowJonathan commented 2 years ago

It seems that DTLS support is more complex than I thought it would be in #3, so I'll use this issue to lay down my thoughts.


Basically, rust-openssl does not have good DTLS support. This is not a significant failure on rust-openssl's part, as openssl itself is not adhering to the RFC either.

See the following section from the RFC:

When a DTLS implementation receives a handshake message fragment, it MUST buffer it until it has the entire handshake message.

And compare it to the manpage for DTLSv1_listen, which is used to accept incoming DTLS sessions:

Since DTLSv1_listen() operates entirely statelessly whilst processing incoming ClientHellos it is unable to process fragmented messages (since this would require the allocation of state). An implication of this is that DTLSv1_listen() only supports ClientHellos that fit inside a single datagram.
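The buffering the RFC requires (and DTLSv1_listen explicitly does not do) can be sketched in Rust. This is a hedged sketch, not code from this repository: `FragmentBuffer` is a hypothetical helper that collects fragments keyed by offset and hands back the full message once the bytes cover it contiguously.

```rust
use std::collections::BTreeMap;

/// Buffer handshake fragments until the entire handshake message is present,
/// as the RFC requires. Fragments are keyed by their offset; once the
/// contiguous bytes cover `total_len`, the reassembled message is returned.
struct FragmentBuffer {
    total_len: usize,
    fragments: BTreeMap<usize, Vec<u8>>,
}

impl FragmentBuffer {
    fn new(total_len: usize) -> Self {
        FragmentBuffer { total_len, fragments: BTreeMap::new() }
    }

    /// Insert a fragment; returns Some(message) once reassembly is complete.
    fn insert(&mut self, offset: usize, data: Vec<u8>) -> Option<Vec<u8>> {
        self.fragments.insert(offset, data);
        // Check whether the fragments now cover 0..total_len without gaps.
        let mut covered = 0;
        for (&off, frag) in &self.fragments {
            if off > covered {
                return None; // gap before this fragment
            }
            covered = covered.max(off + frag.len());
        }
        if covered < self.total_len {
            return None;
        }
        // Assemble, trimming any overlap between fragments.
        let mut out = vec![0u8; self.total_len];
        for (&off, frag) in &self.fragments {
            let end = (off + frag.len()).min(self.total_len);
            out[off..end].copy_from_slice(&frag[..end - off]);
        }
        Some(out)
    }
}

fn main() {
    let mut buf = FragmentBuffer::new(4);
    assert!(buf.insert(2, vec![3, 4]).is_none()); // bytes 0..2 still missing
    assert_eq!(buf.insert(0, vec![1, 2]), Some(vec![1, 2, 3, 4]));
}
```

This is the state allocation that a stateless DTLSv1_listen cannot perform, which is exactly why it rejects fragmented ClientHellos.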


Looking around a little more, I found an issue pertaining to DTLS support in python trio, which has the following to say about it:

OpenSSL supports DTLS, but really as an afterthought, basically by wedging it into a TLS-shaped box.

And the following about openssl's multiplexing shortcomings;

A DTLS socket, like all UDP sockets, can handle lots of peers simultaneously, and act as both a client and a server to different peers. OpenSSL assumes that each transport has a single peer. So it's the user's job to figure out which packet belongs to which OpenSSL connection, and route them appropriately.

Solution: handle the actual socket I/O ourselves. When a packet comes in, use the source address to look up the appropriate OpenSSL connection object, and pass it in.
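That routing step can be sketched minimally. This is an illustrative sketch, not code from this project: `Session` is a hypothetical stand-in for an OpenSSL connection object driven through memory BIOs, and incoming datagrams are keyed by their source address.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical per-peer session state; in a real implementation this would
// wrap an OpenSSL connection object fed through memory BIOs.
struct Session {
    inbound: Vec<Vec<u8>>,
}

/// Route an incoming datagram to the session for its source address,
/// creating a fresh session on first contact (the approach the trio
/// issue describes: we do the socket I/O, OpenSSL only sees one peer).
fn route_packet(
    sessions: &mut HashMap<SocketAddr, Session>,
    src: SocketAddr,
    packet: Vec<u8>,
) {
    sessions
        .entry(src)
        .or_insert_with(|| Session { inbound: Vec::new() })
        .inbound
        .push(packet);
}

fn main() {
    let mut sessions = HashMap::new();
    let peer: SocketAddr = "127.0.0.1:4433".parse().unwrap();
    route_packet(&mut sessions, peer, vec![0x16, 0xfe, 0xfd]);
    route_packet(&mut sessions, peer, vec![0x17, 0xfe, 0xfd]);
    assert_eq!(sessions[&peer].inbound.len(), 2);
}
```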

About packets;

A DTLS socket is packet-based. OpenSSL uses a pluggable transport layer called "BIO"s, and they have a concept of a "packet BIO", but there's no built-in "memory packet BIO". So we can either implement our own BIO, or use some hacks to make the existing memory BIO work.

Solution: Making memory BIOs work for regular read/write calls is easy, because each read/write corresponds to a single packet. For handshakes, it's trickier, because a single handshake "volley" might include multiple packets, which OpenSSL will happily concatenate into the memory BIO's output buffer.
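Splitting a concatenated handshake volley back into per-datagram records is mechanical, because each DTLSv1.2 record carries its own length. A hedged sketch (not from this repository), assuming the standard 13-byte record header with the record length in the last two bytes:

```rust
/// Split a buffer of concatenated DTLS records (as OpenSSL writes them into
/// a memory BIO during a handshake volley) into individual records, each of
/// which should go out in its own datagram. A DTLSv1.2 record header is 13
/// bytes: type (1), version (2), epoch (2), sequence number (6), length (2).
fn split_records(mut buf: &[u8]) -> Option<Vec<&[u8]>> {
    let mut records = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 13 {
            return None; // truncated header
        }
        let len = u16::from_be_bytes([buf[11], buf[12]]) as usize;
        let total = 13 + len;
        if buf.len() < total {
            return None; // truncated record body
        }
        records.push(&buf[..total]);
        buf = &buf[total..];
    }
    Some(records)
}

fn main() {
    // Two fake records with bodies of 2 and 1 bytes respectively.
    let mut buf = Vec::new();
    for body in [&[0xAAu8, 0xBB][..], &[0xCC][..]] {
        buf.extend_from_slice(&[22, 254, 253]); // handshake type + DTLSv1.2 version
        buf.extend_from_slice(&[0u8; 8]); // epoch (2) + sequence number (6)
        buf.extend_from_slice(&(body.len() as u16).to_be_bytes());
        buf.extend_from_slice(body);
    }
    let records = split_records(&buf).unwrap();
    assert_eq!(records.len(), 2);
    assert_eq!(records[0].len(), 15);
    assert_eq!(records[1].len(), 14);
}
```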

About handshakes;

For regular data packets, DTLS has the same semantics as UDP: if the packet gets lost, then on well, too bad. But that doesn't work for handshake packets -- those have to arrive successfully, or nothing else works. So DTLS uses a timeout-based mechanism where if one side notices that the handshake hasn't been progressing, then it resends its last set of packets.

OpenSSL has some support for this built in. But! It's hard-coded to use the system clock (among other bits of awkwardness). And we want to use the Trio clock, to make autojump_clock still work. So, I think we'll probably want to handle the retransmits ourselves.
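Handling retransmits ourselves mostly means owning the timer. RFC 6347 (§4.2.4.1) recommends an initial timeout of 1 second, doubled on every retransmission, capped at 60 seconds. A minimal sketch of that backoff state, decoupled from any particular clock:

```rust
use std::time::Duration;

/// Retransmit timer following the DTLS 1.2 recommendation (RFC 6347
/// §4.2.4.1): start at 1 second and double after every timeout, with a
/// 60-second cap. The caller arms the timer against whatever clock it
/// likes (system, tokio, a test clock), which is the point of owning it.
struct RetransmitTimer {
    current: Duration,
}

impl RetransmitTimer {
    fn new() -> Self {
        RetransmitTimer { current: Duration::from_secs(1) }
    }

    /// Called when the timer fires without the handshake progressing:
    /// retransmit the last flight, then arm the returned (doubled) timeout.
    fn backoff(&mut self) -> Duration {
        self.current = (self.current * 2).min(Duration::from_secs(60));
        self.current
    }

    /// Called when the peer's next flight arrives; the timer resets.
    fn reset(&mut self) {
        self.current = Duration::from_secs(1);
    }
}

fn main() {
    let mut t = RetransmitTimer::new();
    assert_eq!(t.backoff(), Duration::from_secs(2));
    assert_eq!(t.backoff(), Duration::from_secs(4));
    for _ in 0..10 {
        t.backoff();
    }
    assert_eq!(t.backoff(), Duration::from_secs(60)); // capped
    t.reset();
    assert_eq!(t.backoff(), Duration::from_secs(2));
}
```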

And about the MTU, which I think is the biggest issue here:

The "path MTU" is the maximum size packet you can send to a particular destination without some router dropping it along the way. (It's a "path" MTU because packets to different destinations will pass through different routers, which might have different limits.) For example, the standard Ethernet MTU is 1500 bytes. So you normally can't send a UDP packet with 1600 bytes in it -- or, well, you can, but it will be instantly discarded.

[...]

OKAY. The other reason we need to know about MTUs is for the handshake. Handshake messages can potentially be really big, like tens of kilobytes, because certificate chains can be really big. Obviously if you try to stuff that into a single packet, then all your handshakes will fail and nothing will work at all. So DTLS has a mechanism to split a single handshake message up into multiple packets.

Now, what makes this tricky is that it interacts with retransmits. Remember how I said above that if handshake packets get lost, we have to handle our own retransmits? Well, one of the reasons they could get lost is that we're sending packets that are too big. So if our packets keep getting lost, we have to notice that and re-fragment the handshake message into new, smaller fragments.

Fortunately the fragmentation header fields are pretty simple: there's a single underlying handshake message you're trying to send, which we can read out from the packets that openssl generates; we split it up into whatever pieces we want, and then we slap on headers saying "these are bytes 0-999 of the handshake message", "these are bytes 1000-1999 of the handshake message", etc. So it's all doable, though it requires writing an actual DTLS handshake record parser, which is unfortunate.
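The re-fragmentation described above can be sketched directly. This is an illustrative sketch, not project code: `fragment` is a hypothetical helper, and the 12-byte header layout (msg_type, 24-bit length, message_seq, 24-bit fragment_offset, 24-bit fragment_length) is the DTLS handshake header from RFC 6347.

```rust
/// Re-fragment a single DTLS handshake message body into fragments no larger
/// than `max_frag` bytes of payload, each prefixed with the 12-byte handshake
/// header: msg_type (1), length (3), message_seq (2), fragment_offset (3),
/// fragment_length (3). All multi-byte fields are big-endian.
fn fragment(msg_type: u8, message_seq: u16, body: &[u8], max_frag: usize) -> Vec<Vec<u8>> {
    let total = body.len();
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < total {
        let frag_len = max_frag.min(total - offset);
        let mut pkt = Vec::with_capacity(12 + frag_len);
        pkt.push(msg_type);
        pkt.extend_from_slice(&(total as u32).to_be_bytes()[1..]); // 24-bit total length
        pkt.extend_from_slice(&message_seq.to_be_bytes());
        pkt.extend_from_slice(&(offset as u32).to_be_bytes()[1..]); // 24-bit offset
        pkt.extend_from_slice(&(frag_len as u32).to_be_bytes()[1..]); // 24-bit frag length
        pkt.extend_from_slice(&body[offset..offset + frag_len]);
        out.push(pkt);
        offset += frag_len;
    }
    out
}

fn main() {
    // A 2000-byte message split at 1000-byte fragments, msg_type 1 (ClientHello).
    let body = vec![0u8; 2000];
    let frags = fragment(1, 0, &body, 1000);
    assert_eq!(frags.len(), 2);
    assert_eq!(frags[0].len(), 12 + 1000);
    // Second fragment's offset field (bytes 6..9) reads 1000 = 0x0003E8.
    assert_eq!(frags[1][6..9].to_vec(), vec![0x00, 0x03, 0xE8]);
}
```

If retransmits keep failing, the same message can simply be re-run through this with a smaller `max_frag`, since all fragments describe offsets into the one underlying message.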

TL;DR: DTLS startup times might be borked due to packets getting dropped because of a too-high MTU.

This is relevant if we want to send packets over IPv4, which has a "worst case MTU" (as found here) of 576 bytes, minus 28 bytes of header overhead. That would only enable sure-fire block-wise transfers of 512 bytes, which, together with the problems described in #6, could halve the transfer speed. So potentially we would want to set up an "MTU prober" as well, somewhere in between sending packets and finalizing the handshake, to maximise transfer speeds.
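The arithmetic above works out as follows. This is a sketch under stated assumptions: the per-record overhead figure passed in is hypothetical (13 bytes of DTLS record header, plus whatever the negotiated cipher adds), and `max_block_size` is an illustrative helper, not an API of this project.

```rust
/// Worst-case IPv4 payload budget: the 576-byte minimum reassembly MTU
/// minus 28 bytes of IP + UDP header overhead, as described above.
const WORST_CASE_UDP_PAYLOAD: usize = 576 - 28; // 548 bytes

/// Largest power-of-two block size (16..=1024, as in CoAP block-wise
/// transfer) that fits in the payload budget after subtracting the
/// per-record `overhead` (assumed: 13-byte DTLS record header + cipher).
fn max_block_size(overhead: usize) -> usize {
    let budget = WORST_CASE_UDP_PAYLOAD.saturating_sub(overhead);
    let mut block = 1024;
    while block > 16 && block > budget {
        block /= 2;
    }
    block
}

fn main() {
    // With just the 13-byte record header, 548 - 13 = 535 bytes remain,
    // so 512 is the largest power-of-two block that fits; 1024 does not.
    assert_eq!(max_block_size(13), 512);
}
```

This is why the worst case pins us to 512-byte blocks: the next step up, 1024, overshoots the 548-byte budget before any DTLS overhead is even counted.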

I don't want to use Path MTU discovery for this: it assumes a best-case scenario first and then backs off when it doesn't encounter one, which could delay a handshake by several RTTs, and we want to be speedy.


The issue also makes me think that DTLSv1.3 is, to a degree, not really "compatible" with 1.0 and 1.2, just like TLSv1.3, so it might not be "as simple" to multiplex and support all of them simultaneously in an implementation.


Additionally, the issue also has this bit;

There are two APIs for this: DTLS_set_link_mtu, and SSL_set_mtu. The former is supposed to be passed the link-layer MTU (e.g. 1500 for ethernet), and then it queries the BIO to ask what the header overhead is for this particular socket. Of course, since we'll be using memory BIOs, this doesn't work. OTOH, SSL_set_mtu is passed the MTU after this overhead is accounted for (e.g. 1500-28=1472 for UDP over IPv4 over ethernet). So that's what we want. HOWEVER, at the end of the handshake, OpenSSL normally discards whatever you passed to SSL_set_mtu and then tries to query the BIO for it. To avoid this, you have to set SSL_OP_NO_QUERY_MTU.

(No, none of this is documented, why do you ask?)

...which doesn't exactly inspire confidence in using openssl for this :(


The issue also talks about cookie validation, and how openssl is absolutely not helpful with that.


DTLS handshakes (according to the RFC) also look to be 2 RTTs, but potentially more as packets get lost (due to MTU or the like).


go-coap

How does go-coap deal with this? Well, it uses a "custom" Golang DTLS implementation, completely separate from openssl, thus avoiding most of these issues.

If this is the state of openssl support, and Go has to roll its own implementation, I don't think that bodes entirely well for the DTLS ecosystem.

rustls has an issue open for DTLS support, and I suspect they're facing the same confusion/problems openssl has with jamming DTLS in there; the latest comment from a member there reflects this.


There exists another implementation of DTLS in Rust (webrtc has one), but its API is... weird; it's certainly not geared towards low-level use such as what we're trying to do.

ShadowJonathan commented 2 years ago

For a potential API, we need to separate sync and async entirely, but we should (and probably could) keep a sync-based "central" struct which handles all the dirty work and so forth.

Note: These design ideas are superseded by https://github.com/ShadowJonathan/dtls-rs/issues/1

I'm currently thinking about these sync objects;

And these async objects;

With "multiplexing" I'm reflecting option 2 I had mentioned here but have not properly elaborated upon: whoever is calling listen on the socket at any time could be awaiting a Condvar (or something like it), or could be performing a raw recv on the socket itself, bouncing it between the threads that are waiting for it.

If a thread receives a packet for a source it is not itself waiting for, it pushes the packet to an internal buffer and notifies the corresponding thread to wake up to the new data, while it polls the socket again.
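The buffer-and-notify part of that "rotating doorman" idea can be sketched with a Mutex plus Condvar. This is a hedged sketch, not the eventual API: `Demux`, `deliver`, and `wait_for` are hypothetical names, and the socket-polling side is omitted to keep it self-contained.

```rust
use std::collections::{HashMap, VecDeque};
use std::net::SocketAddr;
use std::sync::{Condvar, Mutex};

/// Shared demultiplexer state: packets for sources nobody is currently
/// reading go into per-source buffers, and the Condvar wakes whichever
/// thread is waiting for that source.
struct Demux {
    buffers: Mutex<HashMap<SocketAddr, VecDeque<Vec<u8>>>>,
    ready: Condvar,
}

impl Demux {
    fn new() -> Self {
        Demux { buffers: Mutex::new(HashMap::new()), ready: Condvar::new() }
    }

    /// The "doorman" thread calls this when it receives a packet addressed
    /// to a source other than the one it is itself waiting for.
    fn deliver(&self, src: SocketAddr, packet: Vec<u8>) {
        self.buffers.lock().unwrap().entry(src).or_default().push_back(packet);
        self.ready.notify_all();
    }

    /// A session thread blocks here until a packet for its source shows up
    /// (in the full design it would instead take a turn at the raw recv).
    fn wait_for(&self, src: SocketAddr) -> Vec<u8> {
        let mut buffers = self.buffers.lock().unwrap();
        loop {
            if let Some(pkt) = buffers.get_mut(&src).and_then(|q| q.pop_front()) {
                return pkt;
            }
            buffers = self.ready.wait(buffers).unwrap();
        }
    }
}

fn main() {
    let demux = Demux::new();
    let src: SocketAddr = "127.0.0.1:9000".parse().unwrap();
    demux.deliver(src, vec![1, 2, 3]);
    assert_eq!(demux.wait_for(src), vec![1, 2, 3]);
}
```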

This model could also become a base for the async tokio model, in which this would potentially be much more efficient (as awaiting a socket there is event-based).

The same could be applied to DTLSClient and DTLSServer in sync land, where any thread or operation awaiting on a derivative DTLSSession would await the socket itself, and act as the "rotating doorman" for that socket.


All of this is needed because UdpSocket does not produce something like a TcpStream when connecting to an endpoint. Once a UdpSocket has been connected to an endpoint, it will discard any packet from any other endpoint, so taking ownership of the UdpSocket and setting a filter in front of it removes any capability for that UdpSocket (addr + port) to listen to other endpoints, requiring new sockets, which may not be something we want.
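The filtering behaviour described above can be demonstrated with std's UdpSocket on loopback; this is just an illustration of the standard-library behaviour, not project code.

```rust
use std::net::UdpSocket;
use std::time::Duration;

// Once `a` is connected to `b`, the kernel drops datagrams arriving at `a`
// from any other source, so `a` can no longer serve multiple peers.
fn demo() -> std::io::Result<()> {
    let a = UdpSocket::bind("127.0.0.1:0")?;
    let b = UdpSocket::bind("127.0.0.1:0")?;
    let c = UdpSocket::bind("127.0.0.1:0")?;

    a.connect(b.local_addr()?)?; // `a` now only accepts packets from `b`
    a.set_read_timeout(Some(Duration::from_millis(200)))?;

    c.send_to(b"hello", a.local_addr()?)?; // from the "wrong" peer
    let mut buf = [0u8; 16];
    // The datagram from `c` is discarded, so this recv times out.
    assert!(a.recv(&mut buf).is_err());

    b.send_to(b"hi", a.local_addr()?)?; // from the connected peer
    let n = a.recv(&mut buf)?;
    assert_eq!(&buf[..n], b"hi");
    Ok(())
}

fn main() {
    demo().unwrap();
}
```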

ShadowJonathan commented 2 years ago

I've created the multiplexing library here; https://github.com/ShadowJonathan/exit-left

ShadowJonathan commented 2 years ago

Currently working on it here; https://github.com/shadowjonathan/dtls-rs