Open chrysn opened 3 years ago
Concrete data point from one embedded-nal implementation (RIOT): the stack keeps a receive buffer per connection. It currently has no facility to extract the number of pending bytes, but it ought to be possible to dig that information out; it is conceptually there somewhere. With my maintaining-riot-wrappers hat on, I'd suggest a TcpClientStackWithReadExact, which RIOT could implement at some point.
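For illustration, a minimal sketch of what such an extension trait could look like, assuming the nb-based TcpClientStack trait of embedded-nal; the trait name follows the suggestion above and nothing here is an agreed-upon API:

```rust
use embedded_nal::TcpClientStack;

/// Hypothetical extension trait; name taken from the suggestion above.
pub trait TcpClientStackWithReadExact: TcpClientStack {
    /// Like receive, but returns WouldBlock until `buffer` can be filled
    /// completely, and only then dequeues the data from the stack.
    fn receive_exact(
        &mut self,
        socket: &mut Self::TcpSocket,
        buffer: &mut [u8],
    ) -> nb::Result<(), Self::Error>;
}
```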
Taking the other implementer hat, that of std-embedded-nal: it's trivial, because read_exact is there. (But if it weren't, it could be emulated by making the stream peekable through a heap vec; for that implementation that would be acceptable, for others it probably would not be.)
I worry that any attempt at forcing buffering in the embedded-nal layer would preclude the zero-copy implementations mentioned in https://github.com/rust-embedded-community/embedded-nal/issues/12, which I believe would be quite useful.
If we could make a small, easy shim default implementation, I would be okay with implementing this, but personally I'd like to keep the traits as small and lightweight as possible.
I'm not convinced this necessarily conflicts with zero-copying on a conceptual level. (The concretely proposed receive_exact method does, but then again so does receive; a bytes_available or await_ready_bytes(n) would not.) To some extent this may even help a TCP stack keep copies down with the zero-copy interfaces, in that it tells the stack to set aside this much space for a later read. (That wouldn't be zero-copy exactly in most stacks, but still one less copy at the interface between stack and application; zero-copy access to contiguous data in TCP can, in general, hardly be provided.)
A shim should work: a WithReadBuffer<const S: usize> could wrap any TcpClient (more precisely, its sockets) and provide a receive_exact function backed by some form of [u8; S], whereas backends that can do this natively could implement WithReadBuffer themselves. (A provided .buffered<S>() method that is overridden to return just self is probably a sensible way to construct those.) I'd give it a try.
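To make the shim idea more tangible, here is a rough sketch under the assumption of the nb-based TcpClientStack trait; the struct name and method follow the paragraph above, while everything else (field names, the copy-based bookkeeping) is made up for illustration:

```rust
use embedded_nal::TcpClientStack;

/// Wraps a stack's socket together with an S-byte staging buffer.
pub struct WithReadBuffer<Stack: TcpClientStack, const S: usize> {
    socket: Stack::TcpSocket,
    buffer: [u8; S],
    /// Bytes already pulled out of the stack but not yet handed to the application.
    filled: usize,
}

impl<Stack: TcpClientStack, const S: usize> WithReadBuffer<Stack, S> {
    /// Returns WouldBlock until out.len() bytes are available, buffering
    /// partial reads in the meantime.
    pub fn receive_exact(
        &mut self,
        stack: &mut Stack,
        out: &mut [u8],
    ) -> nb::Result<(), Stack::Error> {
        assert!(out.len() <= S, "request exceeds shim buffer");
        if self.filled < out.len() {
            // Pull whatever the stack has right now into the staging buffer.
            let n = stack.receive(&mut self.socket, &mut self.buffer[self.filled..])?;
            self.filled += n;
        }
        if self.filled < out.len() {
            return Err(nb::Error::WouldBlock);
        }
        out.copy_from_slice(&self.buffer[..out.len()]);
        // Keep any surplus bytes for the next call.
        self.buffer.copy_within(out.len()..self.filled, 0);
        self.filled -= out.len();
        Ok(())
    }
}
```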
Revisiting this after the RIOT summit, I think that I can describe more clearly now how this can work with zero copy:
For UDP, zero-copy input on scatter-gather backends will need some form of iterating or chained data retrieval. For example, if the network library manages a "heap" in the form of 256-byte slabs and a 600-byte message is received, then the read result for a single message will, in some form, be an owned handle on the three involved slabs.
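Purely to illustrate the shape of such a result (none of these types exist in embedded-nal), a received 600-byte datagram on a 256-byte-slab stack could be handed out roughly like this:

```rust
const SLAB_SIZE: usize = 256;

/// An owned slab handed out by a hypothetical stack; dropping it would
/// return the memory to the stack's pool.
pub struct Slab {
    data: [u8; SLAB_SIZE],
    len: usize,
}

/// The zero-copy result of reading one datagram: the slabs it spans
/// (three of them for a 600-byte message).
pub struct ScatteredMessage<const N: usize> {
    slabs: [Option<Slab>; N],
}

impl<const N: usize> ScatteredMessage<N> {
    /// Iterate over the message's contiguous chunks without copying.
    pub fn chunks<'a>(&'a self) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.slabs.iter().flatten().map(|s| &s.data[..s.len])
    }
}
```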
For TCP, using the traditional "read" semantics (which may return any smaller number) sidesteps the need for this -- but at the cost that the application needs the ability to carry its processing state across the suspension point (be it async or blocking) between reception of the first and the last part of a "message" -- so TCP would not need this scatter-gather output. (And doing that persistence in the application might even have adverse effects on backpressure propagation, but I'm not sure about that.)
But once we go for zero-copy, we'll need to have a UDP story anyway, so the "read exactly" output would just follow the pattern established there (because in essence it's the same thing: it's reading a full message, and not a number of bytes, from a socket). So it doesn't really make things harder for the zero-copy future.
(The approaches of get-number-of-pending-bytes and the read-from-socket-but-acknowledge-separately do not need this argument, but might still be on the table.)
A concrete proposal for a trait now lives in https://gitlab.com/chrysn/embedded-nal-tcpextensions, along with a wrapper that gives a fixed-sized buffer for any TcpClientStack implementation. (Not that I'd particularly recommend doing that, as I still maintain that most TCP stacks should manage to implement this without extra buffering, but it's convenient transitionally).
Using this, I noticed that there is an analogous issue with sending, which needs a send_all (or nothing). The argument is the same: a library building on nb cannot rely on being called next on the same socket with just the same parameters, so once it has sent some but not all bytes, nb pretty much falls apart. (The library has committed to some course of action by writing some bytes, and would need to persist all its state to finish things, but that's not how things should be done in nb.) Just as there is the option of a "get number of pending bytes" for reception, a "get number of free bytes in the send buffer" would do to replace send_all. (And it may be more convenient to use: with send_all I need to put everything into one buffer, whereas with a wait_for_sendbuffer I could just send things piecemeal knowing it won't WouldBlock.)
[edit: Updated proposal URI]
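To make the two send-side options concrete, here is a sketch of both as an nb-style extension trait; the method names follow the comment above and are illustrative only, not part of embedded-nal:

```rust
use embedded_nal::TcpClientStack;

/// Hypothetical send-side extension trait.
pub trait TcpSendExt: TcpClientStack {
    /// Either enqueues all of data or none of it (returning WouldBlock in
    /// the latter case), so an nb-based caller never has a partial send to
    /// track across retries.
    fn send_all(
        &mut self,
        socket: &mut Self::TcpSocket,
        data: &[u8],
    ) -> nb::Result<(), Self::Error>;

    /// Returns WouldBlock until at least n bytes fit into the send buffer;
    /// after Ok(()), sends totalling up to n bytes are expected not to
    /// block, so data can be produced and sent piecemeal.
    fn wait_for_sendbuffer(
        &mut self,
        socket: &mut Self::TcpSocket,
        n: usize,
    ) -> nb::Result<(), Self::Error>;
}
```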
The proposal has been extended to also include a send_all, along with the hard-buffered implementation.
Implementing the CoAP server using this, I noticed that doing good stateless nb is not trivial even with these, and my implementation doesn't yet make perfect use of the new features -- still, it all appears to be in working shape. Things would be easier if we had functions that WouldBlock while there is not enough space in the buffer to send n bytes (without actually trying to send them yet), or peeking on the read side, but both are probably harder for stacks to implement; in particular, I'm not sure the latter is even possible on Linux, let alone on std::net.
While I am rather sure that the current design (at least the read side) works with nb, if we go with async (#6), a send_all (or receive_exact) will not cut it any more: it would work (as in "can be implemented", even more easily than now), but it would mean that the application's buffer (the argument, which is sent from or received into) is held across a suspension point. That's detrimental no matter whether the application buffer is on the stack or just a locked shared buffer: in the former case, the "local" variable winds up in the task state enum; in the latter case, the lock is kept across a suspension point.
My plan to go on depends on how #6 is viewed here:
If there is indication that we'll want to port to async, I'd rather start a new trait (maybe TcpPredictiveClient) with awaitable recvbuf_has(n) and sendbuf_ready_for(n) functions. For anything that implements this new trait, the old one can be implemented trivially (even automatically, if we had specialization).
The new trait would be harder to implement on std (a generic implementation may even need to allocate) -- but if it's implementable well on embedded targets (which don't carry around the POSIXish not-very-CoW not-very-async assumptions of the OSs that back std), maybe that's not too big of an issue.
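A sketch of that direction, assuming async-fn-in-trait; the trait and method names follow the paragraph above and none of this is an existing embedded-nal API:

```rust
/// Hypothetical "predictive" async client trait.
pub trait TcpPredictiveClient {
    type TcpSocket;
    type Error;

    /// Resolves once at least n received bytes are buffered for the socket,
    /// so a subsequent receive of up to n bytes does not have to wait.
    async fn recvbuf_has(
        &mut self,
        socket: &mut Self::TcpSocket,
        n: usize,
    ) -> Result<(), Self::Error>;

    /// Resolves once the send buffer can take at least n more bytes, so the
    /// application buffer does not need to live across the await point of
    /// the eventual send.
    async fn sendbuf_ready_for(
        &mut self,
        socket: &mut Self::TcpSocket,
        n: usize,
    ) -> Result<(), Self::Error>;
}
```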
I'd highly appreciate any feedback on the way forward in here.
This is partially resolved in the -async version now: A TCP connection implements embedded_io::asynch::Read, which provides a read_exact method.
(Unlike in sync and nb, this can simply be provided because the buffer to read into is available across polling points; things would only get tricky again in a zero-copy version thereof).
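As an example of the pattern this enables (the exact module path and error types depend on the embedded-io version; this assumes the async Read trait referenced above and a made-up 2-byte length prefix):

```rust
use embedded_io::asynch::Read;

/// Read one length-prefixed message into buf, returning its length.
async fn read_length_prefixed<R: Read>(
    conn: &mut R,
    buf: &mut [u8],
) -> Result<usize, embedded_io::ReadExactError<R::Error>> {
    let mut header = [0u8; 2];
    conn.read_exact(&mut header).await?;
    let len = usize::from(u16::from_be_bytes(header)).min(buf.len());
    // The borrow of buf is held across this await point -- exactly the
    // property discussed above that a zero-copy version would avoid.
    conn.read_exact(&mut buf[..len]).await?;
    Ok(len)
}
```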
It might be helpful to have a receive_exact method, similar to std::io's read_exact. That method would block if some (but not enough) bytes are available to be read.

Why
Such a method would make it easy to implement TCP-based message exchange protocols that have some kind of initial length description. Per connection, the application would only need to store the previously read header (indicating the length, and possibly the type), and could then wait for the rest to arrive. Per-connection state would be the size of a Stack::TcpSocket plus a few bytes -- rather than having to keep a maximally sized buffer around per connection.
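Purely as an illustration of that state size (the 4-byte header is a stand-in for whatever the protocol defines, and receive_exact is the method proposed here, not an existing API):

```rust
/// All the state such a protocol handler would need per connection:
/// the socket handle plus the partially decoded header. The message body
/// would later be received exactly into a caller-provided buffer.
struct Connection<Stack: embedded_nal::TcpClientStack> {
    socket: Stack::TcpSocket,
    /// Header already read in full, if any; set back to None once the
    /// corresponding body has been received via receive_exact.
    pending_header: Option<[u8; 4]>,
}
```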
Why not
Providing receive_exact pushes the onus of buffering that data to the network stack. Most can probably[1] provide a function like that easily, but what is the consequence of programs actually exercising it?
If 4 connections are open and each simultaneously receives a short TCP segment indicating that 1KiB of data is coming up, then a stack with a 2KiB network buffer is in no position to serve any of them, and they all starve.
Then again, such locking situations can probably already come up without this: e.g. an application sees traffic on one socket, decides to plunge in there, and only polls that socket until the message is there -- but the message never arrives, because the other peers were a tad faster and filled up the network stack's buffers, which now don't get drained because the application is set on processing this connection first.
[1]: I don't know sufficiently many well enough to really say.
Alternatives
A function to get the number of pending bytes, or a pair of functions to peek at data but only acknowledge it separately, would have the same positive and negative effects. On some stacks (e.g. on std), only receive_exact (but not peeking or count-peeking) would be easy to implement.

This doesn't really change when async (#47) is added: sure, in async one can just (stack- or slab-) allocate a receive buffer and read it out piecemeal, but depending on how the corresponding tasks are allocated, this brings us either back to having a full receive buffer for every connection as in "Why" (if there's a task per connection and all tasks are fully allocated), or into the commit-and-starve situation of "Why not" (because there's a large task running and the others can't be started until that one is done).
Steps forward
I'd hope for some discussion here, informed by different implementers' experiences, before the technical points (bump the trait? make it runtime-fallible with a default impl that says it doesn't work? add a TcpClientStackWithReadExact?) need to be addressed.