sustrik / libdill

Structured concurrency in C
MIT License

[Feature]: allowing partial receive (short read on timeout or read current available) #171

Open jadeblaquiere opened 5 years ago

jadeblaquiere commented 5 years ago

In implementing the socks5proxy example I ended up starting two concurrent coroutines to forward traffic (one from local to remote and one from remote to local) for each connection through the proxy. The implementation is:

coroutine void forward(int in_s, int out_s, int ch) {
    while (1) {
        uint8_t d[1];
        if(brecv(in_s, d, 1, -1)) break;   // read exactly one byte, blocking indefinitely
        if(bsend(out_s, d, 1, -1)) break;  // forward it to the other socket
    }
    chdone(ch);                            // signal on the channel that this direction is done
    return;
}

Coroutines seem to me to be a simple (and elegant) solution to the problem of handling multiple full-duplex connections. Each forward() coroutine handles one half-duplex direction (and channels work nicely enough to signal that the connection is closing so it can be shut down cleanly). The proxy is, by design, ignorant of the protocol stream, so it cannot know how many more bytes are coming or predict whether the next traffic arrives from local or remote. All it knows is that whatever comes in one side goes out the other.
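
For context, the wiring around those coroutines looks roughly like this -- a minimal sketch with illustrative names; the channel plumbing in the actual socks5proxy example may differ:

int ch[2];
if(chmake(ch)) { /* handle error */ }
// One coroutine per direction; both use the same channel end to signal "done".
int cr1 = go(forward(local_s, remote_s, ch[0]));
int cr2 = go(forward(remote_s, local_s, ch[0]));
// Block until one direction calls chdone(); chrecv() then fails with EPIPE.
int sig;
chrecv(ch[1], &sig, sizeof(sig), -1);
hclose(cr1);
hclose(cr2);
hclose(ch[0]);
hclose(ch[1]);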

Unfortunately, this code strikes me as fairly inefficient as all reads/writes are a single byte. Calling brecv() on a tcp socket leads to a handle lookup (maybe cached) and then eventually passes down to read the underlying fd with recvmsg().

recvmsg() will happily return whatever data is available up to the length you request. However, the stated semantics of brecv() indicate "There is no such thing as partial receive. If a problem, including timeout, occurs while receiving the data, an error is returned to the user and the socket cannot be used for receiving from that point on" and the implementation in tcp.c follows that semantic.

There seems to be deliberate intent in the design to forgo short reads. Often such assumptions significantly simplify the handling of corner cases, so I'm curious what the logic behind it is. It would seem to me that if short reads were supported, the following could be far more efficient (incoming data is likely arriving in packets anyway, potentially much larger than a single byte):

coroutine void forward(int in_s, int out_s, int ch) {
    while (1) {
        uint8_t d[4096];
        if(sock_fdin(in_s, -1)) break; // wait for socket to become readable 
        ssize_t s = brecv(in_s, d, 4096, 0); // read all the available data (up to buffer size), zero deadline (proposed semantics)
        if((s < 4096) && (errno != EWOULDBLOCK)) break; // exit on error other than empty
        if(bsend(out_s, d, s, -1)) break; // write all the bytes read
    }
    chdone(ch);
    return;
}

What do you think?

sustrik commented 5 years ago

It's actually the central tenet of libdill's design: it tries to avoid both callbacks and state machines. Once you allow partial reads or writes, you have to keep track of where in the stream you are, how many bytes are yet to be read, etc., and then you are back to either callback hell or state-machine hell.

If you really want a super-optimized solution you can still use raw fdin() and fdout() + partial POSIX reads/writes.
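
For completeness, a rough sketch of what that lower-level path can look like, assuming in_fd and out_fd are ordinary non-blocking OS file descriptors that you created yourself with the POSIX API (i.e. not handles pulled out of libdill sockets):

coroutine void forward_raw(int in_fd, int out_fd) {
    uint8_t buf[4096];
    while(1) {
        // Let libdill park this coroutine until the descriptor is readable.
        if(fdin(in_fd, -1)) break;
        // Partial POSIX read: take whatever is available, up to the buffer size.
        ssize_t n = recv(in_fd, buf, sizeof(buf), 0);
        if(n == 0) break;                          // peer closed the connection
        if(n < 0) {
            if(errno == EAGAIN || errno == EWOULDBLOCK) continue;
            break;                                 // real error
        }
        // Write everything out, waiting for writability as needed.
        size_t sent = 0;
        while(sent < (size_t)n) {
            if(fdout(out_fd, -1)) return;
            ssize_t w = send(out_fd, buf + sent, n - sent, 0);
            if(w < 0) {
                if(errno == EAGAIN || errno == EWOULDBLOCK) continue;
                return;
            }
            sent += w;
        }
    }
}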

jadeblaquiere commented 5 years ago

I like how libdill treats all bytestreams the same. It is a good abstraction. The underlying transport could be a plain TCP connection or it could be a TLS stream. If you go back to the base file descriptors (for a raw TCP connection), then you need completely different logic for TLS/OpenSSL. libdill makes tunnelling easy.

Avoiding callbacks and state machines is a solid design principle. After thinking about it over the weekend I'm not yet convinced that I would throw managing stream position in with state machines generally... but there is definitely state involved.

The worst case to me would be if the interface forced programs (users) to implement their own buffering. If the default were to allow short reads (as it is for the POSIX recv() call unless you add the MSG_WAITALL flag), that would force programs to be tolerant of partial results. However, there is a significant difference between forcing something and allowing it as a choice.

What I do like about the concept of deadlines is they give the program a choice in how to bound timeliness. The program can choose to allow an indefinite delay (deadline = -1), a bounded delay (deadline = now() + value) or no delay (deadline = 0).

Having a timeout as a fatal error limits the utility of that concept (and actually makes a choice for the program, i.e. to respond to a timeout by aborting).
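
Concretely, those three choices look like this with the stock API (and under the current semantics, if either deadline below expires, the error is fatal for the socket):

// 's' is an open libdill bytestream socket.
uint8_t buf[256];
brecv(s, buf, sizeof(buf), -1);          // wait indefinitely for 256 bytes
brecv(s, buf, sizeof(buf), now() + 100); // wait at most 100 ms for 256 bytes
brecv(s, buf, sizeof(buf), 0);           // succeed only if 256 bytes can be read without blocking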

So here's my proposal: let the program choose how to handle timeouts:

  1. Internally, treat a timeout as a non-fatal error. Subsequent operations are allowed.
  2. Depending on the value of deadline there are three behaviors:
     a) deadline = 0: non-blocking read that returns whatever data is currently available (potentially short), up to N bytes. If short, return the count and set errno = EWOULDBLOCK; the program can treat this as fatal and close the connection, or not.
     b) deadline = value: if the deadline is reached without filling the buffer, return the count and set errno = ETIMEDOUT; the program can treat this as fatal and close the connection, or not.
     c) deadline = -1: no short reads allowed; either return count = n, or return -1 and errno = ECANCELED/ECONNRESET (these are fatal).

Internally, this would mean the libdill implementations would need to handle these short-read conditions and, where appropriate, pass back data and/or signals indicating partial data (something like: if you got half a message, buffer the half you received and pass back EWOULDBLOCK).
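
As a purely hypothetical illustration of option 2b, a forwarding loop under the proposed semantics might look like this (assuming brecv() returned the byte count and left the socket usable after ETIMEDOUT, neither of which is true of libdill today):

// Inside a forwarding loop:
uint8_t buf[4096];
ssize_t n = brecv(in_s, buf, sizeof(buf), now() + 50);   // bounded wait (case b)
if(n < 0) break;                                         // fatal error: close and exit
// If the deadline hit first, n is short and errno == ETIMEDOUT, but the
// socket stays usable -- so just forward what arrived and loop again.
if(n > 0 && bsend(out_s, buf, n, -1)) break;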

If this is too much of a change... do you see a good way to make timeouts non-fatal?

sustrik commented 5 years ago

Imagine implementing one of the socks5_connect functions so that it supports non-destructive timeouts.

When the timeout is reached you would have to save the current state of the protocol ("reading field no. 3, we already have 5 bytes") and then be able to restart it from that point. So you are back to state machines.

If, on the other hand, you did not allow restart, you would basically be introducing undefined behaviour. When the timeout happens, the user has no idea how much of the protocol has already executed, so any subsequent attempt to use the socket would lead to weird behaviour. The only sane thing to do in that situation is to close the socket, which is exactly what libdill does at the moment.

jadeblaquiere commented 5 years ago

I agree that for a multi-step/handshake protocol stopping in the middle and saving state would be either complex or fragile or likely both.

This is where I think you can draw a reasonable line between handling/managing stream position vs. multiple steps in a state machine. Restarting a state machine implies returning to the previous state (to state the obvious). Restarting at a previous stream position works pretty much the same as starting with an empty buffer: you just read data from the stream.

So here's where I would draw the line:

If you timeout in brecv, short reads are allowed. If there is a timeout then the socket could still support subsequent reads.

---- the line ----

If you timeout on a brecv call inside of socks5_client_connect, then it's up to socks5_client_connect to decide whether the timeout is fatal (and in this case the recovery would be complex, so it could simply close the socket and fail). Subsequent reads from the stream should fail.

For internal consumers of the brecv API (libdill components), the library determines how they handle a timeout and whether it is a fatal condition for the protocol.

For external consumers, it's their choice how short reads from brecv are handled.

Is this just too much of a slippery slope?

sustrik commented 5 years ago

Yes, I think it's a slippery slope. Imagine a protocol that interleaves unstructured data with keepalives. From the outside it would look like a regular bytestream protocol, but internally it would need a state machine to track where it was when the timeout expires.

But let's get back to basics: Why would you want partial reads in the first place? The only reason I can think of is manual network scheduling. And that's counter to the design of libdill, which is supposed to do the scheduling for you.

jadeblaquiere commented 5 years ago

What got me started down this path was having a situation where you don't know in advance how many bytes are coming. In the application I'm working on I'm tunneling a tcp stream through one (or potentially a few) SSL connections. Endpoint nodes have some visibility to the protocol but midpoint nodes are intentionally unable to determine what kind of messages are being passed. Midpoint nodes just read from one SSL stream and write to another.

libdill provides a very elegant solution to tunneling protocols (They're bytestreams, after all), but reading and writing one byte at a time is not particularly efficient.

And while simply having an unknown size got this started, it got me thinking about what "deadline" means... but perhaps just start with "I don't know how many bytes are coming but if there are a bunch of them I'd like to drain the input stream and be able to write a big chunk to the output stream".

sustrik commented 5 years ago

I see, so it boils down to performance. Have you run any performance tests? Like, for example, comparing reading a lot of data byte by byte versus reading it in 256-byte chunks.

jadeblaquiere commented 5 years ago

I wrote a simple benchmark and there did seem to be a significant difference, but in fairness I really need to write a more rigorous test to be confident quantifying the impact. I'll take another whack at that... stay tuned

jadeblaquiere commented 5 years ago

So I cleaned up my benchmark (see: https://github.com/jadeblaquiere/libdill/tree/proxy_bench/benchmark) and here are the results (forwarding 16M from source to sink via the proxy at different buffer sizes):

proxy: 4096 blksz, 4096 blocks, elapsed time 0.044000, cpu time 0.038689
proxy: 2048 blksz, 8192 blocks, elapsed time 0.052000, cpu time 0.051185
proxy: 1024 blksz, 16384 blocks, elapsed time 0.064000, cpu time 0.061592
proxy: 512 blksz, 32768 blocks, elapsed time 0.088000, cpu time 0.090512
proxy: 256 blksz, 65536 blocks, elapsed time 0.196000, cpu time 0.195551
proxy: 128 blksz, 131072 blocks, elapsed time 0.288000, cpu time 0.284203
proxy: 64 blksz, 262144 blocks, elapsed time 0.489000, cpu time 0.487642
proxy: 32 blksz, 524288 blocks, elapsed time 1.110000, cpu time 1.106973
proxy: 16 blksz, 1048576 blocks, elapsed time 2.505000, cpu time 2.489836
proxy: 8 blksz, 2097152 blocks, elapsed time 5.060000, cpu time 5.051713
proxy: 4 blksz, 4194304 blocks, elapsed time 10.001000, cpu time 9.957115
proxy: 2 blksz, 8388608 blocks, elapsed time 19.632000, cpu time 19.560156
proxy: 1 blksz, 16777216 blocks, elapsed time 38.754000, cpu time 38.656531

So, reading (and writing) one byte at a time seems to be pretty inefficient, assuming I got the measurement right. Measuring accurately can be a challenge sometimes.
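
For reference, one portable way to capture both numbers in C (not necessarily how the linked benchmark does it; run_transfer() is a placeholder for the work being measured):

struct timespec t0, t1;
clock_t c0 = clock();                    // CPU time consumed by this process so far
clock_gettime(CLOCK_MONOTONIC, &t0);     // wall-clock start
run_transfer();                          // placeholder for the proxied copy
clock_gettime(CLOCK_MONOTONIC, &t1);
clock_t c1 = clock();
double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
double cpu = (double)(c1 - c0) / CLOCKS_PER_SEC;
printf("elapsed time %f, cpu time %f\n", elapsed, cpu);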

sustrik commented 5 years ago

That's quite a difference!

I've been thinking about how to support this scenario without breaking the main principle behind libdill (no callbacks & no state machines). What it boils down to, I think, is making the "correct" approach easy and the "ugly but performant" approach hard and time-consuming. In other words, I want to incentivize people to use the correct way, and that means making the incorrect way costly.

One solution that comes to mind is allowing the raw socket to be detached from the libdill socket. That way you could do the initial handshakes the standard way, then detach the socket and use the standard POSIX API afterwards.

The function would return the underlying socket as well as any data cached by the intermediate layers:

int tcp_detach(int s, void *rxbuf, size_t rxbuflen, void *txbuf, size_t txbuflen);
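
For illustration, usage of such a (hypothetical, non-existent) call might look like the sketch below; how the cached byte counts get reported back is left open here, and the names are illustrative:

struct ipaddr addr;
ipaddr_remote(&addr, "example.org", 80, 0, -1);
int s = tcp_connect(&addr, -1);                      // handshake via libdill as usual
uint8_t rxbuf[4096], txbuf[4096];
int fd = tcp_detach(s, rxbuf, sizeof(rxbuf), txbuf, sizeof(txbuf));
// From here on, plain POSIX I/O on 'fd' (plus fdin()/fdout() for scheduling).
fdin(fd, -1);
ssize_t n = recv(fd, rxbuf, sizeof(rxbuf), 0);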

That being said, I see some possible problems with the approach:

  1. 3rd party layers (e.g. OpenSSL) may not allow this kind of detachment, which would make the entire thing much less useful.
  2. If we wanted this to be universally usable, each protocol would have to implement it, which sucks for the protocol implementers.

jadeblaquiere commented 5 years ago

To me the big nail in the coffin is number 1. libdill makes layering TLS essentially transparent. If you're back to interacting with the underlying socket you've thrown out much of the goodness.

I would think this also brings the burden of interacting with coroutine scheduling: since you're bypassing the implicit scheduling in the I/O layer, you need to be calling fdin and fdout yourself, right?

sustrik commented 5 years ago

Ok, another option is to provide a third interface (in addition to bsock and msock), something like "partial bytestream". This could be implemented by the TCP and TLS protocols. It wouldn't necessarily be implemented by more complex bytestream protocols.

jadeblaquiere commented 5 years ago

So in this model tcp, tls (and any other protocol which chose to) would present both interfaces, right?

If I understand, something like:

brecv -> returns 0 on success, fails/closes on timeout
precv -> returns LEN on success, returns partial length (>= 0) on timeout (or deadline = 0)

Looking at it a bit, it seems the changes have to happen at the fd_recv layer first... I'm willing to give it a try (or help)...
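
If that shape of API existed, the forwarding loop from the top of this thread could become something like this (precv() is the hypothetical partial-receive call sketched above, not a real libdill function):

coroutine void forward(int in_s, int out_s, int ch) {
    uint8_t buf[4096];
    while (1) {
        // Block until at least one byte arrives (existing brecv semantics).
        if(brecv(in_s, buf, 1, -1)) break;
        // Hypothetical: drain whatever else is already available; a short
        // (even zero-length) result is returned rather than treated as fatal.
        ssize_t n = precv(in_s, buf + 1, sizeof(buf) - 1, 0);
        if(n < 0) break;
        if(bsend(out_s, buf, 1 + n, -1)) break;
    }
    chdone(ch);
}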

sustrik commented 5 years ago

Yes, that's the idea. We could have something like a bbatch interface that would allow partial reads/writes. The functions calling it (bbatchsend, bbatchrecv) could check whether the socket passed to them implements the bbatch interface and, if so, use it. If not, they could still do the same thing in a one-byte-at-a-time way.

Btw, another optimization use case would be "read bytes until a particular character is encountered". This would make protocols like CRLF much faster.

n0ndual commented 5 years ago

tcprecv(s, buf, 1, -1)
tcprecv(s, buf + 1, sizeof(buf) - 1, now())

Originally posted by @sustrik in https://github.com/sustrik/libmill/issues/48#issuecomment-125115721

@sustrik I found a similar issue in the Libmill repo. As you said there, this approach is quite intuitive because it makes the user articulate clearly what they want: how many bytes to wait for unconditionally, a deadline for those, and a deadline for any additional bytes.

So I dug a little into the source code and tried to make brecv act similarly to tcprecv in Libmill.

static int dill_tcp_brecvl(struct dill_bsock_vfs *bvfs,
      struct dill_iolist *first, struct dill_iolist *last, int64_t deadline) {
    struct dill_tcp_conn *self = dill_cont(bvfs, struct dill_tcp_conn, bvfs);
    if(dill_slow(self->rbusy)) {errno = EBUSY; return -1;}
    if(dill_slow(self->indone)) {errno = EPIPE; return -1;}
    if(dill_slow(self->inerr)) {errno = ECONNRESET; return -1;}
    self->rbusy = 1;
    // dill_fd_recv is modified: if the deadline is reached, rc is the count of bytes received.
    int rc = dill_fd_recv(self->fd, &self->rxbuf, first, last, deadline);
    self->rbusy = 0;
    if(dill_fast(rc == 0)) return 0;
    if(errno == EPIPE) self->indone = 1;
    // add this line
    else if(errno == ETIMEDOUT) return rc;
    else self->inerr = 1;
    return -1;
}

Would this work? I learned from your discussion that this approach doesn't fit into the Libdill philosophy. However, is this simple modification safe? Can I use it in my little project?

yami commented 3 years ago

What's the latest progress on this issue? It is common that the size to parse is not known before parsing. For example, if I want to parse memcached's text protocol, it doesn't seem easy (is it even possible?) to do.