ghost opened this issue 3 years ago
This is a fascinating corner of TCP and I appreciate your bringing it to our attention! I spent some time reading through the resources you linked to.
As I understand it, TCP_QUICKACK controls acknowledgement delays, while TCP_NODELAY controls sending delays. Trio can stop using TCP_NODELAY if its peer sets TCP_QUICKACK, but Trio has no way to control that, and indeed most potential peers probably don't set TCP_QUICKACK. If Trio doesn't set TCP_NODELAY, and Trio's peer doesn't set TCP_QUICKACK, then we're opening ourselves up to the situation described in https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-little-about-tcp/ with the long delay. So I don't think it would reduce Trio user frustration on average if Trio were to stop setting TCP_NODELAY, regardless of what we do about TCP_QUICKACK.
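For concreteness, setting the two options on a plain Python socket looks roughly like this (a minimal sketch; the hasattr guard is there because TCP_QUICKACK is only exposed by the socket module on platforms that define it, mainly Linux):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: send small/partial segments immediately
# instead of holding them until the previous segment is ACKed.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Ask the kernel to ACK immediately instead of delaying. The constant is
# only defined where the platform supports it.
if hasattr(socket, "TCP_QUICKACK"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
```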
Which leaves the question: should we be setting QUICKACK in addition to NODELAY? I guess it could improve things in the opposite direction, for peers that don't set NODELAY. Having to reenable it after every receive seems super heavyweight, though. Are there any specific cases that we expect will come up at least semi-frequently where QUICKACK would make a big difference?
> Are there any specific cases that we expect will come up at least semi-frequently where QUICKACK would make a big difference?
I can't think of situations where both NODELAY and QUICKACK would be specifically useful together, but I can think of situations where unconditionally turning on QUICKACK alone would be. For example, websockets could benefit from disabling NODELAY, because control frames (such as ping/pong) are relatively small and relatively small data frames are probably the common case too. However, without QUICKACK this loses a lot of performance whenever any sort of network synchronisation is required. I also can't think of any downside to setting QUICKACK when NODELAY is enabled.
I believe that the cost of a setsockopt call (nanoseconds to microseconds on modern computers) is so much lower than the cost of a delayed ACK (hundreds of milliseconds in the worst case) that it would be an acceptable trade-off.
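As a rough sketch of what that would look like (recv_quickack is a hypothetical helper, not something Trio or the stdlib provides):

```python
import socket

def recv_quickack(sock: socket.socket, bufsize: int) -> bytes:
    """Receive some bytes, then re-enable TCP_QUICKACK.

    Hypothetical helper: the kernel can clear TCP_QUICKACK on its own,
    so it is re-armed after each recv. The extra setsockopt costs on the
    order of microseconds, versus up to hundreds of milliseconds for a
    delayed ACK in the worst case.
    """
    data = sock.recv(bufsize)
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return data
```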
> This is a fascinating corner of TCP and I appreciate your bringing it to our attention! I spent some time reading through the resources you linked to.
Definitely! I'd never even heard of QUICKACK before.
I'm definitely not an expert in all the low-level details of TCP, so please check me on this. But I actually think NODELAY is generally what you want, even if QUICKACK is also enabled? My reasoning:
Nagle's algorithm says that if you have a partial packet ready to send, then don't actually send it until the previous packet is ACKed.
So, say our packet size is N. And imagine a peer that sends a message that's 1.5 * N bytes big, then waits for the other side to process that message and send a response. (For example, a "message" here might be a websocket frame, or a TLS frame, or an HTTP request.)
If Nagle's algorithm is enabled, then we send the first packet, wait for it to be ACKed, and then send the second half-full packet. Then once the other side has received the full message, it can process it and send back its response. The whole process takes 2 round trip times (one for the ACK + one for the response), plus however long the other side waits before sending the ACK.
The problem that Nagle talks about in the ycombinator post is that if you have delayed ACKs enabled, then that last component becomes large, which is obviously bad for our overall latency. Setting QUICKACK makes the ACK sending delay zero, so our whole process only takes 2 round trip times.
However, if we set NODELAY, then both packets are sent immediately, and the other side gets the whole message at once, and can reply immediately. So now it only takes 1 round trip time.
On a low-latency connection like localhost or in a data center, the ACK delay is much larger than the round trip time, so it doesn't really matter whether you do 1 or 2 round trips, but it really matters that you don't wait for the ACK.
OTOH, on a high-latency connection that goes over the internet, round-trip times can easily dominate application performance. So going from 2 round trips to 1 round trip is a huge deal, and you definitely want NODELAY, regardless of whether QUICKACK is set or not.
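A back-of-the-envelope version of that comparison, with purely illustrative RTT and delayed-ACK numbers:

```python
# Illustrative model of the 1.5-packet request/response scenario above.
# With NODELAY both packets go out at once (1 RTT); otherwise the second
# packet waits for the first ACK, which is delayed unless the peer uses
# QUICKACK.
def request_latency(rtt, ack_delay, nodelay, peer_quickack):
    if nodelay:
        return rtt
    return 2 * rtt + (0 if peer_quickack else ack_delay)

ACK_DELAY = 0.200  # assumed worst-case delayed-ACK timer, in seconds

for name, rtt in [("localhost-ish", 0.0005), ("cross-internet", 0.050)]:
    print(name,
          "NODELAY:", request_latency(rtt, ACK_DELAY, True, False),
          "Nagle+QUICKACK:", request_latency(rtt, ACK_DELAY, False, True),
          "Nagle+delayed ACK:", request_latency(rtt, ACK_DELAY, False, False))
```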
OTOH, if NODELAY is set, then I think QUICKACK doesn't matter too much? In this specific situation, I think it actually makes things slightly more efficient. With QUICKACK enabled, the other side has to ACK the first request packet, then ACK the second request packet, then send more packets with the response. With delayed ACK enabled, then the first two ACKs get delayed, and then when we send the response the ACKs can piggy back on that for free. Probably not a huge deal either way in practice, but the point is QUICKACK isn't really helping any.
So NODELAY seems to be helpful in at least some situations. But everything has tradeoffs, so what are the downsides? The main one is that it forces application code to buffer up complete messages in userspace and pass them to the kernel in a single chunk, instead of calling write() a bunch of times to compose a single message. Nagle is more forgiving of sloppy application code. But OTOH:
So that's why Trio's main TCP interfaces unconditionally enable NODELAY. (Though we do still give the option of dropping down to the raw socket layer and then you can setsockopt whatever you want.)
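For instance, something like this sketch (the host/port are placeholders, and enabling TCP_QUICKACK here is just to show that arbitrary options can be set on the stream's socket):

```python
import socket
import trio

async def main():
    # Trio's open_tcp_stream already enables TCP_NODELAY on this socket;
    # other options can still be set by hand on the stream.
    stream = await trio.open_tcp_stream("example.com", 443)
    if hasattr(socket, "TCP_QUICKACK"):  # Linux-specific option
        stream.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    await stream.aclose()

trio.run(main)
```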
I suspect this is also why most of the networking libraries I've looked at enable NODELAY by default, and none of them enable QUICKACK, and why OSes haven't put the effort into making QUICKACK widely available or easy to use? But that's just a guess.
TCP_QUICKACK disables delayed acknowledgements (one of the more problematic parts of the implementation of Nagle's algorithm) but keeps small-packet buffering, whereas TCP_NODELAY disables both (reducing throughput on small writes). As put by Nagle himself:
Unfortunately, when you search online you may find claims that TCP_QUICKACK is Linux-only. This is not true! Doing a setsockopt with option 12 (the same define as on Linux) under Windows works (although whether it actually does anything there, I don't know), at least on Windows 10. Additionally, it's seemingly not needed on OS X at all ("on my OSX MacBook Air however the RPC call needed only 3ms!"). However, a second issue arises: TCP_QUICKACK can turn itself off. The solution to this is seemingly to turn it back on after every recv call.
See also: https://github.com/urllib3/urllib3/issues/746, and this RFC.