Sometimes it's important to check for socket writeability before trying to write

njsmith commented 7 years ago

I recently discovered that Linux/OS X provide an important API (TCP_NOTSENT_LOWAT) that lets applications avoid queuing up excessive data inside the kernel's socket send buffers. (The socket send buffers are generally too big, for various reasons.) Unfortunately, it turns out that this API works by controlling when a socket is marked writeable by select and friends, but does not affect whether a send call will succeed, so while you might think these are the same thing they actually aren't. [Edit: it turns out that this description is actually incorrect on Linux, though probably true on macOS -- see] I initially filed a bug on curio about this because curio was assuming they were the same, so I won't repeat all the details: https://github.com/dabeaz/curio/issues/83
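For readers who have not used the option, here is a minimal sketch of enabling it from Python. It assumes that the `socket` module exposes `TCP_NOTSENT_LOWAT` on the running platform (it may not, so the lookup is defensive). The helper name `set_notsent_lowat` and the 16 KiB threshold are illustrative choices, not values taken from this thread.

```python
import socket

# The constant is only exposed on some platforms and Python versions,
# so look it up defensively instead of assuming it exists.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", None)


def set_notsent_lowat(sock, threshold=16 * 1024):
    """Try to set TCP_NOTSENT_LOWAT; return True if the kernel accepted it.

    `threshold` is the number of not-yet-sent bytes below which the socket
    is reported writeable by select/poll/epoll/kqueue. The default here is
    an arbitrary illustrative value.
    """
    if TCP_NOTSENT_LOWAT is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, threshold)
    except OSError:
        # The platform knows the constant but rejected the call.
        return False
    return True
```

As the post describes, setting the option only changes when the socket is reported writeable, so an application benefits only if it actually waits for writeability before sending.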

@dabeaz points out that asyncio seems to make the same invalid optimization, so filing a bug here too.

njsmith commented 7 years ago

On further discussion (see the curio issue), it sounds like the tentative conclusion is:

- There should be some way to specifically wait for a socket / StreamWriter to become writeable. (rationale: https://github.com/dabeaz/curio/issues/83#issuecomment-254103790)
- For bonus points, it actually probably makes sense to enable TCP_NOTSENT_LOWAT on TCP sockets by default to get better buffering behavior.

gvanrossum commented 7 years ago

At the lowest level in asyncio (i.e. if you have a socket), if you just start sending with loop.sock_sendall(), you will indeed be hit by this if the optimization misfires. But an app can work around this using loop.add_writer().
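The `loop.add_writer()` workaround might look roughly like the sketch below. The helper names `wait_writable` and `send_when_writable` are hypothetical and not part of asyncio. The sketch assumes a non-blocking socket and a selector-based event loop, since `add_writer()` is not available on every loop implementation.

```python
import asyncio
import socket


async def wait_writable(sock):
    """Suspend until the event loop reports `sock` as writeable."""
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    fd = sock.fileno()

    def on_writable():
        # The callback may fire more than once before it is removed.
        if not fut.done():
            fut.set_result(None)

    loop.add_writer(fd, on_writable)
    try:
        await fut
    finally:
        loop.remove_writer(fd)


async def send_when_writable(sock, data):
    """Wait for writeability first, then hand the data to sock_sendall()."""
    await wait_writable(sock)
    await asyncio.get_running_loop().sock_sendall(sock, data)
```

Only the start of the send is gated here. Once `sock_sendall()` is running, it keeps writing until all of `data` has been accepted by the kernel.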

At the next level you have a Protocol/Transport pair, which has a synchronous write() method that contains this optimization. That's not so easy to work around at the app side, but there is a Protocol API that could be used for this: pause_writing()/resume_writing(). We can probably change the default transport implementation so that it uses these more aggressively, without API changes.
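On the application side, the pause_writing()/resume_writing() hooks can be used along the lines of the sketch below. The class `FlowControlledProtocol` and its method `write_when_ready` are hypothetical names for illustration. The sketch shows only how a protocol can honour the callbacks, not the proposed change to the default transport, and it assumes the protocol object is created while the event loop is running.

```python
import asyncio


class FlowControlledProtocol(asyncio.Protocol):
    """Hypothetical protocol that records the transport's pause/resume signals."""

    def __init__(self):
        self.transport = None
        # Set while the transport is willing to accept more data.
        self._can_write = asyncio.Event()
        self._can_write.set()

    def connection_made(self, transport):
        self.transport = transport

    def pause_writing(self):
        # Called by the transport when its write buffer passes the high-water mark.
        self._can_write.clear()

    def resume_writing(self):
        # Called by the transport when its buffer drains below the low-water mark.
        self._can_write.set()

    async def write_when_ready(self, data):
        """Wait until the transport is not paused, then write."""
        await self._can_write.wait()
        self.transport.write(data)
```

How often the transport calls these hooks depends on its write buffer limits, which an application can adjust with `transport.set_write_buffer_limits()`.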

asyncio streams are built on top of Protocol/Transport pairs, so at that level we should be able to benefit from whatever we do for the previous level.

@glyph has this reached Twisted yet?

PS Jim Gettys has been complaining about this for years. Glad something's finally being done about it. And @njsmith, thanks for the clear explanations!

gjcarneiro commented 7 years ago

I guess this is trying to address the buffer bloat problem?...

njsmith commented 7 years ago

@gjcarneiro: bufferbloat is a many-headed hydra, but yeah, this is about bufferbloat in the context of per-socket send buffers specifically. The discussion thread on the curio issue has lots more details.

glyph commented 7 years ago

> @glyph has this reached Twisted yet?

Not TCP_NOTSENT_LOWAT, no. I'm sort of curious how our producer/consumer API interacts with this detail; I have a feeling it'll behave correctly, but I'm not entirely sure.

However, in the process of investigating this, I learned that we apparently removed the eager-write optimization many years ago:

https://github.com/twisted/twisted/commit/c75d1eb93c914f1f95567a76e1ba6c0166a7eee3

Digging into the history and viewing some of the discussion around that time, it seems that we were aware that it punished us pretty brutally on certain micro-benchmarks, but there's no realistic benchmark we could find where it impacts performance significantly. @dabeaz points out over on the other ticket that it's a massive performance penalty to an echo-server benchmark, and that's true; however, echo is not a realistic application.

If you want to do anything interesting you need to talk to at least one other back-end service, which means that you need to carefully manage the relationship between two transports, which means you need a producer/consumer hookup. Once you have that, you can't really get the meat of the optimization that eager-writes give you, which is the ability to avoid the extra select/epoll/kqueue(etc) syscall between recv and send, since you need to go back to the main loop to see if it's time to read again between each packet anyway.

It also does punish the writer on benchmarks where you are synthesizing data on the CPU rather than getting it or processing it from a different remote source, but /dev/urandom as a service also has pretty limited utility.

That said, I don't think Twisted is a great model to look towards for good support for tunables; tuning has historically been a weak point for us, because users who have significant performance demands almost always end up fixing them by making scaling up and down easier rather than optimizing throughput. Also, the only application where this sort of tuning tends to make any difference is something that is just shuttling around huge volumes of data without really processing it, and if you're doing that you're more likely to use HAProxy or something.

That said, I really appreciate learning about this nuance of send on Linux. Hopefully at some point in the coming year we're going to do an overhaul of how we deal with tunable transport parameters (mostly focused on the more-portable SO_SNDBUF and SO_RCVBUF than this platform-specific detail) and it'll be good to keep it in mind for that.

Lukasa commented 7 years ago

I should note that I have an interest in adding support for TCP_NOTSENT_LOWAT into Twisted because it's highly valuable for HTTP/2, where keeping send buffers small prevents control frames from getting blocked behind buffered stream data. That means that support for APIs of that kind is likely to be something asyncio will want to provide as well.

However, I disagree with @njsmith's assertion that asyncio just wants to start using it by default. In particular, for bulk unframed data transfers where throughput is more important than reactivity, applications will want to avoid spinning up the Python event loop wherever possible: for that reason, large writes are ideal and using TCP_NOTSENT_LOWAT with a bad value will have nasty negative performance impacts. The biggest case of this is for protocols like FTP and HTTP/1.1, particularly when sendfile is not available to the application, where we want to free the event loop up to do other things rather than repeatedly send smallish writes into the kernel.

In the worst case of a 100% CPU-utilisation event loop, aggressively low values of TCP_NOTSENT_LOWAT can lead to pauses in data transfer because the event loop isn't able to respond to the POLLOUT event before the kernel send buffer empties entirely.

It is much better for asyncio to expose this kind of tuneable rather than opt into it by default. Let application developers decide what the performance characteristics of their protocols should be.

njsmith commented 7 years ago

Ah, but that can be handled by the library too. On OS X, the splitting of large writes isn't an issue at all, since TCP_NOTSENT_LOWAT only affects select-and-friends, not send-and-friends. And in Linux, you can achieve the same effect by having your send routine do: (1) turn off TCP_NOTSENT_LOWAT, (2) call send, (3) turn it on again. The basic intuition here is that you want to let the send buffer drain before signaling writeability to avoid standing buffers, but once the application has committed to sending a large chunk of data, you want to hand that off to the kernel as quickly as possible, even if that does temporarily create a large buffer.

I agree that the actual TCP_NOTSENT_LOWAT value should be tuneable, and that this is a somewhat experimental proposal. But theoretically at least it seems like there are some pretty compelling arguments that the best default value for TCP_NOTSENT_LOWAT is smaller than the "infinity" we currently default to.
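The three-step Linux send routine described above could be sketched as follows. The function name `send_ignoring_lowat` is hypothetical. The sketch also assumes that setting the option to 0 makes the kernel fall back to the system-wide default, which is the "turned off" state wanted here; that value is exposed as the `off_value` parameter in case the assumption does not hold on a given platform.

```python
import socket

# Only exposed on some platforms and Python versions.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", None)


def send_ignoring_lowat(sock, data, lowat, off_value=0):
    """Send `data` with TCP_NOTSENT_LOWAT temporarily disabled.

    `lowat` is the threshold to restore afterwards. `off_value` is the value
    assumed to disable the per-socket limit (an assumption, see the lead-in).
    Returns the number of bytes accepted by send().
    """
    if TCP_NOTSENT_LOWAT is None:
        # Option unavailable: nothing to toggle, just send.
        return sock.send(data)

    # (1) turn the limit off so send() can fill the whole send buffer
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, off_value)
    try:
        # (2) hand the data to the kernel
        return sock.send(data)
    finally:
        # (3) turn the limit back on so writeability is signalled late again
        sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, lowat)
```

This costs two extra system calls per send, so whether it pays off would need measuring in a real workload.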


python / asyncio

Sometimes it's important to check for socket writeability before trying to write #446