So intuitively, the idea is that sending (partial) window updates should compensate to some degree for senders having to pause due to transmission delays, is that right? Since if the transmission delay is negligible, the sender pretty much gets the window update as soon as the current window is exhausted (assuming the reader can consume data at the same pace as it is sent and received). What exactly do you have in mind in terms of implementation? My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0. I think it would apply equally to both of the existing `WindowUpdateMode`s. In any case, curious to see what your benchmarks show, especially if you manage to simulate different network latencies. I couldn't find any particular mention of the reasoning behind the threshold choice in nghttp2, so I just assume they made that decision based on their own benchmarks.
> So intuitively, the idea is that sending (partial) window updates should compensate to some degree for senders having to pause due to transmission delays, is that right?
Correct.
> What exactly do you have in mind in terms of implementation?
Thus far I have only tested the naive approach of sending the WindowUpdate after half of the window has been exhausted, hoping for the WindowUpdate to arrive at the sender before the sender exhausts the other half. I have not yet measured the overhead of potentially sending twice as many WindowUpdate messages back to the sender.
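To make that rule concrete, here is a minimal sketch of the naive threshold check; the names and surrounding bookkeeping are hypothetical and not taken from the actual yamux-rs internals:

```rust
/// Minimal sketch: decide whether a WindowUpdate is due, given the credit the
/// sender still has (`window_remaining`) and the configured window size.
fn window_update_due(window_remaining: u32, max_window: u32) -> bool {
    // Half or more of the window has been used by the sender: send the
    // WindowUpdate now instead of waiting for the credit to reach 0.
    window_remaining <= max_window / 2
}

fn main() {
    let max_window = 256 * 1024;
    assert!(!window_update_due(200 * 1024, max_window)); // plenty of credit left
    assert!(window_update_due(100 * 1024, max_window)); // half consumed: update now
}
```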
> My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0.
Off the top of my head I am reluctant to introduce yet another tunable. I find today's configuration surface, while small, already difficult to get right. That said, I have not yet explored the impact different numeric factors would have.
> I think it would apply equally to both of the existing `WindowUpdateMode`s.
Thanks for raising this. Thus far I was under the impression that with `OnReceive` a WindowUpdate would be sent on each new frame. Instead, today this happens on window exhaustion as well.
> In any case, curious to see what your benchmarks show, especially if you manage to simulate different network latencies.
I haven't done enough testing, but I can already share a specific use-case which looks promising. Using the altered benchmarks from https://github.com/paritytech/yamux/pull/102 with an adsl2+ connection (20 Mbit/s, 20 ms RTT) and each message being 4096 bytes, I end up with:
```
$ critcmp on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half
----- ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.00s 1700.9 KB/sec 1.00 1670.6±8.29ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.01s 2.3 MB/sec 1.00 16.6±0.01s 2.3 MB/sec
```
As one would expect, sending an early window update while concurrently using 10 streams does not have an impact on throughput: whenever one stream has exhausted its window, another stream can likely use the bandwidth of the underlying link.

Using only a single stream, one sees an increase from roughly 1.7 MB/sec to 2.3 MB/sec, thus a 35 % increase in bandwidth. Note that by sending WindowUpdate messages early, a single stream achieves the same bandwidth as 10 streams do, whether those send WindowUpdate messages early or on 0.
For the sake of completeness, here are the numbers for all network types, again comparing WindowUpdate messages sent at window exhaustion (0) vs WindowUpdate messages sent on exhaustion of half the window:
```
critcmp on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half
----- ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.03 77.5±0.41ms 50.4 MB/sec 1.00 74.9±0.82ms 52.1 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec
concurrent/mobile/#streams1/#messages1000 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec
```
The characteristics described in the previous comment show up once again for the remaining network types, except for the `unconstrained` network type. As one would expect, the `unconstrained` network type is slowed down by the increased number of WindowUpdate messages.
Numbers presented above use `WindowUpdateMode::OnRead`. As Roman pointed out earlier, sending WindowUpdate messages early can also be done when using `WindowUpdateMode::OnReceive`. I adjusted my (yet to be published) implementation to send early WindowUpdate messages in both modes. Below are the benchmark outputs (scroll to the right):
```
critcmp on-receive-on-0 on-receive-on-half on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half on-receive-on-0 on-receive-on-half
----- ------------ --------------- --------------- ------------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec 1.41 2.3±0.00s 1702.6 KB/sec 1.00 1669.0±1.24ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec 1.00 16.6±0.01s 2.3 MB/sec 1.00 16.7±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.03 77.5±0.41ms 50.4 MB/sec 1.00 74.9±0.82ms 52.1 MB/sec 1.10 82.5±2.40ms 47.3 MB/sec 1.01 75.7±0.96ms 51.6 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec 1.03 765.2±12.58ms 51.0 MB/sec 1.03 766.7±17.11ms 51.0 MB/sec
concurrent/mobile/#streams1/#messages1000 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec 1.27 7.0±0.01s 567.8 KB/sec 1.00 5.5±0.01s 721.7 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec 1.00 55.1±0.02s 725.7 KB/sec 1.00 55.2±0.07s 725.1 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec 1.16 10.9±0.30ms 360.0 MB/sec 1.20 11.3±0.28ms 347.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec 1.17 116.7±3.46ms 334.9 MB/sec 1.17 116.7±2.87ms 334.8 MB/sec
```
The same characteristics described above for `WindowUpdateMode::OnRead` seem to apply to `WindowUpdateMode::OnReceive` as well.
https://github.com/libp2p/rust-libp2p/issues/1849 describes a related performance optimization proposal. As a receiver, when expecting to receive a large message (say 10 MB), one could increase the receive window up-front, allowing the sender to send the entire large message in one go instead of sending chunks bounded by the default receive window.
The below benchmarks use a message size of 10MB following the example in https://github.com/libp2p/rust-libp2p/issues/1849.
First off, let's look at the performance of Yamux based on the current `develop` branch. The receive window is left unchanged (256 KB), WindowUpdate messages are sent on window exhaustion, and `WindowUpdateMode::OnRead` is used.
```
critcmp on-read-on-0-with-10MB-msg
group on-read-on-0-with-10MB-msg
----- --------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 6.1±0.03s 1684.8 KB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.6±0.14s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 270.6±27.16ms 36.9 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.8±0.07s 36.1 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 16.0±0.00s 638.3 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.24s 727.2 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 3.8±0.10ms 2.6 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 33.3±2.46ms 2.9 GB/sec
```
As already noted in previous comments, the current implementation on `develop` suffers decreased bandwidth when using a single stream, given that the sender side is blocked every 256 KB for a full round-trip (see adsl2+: 1684.8 KB/sec vs 2.3 MB/sec), waiting for the receiver to send a WindowUpdate message.
Running the same benchmark once more with a receive window set to the message size (10 MB), as suggested in https://github.com/libp2p/rust-libp2p/issues/1849, the above-mentioned decrease in bandwidth when using a single stream is gone (see adsl2+: 2.3 MB/sec == 2.3 MB/sec).
```
critcmp on-read-on-0-with-10MB-msg-with-10MB-receive-window
group on-read-on-0-with-10MB-msg-with-10MB-receive-window
----- ---------------------------------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 4.3±0.02s 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.7±0.06s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 272.6±14.40ms 36.7 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.6±0.19s 38.9 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 14.2±0.01s 722.8 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.9±0.22s 726.7 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 10.6±0.57ms 945.6 MB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 54.9±3.17ms 1820.7 MB/sec
```
How would the proposal described in this issue cope with a large message size (10MB), sending WindowUpdate messages early but leaving the receive window at the default value (256 KB)? Off the top of my head, I expected similar numbers to the optimization suggested in https://github.com/libp2p/rust-libp2p/issues/1849 given that with early WindowUpdate messages the sender should never be blocked for a whole round-trip waiting for a WindowUpdate message.
```
critcmp on-read-on-half-with-10MB-msg
group on-read-on-half-with-10MB-msg
----- -----------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 6.1±0.02s 1684.7 KB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.5±0.09s 2.4 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 272.7±23.24ms 36.7 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.7±0.09s 36.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 16.1±0.03s 636.6 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.11s 727.5 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 4.0±0.06ms 2.5 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 33.8±1.11ms 2.9 GB/sec
```
My assumption turns out to be wrong. In the single-stream case one achieves a bandwidth similar to the first benchmark using plain `develop`: no early WindowUpdate messages, no increased receive window. Why is that the case? Shouldn't the sender be able to continuously send frames, given that the receiver sends WindowUpdate messages before the sender exhausts its window?
The answer is simple: when sending a data frame, Yamux chooses the smaller of (1) the available credit (window) and (2) the available data to be sent as the size of the next frame. In the large message (10 MB) example each frame will thus have the size of the available credit. The receiver considers sending a new WindowUpdate message before each received frame. Given that each frame exhausts the entire window, the receiver never sends early WindowUpdate messages but instead sends them once the window is 0 (after each frame). The sender ends up being blocked for a whole round-trip waiting for the receiver to send a WindowUpdate message, just like the Yamux version on the current `develop` branch without special configuration.
As mentioned before, Yamux is inspired by HTTP/2, so how does HTTP/2 solve this issue?
Technically, the length field allows payloads of up to 2^24 bytes (~16 MB) per frame. However, the HTTP/2 standard sets the default maximum payload size of DATA frames to 2^14 bytes (~16 KB) per frame and allows the client and server to negotiate a higher value. Bigger is not always better: a smaller frame size enables efficient multiplexing and minimizes head-of-line blocking.

> Sending large frames can result in delays in sending time-sensitive frames (such as RST_STREAM, WINDOW_UPDATE, or PRIORITY), which, if blocked by the transmission of a large frame, could affect performance.

https://tools.ietf.org/html/rfc7540#section-4.2
To achieve a bandwidth similar to the proposal in https://github.com/libp2p/rust-libp2p/issues/1849, how about sending only small data frames, as the HTTP/2 specification suggests, allowing the receiver to send early WindowUpdate messages in between frames? Below is the output of a benchmark run sending early WindowUpdate messages, using the default receive window (256 KB), and limiting the size of data frames to half of the default receive window (256 KB / 2 = 128 KB). Sending multiple small frames instead of one large frame introduces an overhead due to the additional Yamux headers. A Yamux header is 12 bytes large, so adding an additional header for every 128 KB of payload is a negligible overhead (12 bytes header / (256 * 1024 default receive window / 2) = 0.000091553).
```
critcmp on-read-on-half-with-10MB-msg-split-frames
group on-read-on-half-with-10MB-msg-split-frames
----- ------------------------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 4.3±0.02s 2.4 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.4±0.04s 2.4 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 240.6±14.98ms 41.6 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.7±0.08s 37.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 14.1±0.01s 727.5 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.6±0.08s 728.3 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 4.1±0.11ms 2.4 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 37.5±0.83ms 2.6 GB/sec
```
The benchmark shows that with (1) early WindowUpdate messages and (2) data frames restricted in size, one achieves, in the single-stream case and across the different network types, the same bandwidth as the proposal described in https://github.com/libp2p/rust-libp2p/issues/1849, all without the need to configure the window size.
(Note: It might still be worth increasing the window size when operating on top of a network with a high bandwidth-delay product (_long fat network_). As a first guess I would expect Yamux to be operated mostly on networks with a bandwidth-delay product below the default window size (256 KB) (see examples), though I would need to put more thought and benchmarking into this to form a solid opinion.)
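For intuition, here is a back-of-the-envelope bandwidth-delay-product check using only the adsl2+ numbers quoted earlier in this thread (20 Mbit/s, 20 ms RTT); it is an illustration, not a measurement:

```rust
fn main() {
    // adsl2+ profile from the benchmarks above: 20 Mbit/s, 20 ms RTT.
    let bandwidth_bits_per_sec = 20_000_000.0;
    let rtt_secs = 0.020;
    let bdp_bytes = bandwidth_bits_per_sec * rtt_secs / 8.0;
    // Roughly 50 KB in flight, comfortably below the 256 KB default receive window.
    println!("BDP ≈ {:.0} KiB", bdp_bytes / 1024.0);
}
```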
Above @romanb started the discussion on when to send an early WindowUpdate message.
> What exactly do you have in mind in terms of implementation? My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0.
Thus far I have only experimented with the naive approach of sending a WindowUpdate message after half or more of the window has been used. How does sending WindowUpdate messages earlier or later than that influence bandwidth?
First off, to establish some groundwork, the below benchmark output compares what we already have today (`on-read-on-0`) with the case of having an infinite window (`on-read-max-window`) and sending the WindowUpdate message after half the window has been consumed (`on-read-on-half`).
```
critcmp on-read-max-window on-read-on-half on-read-on-0
group on-read-max-window on-read-on-0 on-read-on-half
----- ------------------ ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.03 1723.4±2.91ms 2.3 MB/sec 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.00 74.5±0.24ms 52.4 MB/sec 1.04 77.5±0.41ms 50.4 MB/sec 1.01 74.9±0.82ms 52.1 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 746.4±5.38ms 52.3 MB/sec 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec
concurrent/mobile/#streams1/#messages1000 1.02 5.6±0.01s 708.3 KB/sec 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.3±0.04s 723.3 KB/sec 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.13 10.6±0.76ms 369.7 MB/sec 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.28 128.2±6.17ms 304.6 MB/sec 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec
```
As you can see, sending the WindowUpdate message on 0 (`on-read-on-0`) performs worst (e.g. adsl2+ at 1695.6 KB/sec). Having an infinite window (`on-read-max-window`), which I would expect to achieve the maximum bandwidth, performs better on every network type (e.g. adsl2+ at 2.3 MB/sec). The last strategy, the one proposed in this GitHub issue, sending WindowUpdate messages after half of the window has been consumed, performs equal to or better than having an infinite window (`on-read-max-window`) (e.g. adsl2+ at 2.3 MB/sec).
How do other strategies behave? Below is the benchmark output for sending the WindowUpdate message after 1/4 (`on-read-on-quarter`) as well as after 3/4 (`on-read-on-three-quarter`) of the window has been consumed.
```
critcmp on-read-on-quarter on-read-on-three-quarter
group on-read-on-quarter on-read-on-three-quarter
----- ------------------ ------------------------
concurrent/adsl2+/#streams1/#messages1000 1.00 1668.8±0.65ms 2.3 MB/sec 1.26 2.1±0.01s 1896.1 KB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.02s 2.3 MB/sec 1.00 16.7±0.04s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.00 76.4±0.93ms 51.1 MB/sec 1.00 76.2±1.47ms 51.3 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.01 757.3±8.36ms 51.6 MB/sec 1.00 749.7±10.77ms 52.1 MB/sec
concurrent/mobile/#streams1/#messages1000 1.00 5.5±0.00s 723.2 KB/sec 1.01 5.6±0.00s 716.6 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.1±0.02s 725.8 KB/sec 1.00 55.2±0.08s 724.6 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.02 10.5±0.44ms 371.4 MB/sec 1.00 10.3±0.36ms 379.7 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 109.4±3.42ms 357.0 MB/sec
```
Sending the WindowUpdate message after 1/4 of the window has been consumed (`on-read-on-quarter`) seems to be roughly equal in bandwidth to sending the WindowUpdate message on half (`on-read-on-half`). Sending WindowUpdate messages later, after 3/4 of the window has been consumed, shows a degradation in bandwidth for the `adsl2+` network type, but similar results for all others.

For now I will stick with sending the WindowUpdate message after half of the window has been consumed. I am happy to benchmark other alternative strategies though.
Following up once more on https://github.com/paritytech/yamux/issues/100#issuecomment-774461550, more specifically on limiting the maximum size of the payload in a Yamux data frame.
The benchmarks below are run on top of https://github.com/paritytech/yamux/pull/109 using a message size of 10 MiB.
- `on-read-on-half-10MB-msg` does not restrict the payload size, thus defaulting to the maximum window size of 256 KiB.
- `on-read-on-half-10MB-msg-split-128kb` limits the payload size to 128 KiB (half of the default window size). Overhead 0.000091553.
- `on-read-on-half-10MB-msg-split-16kb` limits the payload size to 16 KiB (same as HTTP/2, see spec). Overhead 0.000732422.
- `on-read-on-half-10MB-msg-split-1kb` limits the payload size to 1 KiB. Overhead 0.01171875.

```
critcmp on-read-on-half-10MB-msg on-read-on-half-10MB-msg-split-128kb on-read-on-half-10MB-msg-split-16kb on-read-on-half-10MB-msg-split-1kb
group on-read-on-half-10MB-msg on-read-on-half-10MB-msg-split-128kb on-read-on-half-10MB-msg-split-16kb on-read-on-half-10MB-msg-split-1kb
----- ------------------------ ------------------------------------ ----------------------------------- ----------------------------------
concurrent/adsl2+/#streams1/#messages1 1.42 6.0±0.01s 1696.0 KB/sec 1.00 4.2±0.01s 2.4 MB/sec 1.00 4.3±0.01s 2.3 MB/sec 1.02 4.3±0.01s 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.4±0.01s 2.4 MB/sec 1.00 42.4±0.05s 2.4 MB/sec 1.00 42.5±0.07s 2.4 MB/sec 1.01 42.9±0.03s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.81 246.6±6.43ms 40.5 MB/sec 1.99 270.7±4.51ms 36.9 MB/sec 1.49 202.7±18.99ms 49.3 MB/sec 1.00 136.2±1.88ms 73.4 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.64 2.3±0.05s 43.1 MB/sec 1.88 2.7±0.10s 37.6 MB/sec 1.34 1902.0±16.45ms 52.6 MB/sec 1.00 1417.7±35.28ms 70.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.14 16.1±0.02s 637.7 KB/sec 1.00 14.1±0.00s 728.1 KB/sec 1.00 14.1±0.02s 727.1 KB/sec 1.01 14.2±0.00s 719.9 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.23s 727.2 KB/sec 1.00 140.7±0.21s 727.8 KB/sec 1.00 140.9±0.18s 726.6 KB/sec 1.01 142.3±0.11s 719.4 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.46 6.4±0.39ms 1555.8 MB/sec 1.00 4.4±0.24ms 2.2 GB/sec 1.83 8.1±0.41ms 1240.2 MB/sec 22.57 99.5±7.80ms 100.5 MB/sec
concurrent/unconstrained/#streams10/#messages1 1.61 69.6±5.01ms 1437.0 MB/sec 1.00 43.2±5.82ms 2.3 GB/sec 2.50 107.9±5.05ms 927.2 MB/sec 24.99 1078.9±165.42ms 92.7 MB/sec
```
The benchmarks do not give a clear winner. I am leaning towards simply using 16 KiB, given that it outperforms the status quo in each run (except for the unconstrained network type), introduces low overhead, and is in line with the HTTP/2 spec (though I could not yet find the reasoning for choosing 16 KiB in HTTP/2).
To add some data to the above from a more realistic environment, I have set up a virtual server in Helsinki. On this server I deployed:

- `WindowUpdateMode::OnRead`.
- `WindowUpdateMode::OnReceive`.

First off, to establish a baseline, here is the output of an iperf run:
```
iperf -c xxx.xxxx.xxx.xxx
------------------------------------------------------------
Client connecting to xxx.xxx.xxx.xxx, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.107 port 56634 connected with xxx.xxx.xxx.xxx port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.9 sec  14.5 MBytes  11.2 Mbits/sec
```
Running option (1) (current develop branch):
```
./client-old --server-address /ip4/xxx.xxx.xxx.xxx/tcp/9992
Interval        Transfer        Bandwidth
0 s - 10.26 s   9 MBytes        7.01 MBit/s
```
Running option (2) (early window updates, small frames):
```
./client-new --server-address /ip4/xxx.xxx.xxx.xxx/tcp/9992
Interval        Transfer        Bandwidth
0 s - 10.04 s   11 MBytes       8.76 MBit/s
```
I would boldly claim that this whole effort does get us a bit closer to what a plain TCP connection can do :tada:.
## Flow control in Yamux and HTTP/2
The Yamux flow control mechanism is very similar to HTTP/2's flow control. This is no surprise, given that Yamux is inspired by the early SPDY efforts.
In both Yamux and HTTP/2 the WindowUpdate message is an integral part of the flow control mechanism.
https://github.com/hashicorp/yamux/blob/master/spec.md#flow-control
https://tools.ietf.org/html/rfc7540#section-5.2.1
In HTTP/2 it is up to the receiver when to send a WindowUpdate message. If I understand the short Yamux specification correctly, the same applies to Yamux.
See https://tools.ietf.org/html/rfc7540#section-5.2.1
For a general overview of HTTP/2 flow control I can recommend "HTTP/2 in Action" [1]. Chapter 7.2 on the topic can be (pre-)viewed on the publisher's website.
## WindowUpdate message strategies
HTTP/2 implementations can use WindowUpdate messages to implement various (possibly advanced) flow control strategies. One example of a simple strategy is the nghttp2 library, which sends a WindowUpdate message once it (the receiver) has received and consumed more than half of the flow control window.
Today, with `WindowUpdateMode::OnRead`, this Yamux implementation sends a WindowUpdate message once (a) the window credit has been fully depleted and (b) the read buffer is empty, thus all bytes have been consumed. See the implementation for details.

## Comparison
Imagine the following simplified scenario:
A sender S is communicating with a receiver R. S wants to send 1 MB in multiple chunks to R. R uses a receive window of 256 KB (Yamux default). S and R are connected on some network inducing both a delay and a bandwidth constraint.
Algorithm 1 (Yamux today): Send a WindowUpdate message once (a) the window is fully depleted by the sender and (b) the receiver has consumed all buffered bytes.

Once S has depleted its window, having sent 256 KB to R, S is blocked and has to wait for a WindowUpdate message to be sent by R. This message is only sent by R once 256 KB have been received and all of those bytes are consumed. Thus every 256 KB the sender is blocked, having to wait for a whole round-trip (for all its data to arrive as well as for the WindowUpdate to be received).
Algorithm 2 (nghttp2): Send a WindowUpdate message once half or more of the window has been received and consumed.

While S sends the first 256 KB of data, R receives and consumes chunks of that data in parallel. Instead of waiting for the window to be fully depleted by S and fully consumed by R, R already sends a small WindowUpdate message (crediting the number of bytes consumed of the current window) once half or more of the window has been depleted and consumed. This WindowUpdate will likely arrive at S before S depletes the entire window, and thus S never stalls.
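A minimal sketch of Algorithm 2 on the receiver side, using a hypothetical bookkeeping struct (this is not the actual yamux-rs or nghttp2 implementation):

```rust
struct ReceiveWindow {
    max_window: u32,
    // Bytes received and handed to the application since the last WindowUpdate.
    consumed: u32,
}

impl ReceiveWindow {
    /// Returns the credit to send back in a WindowUpdate, if one is due now.
    fn on_consumed(&mut self, n: u32) -> Option<u32> {
        self.consumed += n;
        if self.consumed >= self.max_window / 2 {
            let credit = self.consumed;
            self.consumed = 0;
            Some(credit) // R sends WindowUpdate(credit) while S is still sending
        } else {
            None
        }
    }
}

fn main() {
    let mut window = ReceiveWindow { max_window: 256 * 1024, consumed: 0 };
    assert_eq!(window.on_consumed(100 * 1024), None); // less than half: no update yet
    assert_eq!(window.on_consumed(50 * 1024), Some(150 * 1024)); // half crossed: credit back
}
```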
## Summary
Long story short, to prevent senders from being blocked every time they have sent RECEIVE_WINDOW (256 KB) bytes, I am suggesting that Yamux adopt the same WindowUpdate strategy as nghttp2, namely to send a WindowUpdate message once half or more of the window has been received and consumed. An early benchmark with a patched Yamux using an adapted version of @twittner's bench tool looks promising. I still have to do more testing on a high-latency network.
What do people think? Let me know if I am missing something. Happy for any pointers to similar discussions in the past.
[1] Pollard, Barry. HTTP/2 in Action. Manning, 2019.