So intuitively, the idea is that sending (partial) window updates should compensate to some degree for senders having to pause due to transmission delays, is that right? Since if the transmission delay is negligible, the sender pretty much gets the window update as soon as the current window is exhausted (assuming the reader can consume data at the same pace as it is sent and received). What exactly do you have in mind in terms of implementation? My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0. I think it would apply equally to both of the existing `WindowUpdateMode`s. In any case, curious to see what your benchmarks show, especially if you manage to simulate different network latencies. I couldn't find any particular mention of the reasoning behind the threshold choice in nghttp2, so I just assume they made that decision based on their own benchmarks.
> So intuitively, the idea is that sending (partial) window updates should compensate to some degree for senders having to pause due to transmission delays, is that right?
Correct.
> What exactly do you have in mind in terms of implementation?
Thus far I have only tested the naive approach of sending the WindowUpdate after half of the window has been exhausted, hoping for the WindowUpdate to arrive at the sender before the sender exhausts the other half. I have not yet measured the overhead of potentially sending twice as many WindowUpdate messages back to the sender.
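To make that rule concrete, here is a minimal sketch of the naive threshold check; the names and surrounding bookkeeping are hypothetical and not taken from the actual yamux-rs internals:

```rust
/// Minimal sketch: decide whether a WindowUpdate is due, given the credit the
/// sender still has (`window_remaining`) and the configured window size.
fn window_update_due(window_remaining: u32, max_window: u32) -> bool {
    // Half or more of the window has been used by the sender: send the
    // WindowUpdate now instead of waiting for the credit to reach 0.
    window_remaining <= max_window / 2
}

fn main() {
    let max_window = 256 * 1024;
    assert!(!window_update_due(200 * 1024, max_window)); // plenty of credit left
    assert!(window_update_due(100 * 1024, max_window)); // half consumed: update now
}
```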
> My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0.
Off the top of my head I am reluctant to introduce yet another tunable. I find today's configuration surface, while small, already difficult to get right. That said, I have not yet explored the impact different numeric factors would have.
> I think it would apply equally to both of the existing `WindowUpdateMode`s.
Thanks for raising this. Thus far I was under the impression that with `OnReceive` a WindowUpdate would be sent on each new frame. Instead, today this happens on window exhaustion as well.
> In any case, curious to see what your benchmarks show, especially if you manage to simulate different network latencies.
I haven't done enough testing, but I can already share a specific use-case which looks promising. Using the altered benchmarks from https://github.com/paritytech/yamux/pull/102 with an adsl2+ connection (20 Mbit/s, 20 ms RTT) and each message being 4096 bytes, I end up with:
```
$ critcmp on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half
----- ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.00s 1700.9 KB/sec 1.00 1670.6±8.29ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.01s 2.3 MB/sec 1.00 16.6±0.01s 2.3 MB/sec
```
As one would expect, sending an early window update while concurrently using 10 streams does not have an impact on throughput: whenever one stream has exhausted its window, another stream can likely use the bandwidth of the underlying link.

Using only a single stream, one sees an increase from roughly 1.7 MB/sec to 2.3 MB/sec, thus a 35 % increase in bandwidth. Note that by sending WindowUpdate messages early, a single stream achieves the same bandwidth as 10 streams do, whether those send WindowUpdate messages early or on 0.
For the sake of completeness, here are the numbers for all network types, again comparing WindowUpdate messages sent at window exhaustion (0) vs WindowUpdate messages sent on exhaustion of half the window:
```
critcmp on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half
----- ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.03 77.5±0.41ms 50.4 MB/sec 1.00 74.9±0.82ms 52.1 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec
concurrent/mobile/#streams1/#messages1000 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec
```
The characteristics described in the previous comment show up once again for the remaining network types, except for the `unconstrained` network type. As one would expect, the `unconstrained` network type is slowed down by the increased number of WindowUpdate messages.
Numbers presented above use `WindowUpdateMode::OnRead`. As Roman pointed out earlier, sending WindowUpdate messages early can also be done when using `WindowUpdateMode::OnReceive`. I adjusted my (yet to be published) implementation to send early WindowUpdate messages in both modes. Below are the benchmark outputs (scroll to the right):
```
critcmp on-receive-on-0 on-receive-on-half on-read-on-0 on-read-on-half
group on-read-on-0 on-read-on-half on-receive-on-0 on-receive-on-half
----- ------------ --------------- --------------- ------------------
concurrent/adsl2+/#streams1/#messages1000 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec 1.41 2.3±0.00s 1702.6 KB/sec 1.00 1669.0±1.24ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec 1.00 16.6±0.01s 2.3 MB/sec 1.00 16.7±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.03 77.5±0.41ms 50.4 MB/sec 1.00 74.9±0.82ms 52.1 MB/sec 1.10 82.5±2.40ms 47.3 MB/sec 1.01 75.7±0.96ms 51.6 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec 1.03 765.2±12.58ms 51.0 MB/sec 1.03 766.7±17.11ms 51.0 MB/sec
concurrent/mobile/#streams1/#messages1000 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec 1.27 7.0±0.01s 567.8 KB/sec 1.00 5.5±0.01s 721.7 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec 1.00 55.1±0.02s 725.7 KB/sec 1.00 55.2±0.07s 725.1 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec 1.16 10.9±0.30ms 360.0 MB/sec 1.20 11.3±0.28ms 347.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec 1.17 116.7±3.46ms 334.9 MB/sec 1.17 116.7±2.87ms 334.8 MB/sec
```
The same characteristics described above for `WindowUpdateMode::OnRead` seem to apply to `WindowUpdateMode::OnReceive` as well.
https://github.com/libp2p/rust-libp2p/issues/1849 describes a related performance optimization proposal. As a receiver, when expecting to receive a large message (say 10 MB), one could increase the receive window up-front, allowing the sender to send the entire large message in one go instead of sending chunks bounded by the default receive window.
The below benchmarks use a message size of 10MB following the example in https://github.com/libp2p/rust-libp2p/issues/1849.
First off, let's look at the performance of Yamux based on the current `develop` branch. The receive window is left unchanged (256 KB), WindowUpdate messages are sent on window exhaustion, and `WindowUpdateMode::OnRead` is used.
```
critcmp on-read-on-0-with-10MB-msg
group on-read-on-0-with-10MB-msg
----- --------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 6.1±0.03s 1684.8 KB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.6±0.14s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 270.6±27.16ms 36.9 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.8±0.07s 36.1 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 16.0±0.00s 638.3 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.24s 727.2 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 3.8±0.10ms 2.6 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 33.3±2.46ms 2.9 GB/sec
```
As already noted in previous comments, the current implementation on `develop` suffers decreased bandwidth when using a single stream, given that the sender side is blocked every 256 KB for a full round-trip (see adsl2+: 1684.8 KB/sec vs 2.3 MB/sec), waiting for the receiver to send a WindowUpdate message.
Running the same benchmark once more with a receive window set to the message size (10 MB), as suggested in https://github.com/libp2p/rust-libp2p/issues/1849, the above-mentioned decrease in bandwidth when using a single stream is gone (see adsl2+: 2.3 MB/sec == 2.3 MB/sec).
```
critcmp on-read-on-0-with-10MB-msg-with-10MB-receive-window
group on-read-on-0-with-10MB-msg-with-10MB-receive-window
----- ---------------------------------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 4.3±0.02s 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.7±0.06s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 272.6±14.40ms 36.7 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.6±0.19s 38.9 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 14.2±0.01s 722.8 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.9±0.22s 726.7 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 10.6±0.57ms 945.6 MB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 54.9±3.17ms 1820.7 MB/sec
```
How would the proposal described in this issue cope with a large message size (10MB), sending WindowUpdate messages early but leaving the receive window at the default value (256 KB)? Off the top of my head, I expected similar numbers to the optimization suggested in https://github.com/libp2p/rust-libp2p/issues/1849 given that with early WindowUpdate messages the sender should never be blocked for a whole round-trip waiting for a WindowUpdate message.
```
critcmp on-read-on-half-with-10MB-msg
group on-read-on-half-with-10MB-msg
----- -----------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 6.1±0.02s 1684.7 KB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.5±0.09s 2.4 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 272.7±23.24ms 36.7 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.7±0.09s 36.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 16.1±0.03s 636.6 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.11s 727.5 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 4.0±0.06ms 2.5 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 33.8±1.11ms 2.9 GB/sec
```
My assumption turns out to be wrong. In the single-stream case one achieves a bandwidth similar to the first benchmark using plain `develop`: no early WindowUpdate messages, no increased receive window. Why is that the case? Shouldn't the sender be able to continuously send frames, given that the receiver sends WindowUpdate messages before the sender exhausts its window?
The answer is simple: when sending a data frame, Yamux chooses the smaller of (1) the available credit (window) and (2) the available data to be sent as the size of the next frame. In the large message (10 MB) example each frame will thus have the size of the available credit. The receiver considers sending a new WindowUpdate message before each received frame. Given that each frame exhausts the entire window, the receiver never sends early WindowUpdate messages but instead sends them once the window is 0 (after each frame). The sender ends up being blocked for a whole round-trip waiting for the receiver to send a WindowUpdate message, just like the Yamux version on the current `develop` branch without special configuration.
As mentioned before, Yamux is inspired by HTTP/2, so how does HTTP/2 solve this issue?
Technically, the length field allows payloads of up to 2^24 bytes (~16 MB) per frame. However, the HTTP/2 standard sets the default maximum payload size of DATA frames to 2^14 bytes (~16 KB) per frame and allows the client and server to negotiate a higher value. Bigger is not always better: a smaller frame size enables efficient multiplexing and minimizes head-of-line blocking.

> Sending large frames can result in delays in sending time-sensitive frames (such as RST_STREAM, WINDOW_UPDATE, or PRIORITY), which, if blocked by the transmission of a large frame, could affect performance.

https://tools.ietf.org/html/rfc7540#section-4.2
To achieve a bandwidth similar to the proposal in https://github.com/libp2p/rust-libp2p/issues/1849, how about sending only small data frames, as the HTTP/2 specification suggests, allowing the receiver to send early WindowUpdate messages in between frames? Below is the output of a benchmark run sending early WindowUpdate messages, using the default receive window (256 KB), and limiting the size of data frames to half of the default receive window (256 KB / 2 = 128 KB). Sending multiple small frames instead of one large frame introduces an overhead due to the additional Yamux headers. A Yamux header is 12 bytes large, so adding an additional header for every 128 KB of payload is a negligible overhead (12 bytes header / (256 * 1024 default receive window / 2) = 0.000091553).
```
critcmp on-read-on-half-with-10MB-msg-split-frames
group on-read-on-half-with-10MB-msg-split-frames
----- ------------------------------------------
concurrent/adsl2+/#streams1/#messages1 1.00 4.3±0.02s 2.4 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.4±0.04s 2.4 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.00 240.6±14.98ms 41.6 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.00 2.7±0.08s 37.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.00 14.1±0.01s 727.5 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.6±0.08s 728.3 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.00 4.1±0.11ms 2.4 GB/sec
concurrent/unconstrained/#streams10/#messages1 1.00 37.5±0.83ms 2.6 GB/sec
```
The benchmark shows that with (1) early WindowUpdate messages and (2) data frames restricted in size, one achieves, in the single-stream case and across the different network types, the same bandwidth as the proposal described in https://github.com/libp2p/rust-libp2p/issues/1849, all without the need to configure the window size.
(Note: It might still be worth increasing the window size when operating on top of a network with a high bandwidth-delay product (_long fat network_). As a first guess I would expect Yamux to be operated mostly on networks with a bandwidth-delay product below the default window size (256 KB) (see examples), though I would need to put more thought and benchmarking into this to form a solid opinion.)
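For intuition, here is a back-of-the-envelope bandwidth-delay-product check using only the adsl2+ numbers quoted earlier in this thread (20 Mbit/s, 20 ms RTT); it is an illustration, not a measurement:

```rust
fn main() {
    // adsl2+ profile from the benchmarks above: 20 Mbit/s, 20 ms RTT.
    let bandwidth_bits_per_sec = 20_000_000.0;
    let rtt_secs = 0.020;
    let bdp_bytes = bandwidth_bits_per_sec * rtt_secs / 8.0;
    // Roughly 50 KB in flight, comfortably below the 256 KB default receive window.
    println!("BDP ≈ {:.0} KiB", bdp_bytes / 1024.0);
}
```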
Above @romanb started the discussion on when to send an early WindowUpdate message.
> What exactly do you have in mind in terms of implementation? My first thought would be to introduce a numeric factor to the configuration that influences the threshold for window updates w.r.t. the current receive window, and which is currently fixed to 0.
Thus far I have only experimented with the naive approach of sending a WindowUpdate message after half or more of the window has been used. How does sending WindowUpdate messages earlier or later than that influence bandwidth?
First off, to establish some groundwork, the below benchmark output compares what we already have today (`on-read-on-0`) with the case of having an infinite window (`on-read-max-window`) and sending the WindowUpdate message after half the window has been consumed (`on-read-on-half`).
```
critcmp on-read-max-window on-read-on-half on-read-on-0
group on-read-max-window on-read-on-0 on-read-on-half
----- ------------------ ------------ ---------------
concurrent/adsl2+/#streams1/#messages1000 1.03 1723.4±2.91ms 2.3 MB/sec 1.41 2.4±0.01s 1695.6 KB/sec 1.00 1672.1±3.50ms 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.00s 2.3 MB/sec 1.00 16.6±0.02s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.00 74.5±0.24ms 52.4 MB/sec 1.04 77.5±0.41ms 50.4 MB/sec 1.01 74.9±0.82ms 52.1 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.00 746.4±5.38ms 52.3 MB/sec 1.00 744.5±6.81ms 52.5 MB/sec 1.00 747.4±6.54ms 52.3 MB/sec
concurrent/mobile/#streams1/#messages1000 1.02 5.6±0.01s 708.3 KB/sec 1.27 7.1±0.01s 566.4 KB/sec 1.00 5.5±0.01s 721.0 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.3±0.04s 723.3 KB/sec 1.00 55.4±0.08s 722.6 KB/sec 1.00 55.3±0.05s 723.5 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.13 10.6±0.76ms 369.7 MB/sec 1.00 9.4±0.57ms 417.3 MB/sec 1.05 9.8±0.43ms 398.0 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.28 128.2±6.17ms 304.6 MB/sec 1.00 100.0±3.05ms 390.4 MB/sec 1.08 108.5±4.89ms 360.1 MB/sec
```
As you can see, sending the WindowUpdate message on 0 (`on-read-on-0`) performs worst (e.g. adsl2+ at 1695.6 KB/sec). Having an infinite window (`on-read-max-window`), which I would expect to achieve the maximum bandwidth, performs better on every network type (e.g. adsl2+ at 2.3 MB/sec). The last strategy, the one proposed in this GitHub issue, sending WindowUpdate messages after half of the window has been consumed, performs equal to or better than having an infinite window (`on-read-max-window`) (e.g. adsl2+ at 2.3 MB/sec).
How do other strategies behave? Below is the benchmark output for sending the WindowUpdate message after 1/4 (`on-read-on-quarter`) as well as after 3/4 (`on-read-on-three-quarter`) of the window has been consumed.
```
critcmp on-read-on-quarter on-read-on-three-quarter
group on-read-on-quarter on-read-on-three-quarter
----- ------------------ ------------------------
concurrent/adsl2+/#streams1/#messages1000 1.00 1668.8±0.65ms 2.3 MB/sec 1.26 2.1±0.01s 1896.1 KB/sec
concurrent/adsl2+/#streams10/#messages1000 1.00 16.6±0.02s 2.3 MB/sec 1.00 16.7±0.04s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1000 1.00 76.4±0.93ms 51.1 MB/sec 1.00 76.2±1.47ms 51.3 MB/sec
concurrent/gbit-lan/#streams10/#messages1000 1.01 757.3±8.36ms 51.6 MB/sec 1.00 749.7±10.77ms 52.1 MB/sec
concurrent/mobile/#streams1/#messages1000 1.00 5.5±0.00s 723.2 KB/sec 1.01 5.6±0.00s 716.6 KB/sec
concurrent/mobile/#streams10/#messages1000 1.00 55.1±0.02s 725.8 KB/sec 1.00 55.2±0.08s 724.6 KB/sec
concurrent/unconstrained/#streams1/#messages1000 1.02 10.5±0.44ms 371.4 MB/sec 1.00 10.3±0.36ms 379.7 MB/sec
concurrent/unconstrained/#streams10/#messages1000 1.00 109.4±3.42ms 357.0 MB/sec
```
Sending the WindowUpdate message after 1/4 of the window has been consumed (`on-read-on-quarter`) seems to be roughly equal in bandwidth to sending the WindowUpdate message on half (`on-read-on-half`). Sending WindowUpdate messages later, after 3/4 of the window has been consumed, shows a degradation in bandwidth for the `adsl2+` network type, but similar results for all others.

For now I will stick with sending the WindowUpdate message after half of the window has been consumed. I am happy to benchmark other alternative strategies though.
Following up once more on https://github.com/paritytech/yamux/issues/100#issuecomment-774461550, more specifically on limiting the maximum size of the payload in a Yamux data frame.
The benchmarks below are run on top of https://github.com/paritytech/yamux/pull/109 using a message size of 10 MiB.
- `on-read-on-half-10MB-msg` does not restrict the payload size, thus defaulting to the maximum window size of 256 KiB.
- `on-read-on-half-10MB-msg-split-128kb` limits the payload size to 128 KiB (half of the default window size). Overhead 0.000091553.
- `on-read-on-half-10MB-msg-split-16kb` limits the payload size to 16 KiB (same as HTTP/2, see spec). Overhead 0.000732422.
- `on-read-on-half-10MB-msg-split-1kb` limits the payload size to 1 KiB. Overhead 0.01171875.

```
critcmp on-read-on-half-10MB-msg on-read-on-half-10MB-msg-split-128kb on-read-on-half-10MB-msg-split-16kb on-read-on-half-10MB-msg-split-1kb
group on-read-on-half-10MB-msg on-read-on-half-10MB-msg-split-128kb on-read-on-half-10MB-msg-split-16kb on-read-on-half-10MB-msg-split-1kb
----- ------------------------ ------------------------------------ ----------------------------------- ----------------------------------
concurrent/adsl2+/#streams1/#messages1 1.42 6.0±0.01s 1696.0 KB/sec 1.00 4.2±0.01s 2.4 MB/sec 1.00 4.3±0.01s 2.3 MB/sec 1.02 4.3±0.01s 2.3 MB/sec
concurrent/adsl2+/#streams10/#messages1 1.00 42.4±0.01s 2.4 MB/sec 1.00 42.4±0.05s 2.4 MB/sec 1.00 42.5±0.07s 2.4 MB/sec 1.01 42.9±0.03s 2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1 1.81 246.6±6.43ms 40.5 MB/sec 1.99 270.7±4.51ms 36.9 MB/sec 1.49 202.7±18.99ms 49.3 MB/sec 1.00 136.2±1.88ms 73.4 MB/sec
concurrent/gbit-lan/#streams10/#messages1 1.64 2.3±0.05s 43.1 MB/sec 1.88 2.7±0.10s 37.6 MB/sec 1.34 1902.0±16.45ms 52.6 MB/sec 1.00 1417.7±35.28ms 70.5 MB/sec
concurrent/mobile/#streams1/#messages1 1.14 16.1±0.02s 637.7 KB/sec 1.00 14.1±0.00s 728.1 KB/sec 1.00 14.1±0.02s 727.1 KB/sec 1.01 14.2±0.00s 719.9 KB/sec
concurrent/mobile/#streams10/#messages1 1.00 140.8±0.23s 727.2 KB/sec 1.00 140.7±0.21s 727.8 KB/sec 1.00 140.9±0.18s 726.6 KB/sec 1.01 142.3±0.11s 719.4 KB/sec
concurrent/unconstrained/#streams1/#messages1 1.46 6.4±0.39ms 1555.8 MB/sec 1.00 4.4±0.24ms 2.2 GB/sec 1.83 8.1±0.41ms 1240.2 MB/sec 22.57 99.5±7.80ms 100.5 MB/sec
concurrent/unconstrained/#streams10/#messages1 1.61 69.6±5.01ms 1437.0 MB/sec 1.00 43.2±5.82ms 2.3 GB/sec 2.50 107.9±5.05ms 927.2 MB/sec 24.99 1078.9±165.42ms 92.7 MB/sec
```
The benchmarks do not give a clear winner. I am leaning towards simply using 16 KiB, given that it outperforms the status quo in each run (except for the unconstrained network type), introduces low overhead, and is in line with the HTTP/2 spec (though I could not yet find the reasoning for choosing 16 KiB in HTTP/2).
To add some data to the above from a more realistic environment, I have set up a virtual server in Helsinki. On this server I deployed:

- `WindowUpdateMode::OnRead`.
- `WindowUpdateMode::OnReceive`.

First off, to establish a baseline, here is the output of an iperf run:
```
iperf -c xxx.xxxx.xxx.xxx
------------------------------------------------------------
Client connecting to xxx.xxx.xxx.xxx, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.107 port 56634 connected with xxx.xxx.xxx.xxx port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.9 sec  14.5 MBytes  11.2 Mbits/sec
```
Running option (1) (current develop branch):
```
./client-old --server-address /ip4/xxx.xxx.xxx.xxx/tcp/9992
Interval        Transfer        Bandwidth
0 s - 10.26 s   9 MBytes        7.01 MBit/s
```
Running option (2) (early window updates, small frames):
```
./client-new --server-address /ip4/xxx.xxx.xxx.xxx/tcp/9992
Interval        Transfer        Bandwidth
0 s - 10.04 s   11 MBytes       8.76 MBit/s
```
I would boldly claim that this whole effort does get us a bit closer to what a plain TCP connection can do :tada:.
## Flow control in Yamux and HTTP/2
The Yamux flow control mechanism is very similar to HTTP/2's flow control. This is no surprise, given that Yamux is inspired by the early SPDY efforts.
In both Yamux and HTTP/2 the WindowUpdate message is an integral part of the flow control mechanism.
https://github.com/hashicorp/yamux/blob/master/spec.md#flow-control
https://tools.ietf.org/html/rfc7540#section-5.2.1
In HTTP/2 it is up to the receiver when to send a WindowUpdate message. If I understand the short Yamux specification correctly, the same applies to Yamux.
See https://tools.ietf.org/html/rfc7540#section-5.2.1
For a general overview of HTTP/2 flow control I can recommend "HTTP/2 in Action" [1]. Chapter 7.2 on the topic can be (pre-)viewed on the publisher's website.
## WindowUpdate message strategies
HTTP/2 implementations can use WindowUpdate messages to implement various (possibly advanced) flow control strategies. One example of a simple strategy is the nghttp2 library, which sends a WindowUpdate message once it (the receiver) has received and consumed more than half of the flow control window.
Today, with `WindowUpdateMode::OnRead`, this Yamux implementation sends a WindowUpdate message once (a) the window credit has been fully depleted and (b) the read buffer is empty, thus all bytes have been consumed. See the implementation for details.

## Comparison
Imagine the following simplified scenario:
A sender S is communicating with a receiver R. S wants to send 1 MB in multiple chunks to R. R uses a receive window of 256 KB (Yamux default). S and R are connected on some network inducing both a delay and a bandwidth constraint.
Algorithm 1 (Yamux today): Send a WindowUpdate message once (a) the window is fully depleted by the sender and (b) the receiver has consumed all buffered bytes.

Once S has depleted its window, having sent 256 KB to R, S is blocked and has to wait for a WindowUpdate message to be sent by R. This message is only sent by R once 256 KB have been received and all of those bytes are consumed. Thus every 256 KB the sender is blocked, having to wait for a whole round-trip (for all its data to arrive as well as for the WindowUpdate to be received).
Algorithm 2 (nghttp2): Send a WindowUpdate message once half or more of the window has been received and consumed.

While S sends the first 256 KB of data, R receives and consumes chunks of that data in parallel. Instead of waiting for the window to be fully depleted by S and fully consumed by R, R already sends a small WindowUpdate message (crediting the number of bytes consumed of the current window) once half or more of the window has been depleted and consumed. This WindowUpdate will likely arrive at S before S depletes the entire window, and thus S never stalls.
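A minimal sketch of Algorithm 2 on the receiver side, using a hypothetical bookkeeping struct (this is not the actual yamux-rs or nghttp2 implementation):

```rust
struct ReceiveWindow {
    max_window: u32,
    // Bytes received and handed to the application since the last WindowUpdate.
    consumed: u32,
}

impl ReceiveWindow {
    /// Returns the credit to send back in a WindowUpdate, if one is due now.
    fn on_consumed(&mut self, n: u32) -> Option<u32> {
        self.consumed += n;
        if self.consumed >= self.max_window / 2 {
            let credit = self.consumed;
            self.consumed = 0;
            Some(credit) // R sends WindowUpdate(credit) while S is still sending
        } else {
            None
        }
    }
}

fn main() {
    let mut window = ReceiveWindow { max_window: 256 * 1024, consumed: 0 };
    assert_eq!(window.on_consumed(100 * 1024), None); // less than half: no update yet
    assert_eq!(window.on_consumed(50 * 1024), Some(150 * 1024)); // half crossed: credit back
}
```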
## Summary
Long story short, to prevent senders from being blocked every time they have sent RECEIVE_WINDOW (256 KB) bytes, I am suggesting that Yamux adopt the same WindowUpdate strategy as nghttp2, namely to send a WindowUpdate message once half or more of the window has been received and consumed. An early benchmark with a patched Yamux using an adapted version of @twittner's bench tool looks promising. I still have to do more testing on a high-latency network.
What do people think? Let me know if I am missing something. Happy for any pointers to similar discussions in the past.
[1] Pollard, Barry. HTTP/2 in Action. Manning, 2019.