CLI chunks take about 600ms each while console POSTs seem to be taking about 2 seconds each.
Seeing some strange behavior when I compare doing one request at a time to multiple.
Pretty clearly bifurcated between ~900ms calls and ~2000ms calls with like 1/4 of them being the slow ones.
About the same.
Now nearly all the requests take 2 seconds.
@david-crespo I suspect this is due to the browser using HTTP/2, which is multiplexing a single TCP connection for all six of the requests we're running in parallel. The CLI establishes 8 separate TCP connections, which are individually less efficient than the browser's single connection, but have much greater total throughput.
You can see in Wireshark that each of the CLI connections has about half the throughput of the Console's, 6.5 Mbps vs. 13 Mbps, but combined they can push ~52 Mbps.
(Ignore the packet counts and total amount written in the screenshots, I didn't run these for the same amount of time.)
Yes, thank you! I just discovered this a few days ago as well, but did not have as solid a confirmation of what the CLI was doing. (The image on the right is going through a local proxy server that can't do HTTP/2, so it falls back to HTTP/1.1 and is faster.)
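A rough way to reproduce that comparison from the command line, as a sketch only: $URL and chunk.bin are placeholders, not the actual upload endpoint.

$ # one multiplexed HTTP/2 connection, as the browser uses:
$ curl --http2 --data-binary @chunk.bin "$URL"
$ # eight parallel HTTP/1.1 connections, approximating the CLI's behavior:
$ for i in $(seq 8); do curl --http1.1 --data-binary @chunk.bin "$URL" & done; wait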
I've tried looking at things we can change in the browser, but I wonder if there is something we can change on the server to give more bandwidth to that single TCP connection.
Let me caveat the following by noting that my understanding of TCP and network tunables in illumos is limited.
I think the issue may be the relatively small receive window being set by Nexus. For both the CLI and the Console, the maximum window I observe is ~128 KB, which may be preventing a single connection from keeping as much traffic in flight as the eight separate connections can.
Here we can see the Console sending a burst of traffic, then waiting for the ACKs to come in before proceeding.
The CLI shows a similar pattern, but more extreme; presumably this is because we're dealing with a single logical stream.
According to some very old Solaris docs, tcp_recv_hiwat controls the default receive window size. Poking around in a Nexus zone, I see the following TCP settings:
root@oxz_nexus_65a11c18:~# ipadm show-prop tcp
PROTO PROPERTY              PERM CURRENT    PERSISTENT DEFAULT    POSSIBLE
tcp   congestion_control    rw   sunreno    --         sunreno    sunreno,newreno,
                                                                  cubic
tcp   ecn                   rw   passive    --         passive    never,passive,
                                                                  active
tcp   extra_priv_ports      rw   2049,4045  --         2049,4045  1-65535
tcp   largest_anon_port     rw   65535      --         65535      1024-65535
tcp   max_buf               rw   1048576    --         1048576    8192-1073741824
tcp   recv_buf              rw   128000     --         128000     2048-1048576
tcp   sack                  rw   active     --         active     never,passive,
                                                                  active
tcp   send_buf              rw   49152      --         49152      4096-1048576
tcp   smallest_anon_port    rw   32768      --         32768      1024-65535
tcp   smallest_nonpriv_port rw   1024       --         1024       1024-32768
root@oxz_nexus_65a11c18:~# ndd -get /dev/tcp tcp_recv_hiwat
128000
Interestingly, RTT for the Console is about half that of the CLI. My naive guess is that this is due to the larger number of connections, but I have done nothing to validate this.
Average RTT for Console is 0.07 seconds:
$ tshark -Y "(ip.src[0]==45 && ip.src[1]==154) || (ip.dst[0]==45 && ip.dst[1]==154)" -r ~/console.pcapng.lz4 -Tfields -e "tcp.analysis.ack_rtt" | agrind '* | parse "*" as rtt | avg(rtt), max(rtt), min(rtt), p99(rtt)'
_average    _max    _min    p99
----------------------------------------------------
0.07        0.18    0.00    0.08
Average RTT for CLI is 0.13 seconds:
$ tshark -Y "(ip.src[0]==45 && ip.src[1]==154) || (ip.dst[0]==45 && ip.dst[1]==154)" -r ~/cli-2.pcapng.lz4 -Tfields -e "tcp.analysis.ack_rtt" | agrind '* | parse "*" as rtt | avg(rtt), max(rtt), min(rtt), p99(rtt)'
_average    _max    _min    p99
----------------------------------------------------
0.13        0.88    0.00    0.21
Maximum throughput in Mbps should be roughly $\frac{\text{window size in bits}}{\text{RTT in seconds} \times 10^6}$.
Theoretical max for Console: 14.63 Mbps
Theoretical max for one CLI connection: 7.88 Mbps
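For clarity, here is the arithmetic behind those numbers, using the ~128,000-byte (1,024,000-bit) window observed above:

$\text{Console: } \frac{1{,}024{,}000}{0.07 \times 10^6} \approx 14.63 \text{ Mbps} \qquad \text{CLI: } \frac{1{,}024{,}000}{0.13 \times 10^6} \approx 7.88 \text{ Mbps}$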
These are a bit faster than what we see in practice, but we're not always at the maximum window size. Does this seem plausible to you?
Very interesting. I bet we can test that on dogfood by simply increasing the value of recv_buf.
Check this out: https://www.illumos.org/issues/15019
We temporarily increased the nexus zones' recv_buf on colo by 8x to 1,024,000 bytes with:
$ pilot host exec -c "zoneadm list | grep nexus | xargs -t -I% zlogin % ipadm set-prop -t -p recv_buf=1024000 tcp" 0-31
$ pilot host exec -c "zoneadm list | grep nexus | xargs -t -I% svcadm -z % restart nexus" 0-31
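To confirm the temporary setting took effect across the sleds, the same pattern can be reused with show-prop (a sketch, mirroring the commands above):

$ pilot host exec -c "zoneadm list | grep nexus | xargs -t -I% zlogin % ipadm show-prop -p recv_buf tcp" 0-31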
In my testing with Console, upload speeds went from 13 Mbps to 40 Mbps:
RTT increased a bit:
$ tshark -Y "(ip.src[0]==45 && ip.src[1]==154) || (ip.dst[0]==45 && ip.dst[1]==154)" -r ~/console-large-buf.pcapng.lz4 -Tfields -e "tcp.analysis.ack_rtt" | agrind '* | parse "*" as rtt | avg(rtt), max(rtt), min(rtt), p99(rtt)'
_average    _max    _min    p99
----------------------------------------------------
0.13        0.34    0.00    0.24
Peak window size increased to ~1 MB, as expected:
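Applying the same formula with the new window (1,024,000 bytes = 8,192,000 bits) and the new 0.13 s average RTT gives a theoretical max of

$\frac{8{,}192{,}000}{0.13 \times 10^6} \approx 63 \text{ Mbps}$

so the observed 40 Mbps is comfortably below the window-imposed ceiling.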
For the CLI, throughput went down from 52 Mbps to 45 Mbps, but I suspect this may be due to fluctuations in the maximum upload bandwidth available to my home internet connection.
RTT increased for the CLI as well:
$ tshark -Y "(ip.src[0]==45 && ip.src[1]==154) || (ip.dst[0]==45 && ip.dst[1]==154)" -r ~/cli-large-buf.pcapng.lz4 -Tfields -e "tcp.analysis.ack_rtt" | agrind '* | parse "*" as rtt | avg(rtt), max(rtt), min(rtt), p99(rtt)'
_average    _max    _min    p99
----------------------------------------------------
0.47        3.42    0.00    0.92
@wfchandler you are a magician, great work!
I've opened https://github.com/oxidecomputer/omicron/issues/6601 to track a fix from the control plane side.
I'm going to close this because there's no work to do in the console. Thanks!
My manual testing confirmed @taspelund's report that the CLI is at least 5x faster at uploading images than the web console. The CLI uses 8 threads instead of the 6 we're limited to in the browser, but obviously that alone can't explain this. Needs more investigation.
I was at least able to confirm that skipping empty chunks is working as expected on web, so that is not the problem. The individual requests seem to just take longer.