private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

bbr bandwidth is smaller than other algorithms #1483

Closed liangleikun closed 1 year ago

liangleikun commented 1 year ago

Thank you for your work! I tested different congestion algorithms using picoquicdemo. The BBR bandwidth is smaller than with the other algorithms, and I don't know what the problem is.

The server:

./picoquicdemo -k server-key-less-password.pem -c ca-cert.pem -p 4433 -F server_log2.csv

The client with BBR:

usr@4G Route:/tmp/mp# ./picoquicdemo -F client_log.csv -C 20 -q ./ -a perf 47.99.90.102 4433 "*1:0:-:50000000:500;"
Starting Picoquic (v1.04b) connection to server = 47.99.90.102, port = 4433
No token file present. Will create one as .
Getting ready to run QUICPERF
Careful: NULL SNI is incompatible with HTTP 3. Expect errors!
Max stream id bidir remote before start = 0 (0)
Starting client connection. Version = 1, I-CID: ab31cb4d153ff20
Max stream id bidir remote after start = 0 (0)
Waiting for packets.
Client port (AF=2): 41485.
Negotiated ALPN: perf
Almost ready!

Connection established. Version = 1, I-CID: ab31cb4d153ff20, verified: 1
The connection is closed!
Quic Bit was NOT greased by the client.
Quic Bit was NOT greased by the server.
ECN was not received.
ECN was not acknowledged.
Connection_duration_sec: 123.444726
Nb_transactions: 1
Upload_bytes: 50000000
Download_bytes: 500
TPS: 0.008101
Upload_Mbps: 3.240317
Download_Mbps: 0.000032
max_data_local: 536870912
max_max_stream_data_local: 536870912
max_data_remote: 254803968
max_max_stream_data_remote: 143543745
ack_delay_remote: 1000 ... 10000
max_ack_gap_remote: 2
ack_delay_local: 4770 ... 25000
max_ack_gap_local: 43
max_mtu_sent: 1440
max_mtu_received: 1440
Client exit with code = 0


The client with cubic:

usr@4G Route:/tmp/mp# ./picoquicdemo -G cubic -F client_log.csv -C 20 -q ./ -a perf 47.99.90.102 4433 "*1:0:-:50000000:500;"
Starting Picoquic (v1.04b) connection to server = 47.99.90.102, port = 4433
Getting ready to run QUICPERF
Careful: NULL SNI is incompatible with HTTP 3. Expect errors!
Max stream id bidir remote before start = 0 (0)
Starting client connection. Version = 1, I-CID: 2eead19088b105ff
Max stream id bidir remote after start = 0 (0)
Waiting for packets.
Client port (AF=2): 50302.
Negotiated ALPN: perf
Almost ready!

Connection established. Version = 1, I-CID: 2eead19088b105ff, verified: 1
The connection is closed!
Quic Bit was NOT greased by the client.
Quic Bit was NOT greased by the server.
ECN was not received.
ECN was not acknowledged.
Connection_duration_sec: 16.544891
Nb_transactions: 1
Upload_bytes: 50000000
Download_bytes: 500
TPS: 0.060442
Upload_Mbps: 24.176648
Download_Mbps: 0.000242
max_data_local: 536870912
max_max_stream_data_local: 536870912
max_data_remote: 254803968
max_max_stream_data_remote: 143543745
ack_delay_remote: 1000 ... 10000
max_ack_gap_remote: 2
ack_delay_local: 5318 ... 25000
max_ack_gap_local: 228
max_mtu_sent: 1440
max_mtu_received: 1440
Client exit with code = 0

Attachment: qlog.zip
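Both runs upload the same 50,000,000 bytes, so the throughput gap is entirely in the connection duration. As a quick sanity check of the reported Upload_Mbps values, here is a small standalone C sketch (not part of picoquic) that recomputes them from the fields above:

#include <stdio.h>

/* Recompute Upload_Mbps from the perf report fields:
 * throughput (Mbps) = upload_bytes * 8 / duration_sec / 1e6 */
static double upload_mbps(double upload_bytes, double duration_sec)
{
    return upload_bytes * 8.0 / duration_sec / 1e6;
}

int main(void)
{
    /* BBR run:   50,000,000 bytes in 123.444726 s */
    /* cubic run: 50,000,000 bytes in 16.544891 s  */
    printf("bbr:   %.2f Mbps\n", upload_mbps(50000000.0, 123.444726));
    printf("cubic: %.2f Mbps\n", upload_mbps(50000000.0, 16.544891));
    return 0;
}

This prints about 3.24 Mbps for the BBR run and 24.18 Mbps for the cubic run, matching the reports.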

huitema commented 1 year ago

Interesting. From the traces, it seems that BBR started well, but at some point decided to shrink the window to a minimal size. It would be nice to understand what caused that. Can you tell us more about the test conditions? What type of link? Are you using a simulator?

liangleikun commented 1 year ago

I tested it using the 4G network interface on my router. The router runs OpenWrt (mt7261 SoC).

huitema commented 1 year ago

Something really surprising is happening. If I look at the details in the Qlog of the BBR run, I see the control parameters at t=2738.12 as:

{
  "cwnd": 8261348,
  "pacing_rate": 1645714285,
  "bytes_in_flight": 180089,
  "smoothed_rtt": 46005,
  "min_rtt": 20391,
  "latest_rtt": 45807
}

Notice the pacing rate of over 1.6 Gbps!

Then, I see an acknowledgement arriving, and at t=2744.22 the measurements become:

{
  "cwnd": 247608,
  "pacing_rate": 11260997,
  "bytes_in_flight": 168569,
  "smoothed_rtt": 46263,
  "min_rtt": 20391,
  "latest_rtt": 48075
}

That is, a drop from 1.6 Gbps to about 11 Mbps. No loss is observed, which means that after sending at a very high rate, BBR has seen delivery-rate measurements of 11 Mbps or lower for at least 6 RTTs. Also, a huge queue has built up. After that, BBR will see rates dropping more and more.
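One way to read the 6-RTT figure: BBR's bandwidth estimate is the maximum over a sliding window of recent delivery-rate samples (on the order of half a dozen to ten round trips in the BBR drafts), so the pacing rate can only collapse to 11 Mbps once every sample still inside that window is that low. Below is a minimal sketch of such a windowed max filter, with assumed names and window size, not picoquic's actual data structure:

#include <stdint.h>

/* Illustrative windowed-max bandwidth filter, loosely following the BBR
 * drafts: the estimate is the maximum delivery-rate sample seen over the
 * last BW_FILTER_LEN round trips. */
#define BW_FILTER_LEN 6

typedef struct {
    uint64_t sample_bps[BW_FILTER_LEN]; /* one max sample kept per round trip */
    int head;                           /* slot for the current round trip */
} bw_filter_t;

/* Record a delivery-rate sample for the current round trip. */
void bw_filter_update(bw_filter_t* f, uint64_t sample_bps)
{
    if (sample_bps > f->sample_bps[f->head]) {
        f->sample_bps[f->head] = sample_bps;
    }
}

/* Advance to the next round trip, discarding the oldest slot. */
void bw_filter_next_round(bw_filter_t* f)
{
    f->head = (f->head + 1) % BW_FILTER_LEN;
    f->sample_bps[f->head] = 0;
}

/* The estimate only drops once every slot in the window is low. */
uint64_t bw_filter_estimate(const bw_filter_t* f)
{
    uint64_t max = 0;
    for (int i = 0; i < BW_FILTER_LEN; i++) {
        if (f->sample_bps[i] > max) {
            max = f->sample_bps[i];
        }
    }
    return max;
}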

What happened? At the beginning of the connection, BBR progresses by trying to send ever faster. At the beginning of each 1-RTT epoch it sets the data rate to something like the previous data rate times sqrt(2). It is supposed to stop doing that when it observes that the measured bandwidth has stopped increasing, at which point it decreases a bit. I think that point was reached at t=2708.39, when the pacing rate started decreasing from 1.9 Gbps to 1.6 Gbps.
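For reference, the "stop when the measured bandwidth stops increasing" rule corresponds to the startup-exit check in the BBR drafts: leave startup once the bandwidth estimate has failed to grow by roughly 25% for a few consecutive rounds. The sketch below uses assumed names and thresholds and is not picoquic's actual code:

#include <stdint.h>

/* Illustrative startup-exit check, loosely following the BBR drafts:
 * leave startup once the bandwidth estimate has failed to grow by at
 * least ~25% for 3 consecutive round trips. */
typedef struct {
    uint64_t full_bw;  /* best bandwidth estimate seen so far */
    int full_bw_count; /* rounds without significant growth */
    int filled_pipe;   /* set once startup should end */
} startup_state_t;

void check_startup_exit(startup_state_t* s, uint64_t bw_estimate)
{
    if (s->filled_pipe) {
        return;
    }
    if (bw_estimate >= s->full_bw + s->full_bw / 4) {
        /* Still growing by 25% or more: keep probing faster. */
        s->full_bw = bw_estimate;
        s->full_bw_count = 0;
    } else if (++s->full_bw_count >= 3) {
        /* Three rounds without growth: the pipe is full, exit startup. */
        s->filled_pipe = 1;
    }
}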

After that, BBR is supposed to "exit slow start" and compute the "backlog" of traffic -- the bytes in flight. It will send for a short while at a slow rate until the excess backlog is cleared, and then restart at a moderate rate corresponding to the actual link capacity. I see that happening at t=2759.26, with the parameters now set to:

{
  "cwnd": 177823,
  "pacing_rate": 32450704,
  "bytes_in_flight": 74969,
  "smoothed_rtt": 43859,
  "min_rtt": 20391,
  "latest_rtt": 36208
}

The new rate of 32 Mbps is probably close to what the network can carry. In a normal setup, that rate would be sustained until the end of the connection. This goes on for a while. BBR tries to push the pacing rate to 40 Mbps, then reduces it to 38 Mbps, then 36 Mbps, then 28 Mbps, then 9.3 Mbps, 8.38 Mbps, 5 Mbps, 4.8 Mbps, 4.0 Mbps, 3.75 Mbps, and 2.25 Mbps. After that, I see BBR trying to increase the data rate again, to 3 Mbps, 3.7 Mbps, then down to 3.5, etc. It keeps trying, but the measured data rates stay around 3 Mbps.
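The "clear the excess backlog, then settle at the link rate" step described above is BBR's drain phase: pace below the bandwidth estimate until the bytes in flight fall back to about one bandwidth-delay product, then cruise at the estimated rate. A rough sketch with assumed names and gain values, not picoquic's implementation:

#include <stdint.h>

/* Illustrative drain logic, loosely following the BBR drafts. After startup
 * overshoots, pace well below the bandwidth estimate until bytes_in_flight
 * falls back to about one BDP, then cruise at the estimated bandwidth. */
uint64_t bdp_bytes(uint64_t bw_bytes_per_sec, uint64_t min_rtt_usec)
{
    return bw_bytes_per_sec * min_rtt_usec / 1000000;
}

uint64_t next_pacing_rate(
    uint64_t bw_estimate,  /* bytes per second */
    uint64_t min_rtt_usec,
    uint64_t bytes_in_flight,
    int* in_drain)         /* in/out: 1 while draining the startup queue */
{
    uint64_t bdp = bdp_bytes(bw_estimate, min_rtt_usec);

    if (*in_drain && bytes_in_flight <= bdp) {
        *in_drain = 0;          /* queue cleared: leave the drain phase */
    }
    if (*in_drain) {
        return bw_estimate / 3; /* drain gain below 1 shrinks the queue */
    }
    return bw_estimate;         /* steady state: pace at the estimate */
}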

I think that we are seeing an interaction between BBR and some kind of policy enforcement in the network. First, BBR ramped up the bit rate to a very large value, because the network could carry that much data. But some mechanism kicked in, and decided to "punish" a connection that it probably classified as "too greedy". After that, BBR could only pass what the policing mechanism allowed.

My impression is that BBR triggered the policing mechanism when it reached the very high bit rates at the beginning of the connection. This is a known issue with network elements that enforce rate limits by measuring data rates and discarding or throttling the excess, rather than by building queues.
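Such enforcement is commonly modeled as a token-bucket policer: traffic above the configured rate is dropped rather than queued, so the flow hits a hard ceiling on delivery rate instead of seeing a deep queue. A minimal, generic sketch of the mechanism (assumed parameters; not a description of this particular 4G network):

#include <stdint.h>

/* Generic token-bucket policer sketch: traffic above rate_bytes_per_sec,
 * beyond a small burst allowance, is dropped rather than queued. */
typedef struct {
    uint64_t rate_bytes_per_sec; /* policed rate */
    uint64_t bucket_bytes;       /* current token count */
    uint64_t bucket_max_bytes;   /* burst allowance */
    uint64_t last_update_usec;
} policer_t;

/* Returns 1 if the packet is forwarded, 0 if it is dropped. */
int policer_pass(policer_t* p, uint64_t now_usec, uint64_t packet_bytes)
{
    /* Refill tokens for the elapsed time, capped at the burst size. */
    uint64_t refill =
        p->rate_bytes_per_sec * (now_usec - p->last_update_usec) / 1000000;
    p->last_update_usec = now_usec;
    p->bucket_bytes += refill;
    if (p->bucket_bytes > p->bucket_max_bytes) {
        p->bucket_bytes = p->bucket_max_bytes;
    }
    if (packet_bytes <= p->bucket_bytes) {
        p->bucket_bytes -= packet_bytes;
        return 1; /* within the policed rate: forward */
    }
    return 0;     /* over the rate: discard, no queue builds up */
}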