private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

BBR low sending rate when application-limited #1547

Closed: alexrabi closed this issue 10 months ago

alexrabi commented 10 months ago

Seems like there is a problem in the BBR implementation if the sender is application-limited rather than network-limited. If the sender only sporadically has data to send, it appears that the CWND becomes very small. This severely limits the sending rate once the application actually has data to send.

A similar problem has already been described in https://github.com/private-octopus/picoquic/issues/1499#issuecomment-1645750380

huitema commented 10 months ago

This is a classic congestion control issue. BBR sets the pacing rate based on the highest delivery rate observed in the recent past. The definition of "recent" is debatable: if the highest rate was observed a long time ago, the network conditions have probably changed and the observation is no longer valid. But if older samples are discarded too aggressively, we get the situation that you describe.
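
To illustrate the mechanism, here is a minimal sketch (in C, with hypothetical names, not the actual picoquic code) of the kind of windowed max filter BBR uses to track delivery rate. If the window only ever sees samples taken while the sender had almost nothing to send, the maximum collapses, and the pacing rate derived from it collapses with it.

```c
#include <stdint.h>

#define BW_FILTER_LEN 3  /* hypothetical window length, in round trips */

typedef struct {
    uint64_t sample_bps[BW_FILTER_LEN]; /* one max sample per round trip */
    int head;                           /* slot for the current round trip */
} bw_max_filter_t;

/* Record a delivery-rate sample for the current round trip. */
static void bw_filter_update(bw_max_filter_t *f, uint64_t sample_bps)
{
    if (sample_bps > f->sample_bps[f->head]) {
        f->sample_bps[f->head] = sample_bps;
    }
}

/* Advance to a new round trip, expiring the oldest sample. */
static void bw_filter_advance(bw_max_filter_t *f)
{
    f->head = (f->head + 1) % BW_FILTER_LEN;
    f->sample_bps[f->head] = 0;
}

/* The pacing rate is derived from the windowed maximum: if every slot was
 * filled while the sender was mostly idle, this value becomes tiny. */
static uint64_t bw_filter_max(const bw_max_filter_t *f)
{
    uint64_t max_bps = 0;
    for (int i = 0; i < BW_FILTER_LEN; i++) {
        if (f->sample_bps[i] > max_bps) {
            max_bps = f->sample_bps[i];
        }
    }
    return max_bps;
}
```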

Can you characterize the sending pattern of your application? That would help set up a simulation and understand the issue more precisely.

alexrabi commented 10 months ago

Looking at the IETF draft, BBR should ignore rate samples taken while the sender is application-limited, though I'm not sure the picoquic implementation does this correctly.
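
For reference, the rule in the IETF BBR draft (BBRUpdateMaxBw) is slightly more specific than "ignore": an application-limited sample is only used when it would not lower the current estimate. Here is a sketch of that rule, reusing the hypothetical filter above (illustrative names, not the picoquic API):

```c
typedef struct {
    uint64_t delivery_rate_bps; /* measured delivery rate for this sample */
    int is_app_limited;         /* sender had no data to send at the time */
} rate_sample_t;

static void bbr_update_max_bw(bw_max_filter_t *f, const rate_sample_t *rs)
{
    /* App-limited samples may refresh or raise the estimate, but never
     * lower it; non-app-limited samples are always fed to the filter. */
    if (rs->delivery_rate_bps >= bw_filter_max(f) || !rs->is_app_limited) {
        bw_filter_update(f, rs->delivery_rate_bps);
    }
    /* else: discard the sample; an application-limited sender tells us
     * nothing about how much the path could actually carry. */
}
```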

The sending pattern is literally TCP traffic; the problem is observed when tunneling a TCP connection over QUIC datagrams. More specifically, the problem occurs when the tunneled TCP flow is sending pure ACKs, so this could be simulated by sending one or more small datagrams roughly every round-trip time (or slightly longer).

alexrabi commented 10 months ago

After doing some more tests, it looks like this problem only happens when using multipath and there is too little data to send to make use of more than one path. If there is only a single path available, there is no issue (whether or not multipath was negotiated).

huitema commented 10 months ago

Thanks for the clarification. Our test code includes test cases for application-limited senders, which are passing, but they obviously do not cover your multipath configuration.

There is indeed a special case if there is no traffic at all on a path for some time. We probably need to add a test case for that.

huitema commented 10 months ago

Trying to understand your scenario. Apparently:

Is that a correct assessment?

Any idea of the duration of the "low traffic" interval?

alexrabi commented 10 months ago

The duration of the "low traffic" interval could essentially be infinite (e.g. a TCP flow generated by a traffic generator such as iperf), or at the very least very long (e.g. a large file download).

Funnily enough, it appears that if one creates a completely separate flow of datagrams that is only scheduled on the secondary path, the congestion control parameters on the primary path are affected (positively).

I have tried other congestion control algorithms (reno, cubic), and they do not display the same issue, though those algorithms are less than ideal for this use case for other reasons.

huitema commented 10 months ago

OK, I think I understand your scenario. It seems that there is a weird feedback loop happening, and I have to dig into it.

Just one more question. You say that the client only sends ACKs. Shall I assume that this is true for the entire connection, i.e., the client never sends more than a few ACKs per RTT?

alexrabi commented 10 months ago

I think that assumption is fine. The way I am running things right now (i.e. running iperf), there is a tiny bit of negotiation between the client and server at the start and end of the connection, though the client is limited to sending ACKs throughout the rest of the connection.

huitema commented 10 months ago

@alexrabi PR #1548 ought to solve your issue. I could not reproduce your exact scenario, but the PR fixes the handling of application-limited scenarios in BBR, as well as the accounting of small packets in the pacing code. I did add a test in which the traffic consists solely of small packets, and the PR yields a very significant improvement in how those small packets are processed.
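
For readers following along, here is a rough sketch of what "accounting of small packets in the pacing code" can look like; it is an assumed illustration of byte-based leaky-bucket pacing, not the actual change made in PR #1548. The point is that a small ACK-carrying datagram should only consume pacing credit for the bytes it actually sends, rather than being charged as a full-sized packet.

```c
#include <stdint.h>

typedef struct {
    uint64_t rate_bytes_per_sec; /* pacing rate derived from BBR */
    uint64_t bucket_bytes;       /* accumulated sending credit */
    uint64_t bucket_max_bytes;   /* burst allowance */
    uint64_t last_update_us;     /* last time credit was added */
} pacer_t;

/* Refill the credit bucket based on elapsed time, capped at the burst size. */
static void pacer_refill(pacer_t *p, uint64_t now_us)
{
    uint64_t elapsed = now_us - p->last_update_us;
    p->bucket_bytes += (elapsed * p->rate_bytes_per_sec) / 1000000;
    if (p->bucket_bytes > p->bucket_max_bytes) {
        p->bucket_bytes = p->bucket_max_bytes;
    }
    p->last_update_us = now_us;
}

/* Charge only the bytes actually sent, so a 60-byte ACK-bearing datagram
 * consumes 60 bytes of credit, not a full MTU. */
static int pacer_try_send(pacer_t *p, uint64_t now_us, uint64_t packet_bytes)
{
    pacer_refill(p, now_us);
    if (p->bucket_bytes >= packet_bytes) {
        p->bucket_bytes -= packet_bytes;
        return 1; /* OK to send now */
    }
    return 0; /* wait until enough credit has accumulated */
}
```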

alexrabi commented 10 months ago

@huitema After running a number of tests in my setup I can confirm that PR #1548 fixes the issue.

huitema commented 10 months ago

Thanks for the tests!