ietf-wg-tsvwg / tsvwg


Interaction w/ FQ AQMs #17

Open ietf-svn-bot opened 5 years ago

ietf-svn-bot commented 5 years ago

owner:draft-ietf-tsvwg-l4s-arch@ietf.org type_issue | by wes@mti-systems.com


There have been questions regarding the interaction of L4S traffic with FQ AQMs.

This was discussed at IETF 105 (topic "B"). Jonathan Morton specifically mentioned it as a concern at the open microphone.


Issue migrated from trac:17 at 2022-01-31 12:34:54 +0000

ietf-svn-bot commented 5 years ago

@wes@mti-systems.com edited the issue description

ietf-svn-bot commented 5 years ago

@wes@mti-systems.com edited the issue description

ietf-svn-bot commented 4 years ago

@wes@mti-systems.com commented


This has been clarified and elaborated on by Jonathan on the mailing list: https://mailarchive.ietf.org/arch/msg/tsvwg/kERw1493r7SC6ggKj68rh_scp_0

Testing as described in that message thread should produce results that help resolve some of the uncertainty in this case.

ietf-svn-bot commented 4 years ago

@wes@mti-systems.com changed _comment0, which was not transferred by tractive

ietf-svn-bot commented 4 years ago

@chromatix99@gmail.com commented


FQ algorithms such as DRR++ (documented in RFC-8290) are very robust at preventing a poorly responding subset of flows from affecting the throughput and delay available to other flows. The poor response noted under issue #16 is no exception, and the FQ system can successfully contain the standing queue so that only the L4S flows are affected by their own slow response to CE marks.

However, even this excellent performance by FQ cannot completely mitigate the problem when the FQ node is immediately preceded by a single queue with only slightly higher throughput capacity, a situation that is presently fairly common when a smart CPE device is installed at one end of a dumb last-mile link. In this case the standing queue in the single queue can be observed to persist for several seconds, only slightly less than the lifetime of the total standing queue, and with almost the same peak induced delay.

This may be referred to as the "consecutive bottlenecks" problem, in which queuing can occur at multiple nodes because it takes time for the first queue to drain into later ones after receiving a burst of excess traffic. Strictly speaking, this is not a failure of the interaction between L4S and FQ, but of the interaction between L4S and the AQM overlaid on the FQ, such that in certain circumstances the benefit of the FQ system is diluted.
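To make the isolation property concrete, here is a minimal sketch of plain Deficit Round Robin dequeuing (the scheduler at the core of DRR++ / RFC-8290, without the sparse-flow optimisation). The flow names, packet sizes, and quantum below are illustrative assumptions, not taken from the tests discussed here; the point is simply that a flow with a long standing queue cannot take more than its quantum per round, so other flows keep their share of service.

```python
# Minimal sketch of Deficit Round Robin dequeuing. Illustrative only:
# flow names, packet sizes and the quantum are made-up values.
from collections import deque

QUANTUM = 1514  # bytes of credit added to a flow's deficit on each visit

class Flow:
    def __init__(self, name):
        self.name = name
        self.packets = deque()   # queued packet sizes, in bytes
        self.deficit = 0

def dequeue_round(flows):
    """One DRR round: each backlogged flow may send up to its deficit."""
    sent = []
    for f in flows:
        if not f.packets:
            f.deficit = 0        # idle flows carry no credit forward
            continue
        f.deficit += QUANTUM
        while f.packets and f.packets[0] <= f.deficit:
            size = f.packets.popleft()
            f.deficit -= size
            sent.append((f.name, size))
    return sent

# A slow-to-respond flow with a long backlog still gets only one quantum
# per round, so the lightly loaded flow keeps its share of throughput.
prague = Flow("prague"); prague.packets.extend([1514] * 100)  # standing queue
cubic  = Flow("cubic");  cubic.packets.extend([1514] * 5)
for _ in range(3):
    print(dequeue_round([prague, cubic]))
```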

ietf-svn-bot commented 4 years ago

@pete@heistp.net commented


Test results for consecutive bottlenecks are provided in Scenario 5 of the SCE-L4S "bakeoff" tests.

ietf-svn-bot commented 4 years ago

@jholland@akamai.com commented


I think it may be useful to add direct links to the most interesting l4s plots:

https://www.heistp.net/downloads/sce-l4s-bakeoff/bakeoff-2019-09-13T045427-[1]/l4s-s5-1/batch-l4s-s5-1-prague-50Mbit-80ms_fixed.png

ietf-svn-bot commented 4 years ago

@pete@heistp.net commented


Thanks Jake. :) To answer the comment on the second plot: yes, that's the same result with variable scaling, such that the maximum TCP RTT value is 1/3 of the axis range. The fixed-scale plots are drawn so that the maximum of the RTT axis is 100 ms above the minimum, and are there for easier comparison between plots, whereas the variable-scale plots are guaranteed to show the maximum values.

ietf-svn-bot commented 4 years ago

@pete@heistp.net commented


After the L4S code update as of tag testing/5-11-2019, the TCP RTT spikes at flow start have been worked around, but they can still occur when TCP Prague is past slow-start exit and a competing flow starts, as in the following plot (see the green trace):

https://www.heistp.net/downloads/sce-l4s-bakeoff/bakeoff-2019-11-11T090559-[2]/l4s-s5-2/batch-l4s-s5-2-prague-vs-cubic-50Mbit-80ms_fixed.png

ietf-svn-bot commented 4 years ago

@g.white@cablelabs.com commented


As discussed on the mailing list and during the IETF 106 tsvwg session (https://datatracker.ietf.org/meeting/106/materials/slides-106-tsvwg-sessa-72-l4s-drafts-00), this issue was the result of a bug introduced in the refactoring of tcp_dctcp.cc into tcp_prague.cc on July 30, 2019. Once that bug was fixed, this issue was confirmed to be eliminated.

Based on this, I believe the consensus is that this issue can be closed.

Pete Heist's comment on Nov 11 points out a scenario in which a TCP Cubic flow starts up after a steady-state TCP Prague flow has been established. In this case, the TCP Prague "alpha" parameter has settled to a low value, and its cwnd is equal to the BDP (+5ms). When the Cubic flow starts up, the FQ scheduler more-or-less immediately cuts the available bandwidth for the TCP Prague flow to half its previous value (so the Prague cwnd is now approx. double what it should be), yet due to the slow evolution of the CoDel AQM control law, it takes several seconds for the queue to provide sufficient feedback to TCP Prague in order for it to reduce its cwnd. The result in the plot above is a period of 4 seconds in which the TCP Prague flow sees a larger queueing delay than would be desired.

This is a separate phenomenon that only affects the TCP Prague flow - other flows are unaffected. Also, this phenomenon can be eliminated if the CoDel AQM is upgraded to perform Immediate AQM for L4S flows.
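For intuition about why the Prague flow takes so long to get out of the way, here is a back-of-the-envelope sketch of the DCTCP-style control that TCP Prague inherits (RFC 8257: a per-RTT EWMA of the marked fraction, then cwnd reduced by a factor of alpha/2). The gain g, the starting alpha and cwnd, and the assumed marking fraction are illustrative assumptions, and the sketch ignores the additional delay from the CoDel control law ramping up its marking rate; it only shows that a settled-low alpha needs many RTTs to grow enough to halve cwnd.

```python
# Back-of-the-envelope sketch of a DCTCP-style response to a sudden need to
# halve the sending rate. All numbers are illustrative assumptions.
g = 1.0 / 16          # EWMA gain, the typical value from RFC 8257
alpha = 0.05          # alpha settled to a low value in steady state
cwnd = 100.0          # packets; roughly the old BDP before the Cubic flow starts

# Assume the AQM now marks ~30% of packets each RTT until Prague backs off.
mark_fraction = 0.30
for rtt in range(1, 21):
    alpha = (1 - g) * alpha + g * mark_fraction   # slow EWMA update
    cwnd *= (1 - alpha / 2)                       # per-RTT multiplicative decrease
    if cwnd <= 50:                                # target: half the old BDP
        print(f"cwnd halved only after {rtt} RTTs (alpha={alpha:.3f})")
        break
else:
    print(f"after 20 RTTs cwnd is still {cwnd:.1f} (alpha={alpha:.3f})")
```

Under these assumptions the halving takes on the order of ten RTTs; at an 80 ms base RTT, and combined with the CoDel ramp, a multi-second settling period of the kind seen in the plot is plausible.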

ietf-svn-bot commented 4 years ago

@chromatix99@gmail.com commented


Coming back to this after it was mentioned on the list…

It's important to remember that this behaviour is not "L4S traffic encountering a non-L4S-compliant AQM", as Comment 9 implies. Rather, it is an increasingly common configuration of an RFC-3168-compliant network being subjected to non-RFC-3168-compliant traffic. It is only because throughput fairness is enforced by a robust FQ component that the effect on the competing, RFC-compliant traffic is minimised.

In that context, I think the slow response of TCP Prague to an RFC-3168 AQM is something the L4S team should be more concerned about than they are, and that is why this issue was opened in the first place: to show that the problem is noticeable not only with single-queue AQMs but also with FQ. Even if the problem is contained to the TCP Prague flow, its latency performance is clearly impaired relative to conventional traffic, contrary to L4S's stated goal of improving latency.

ietf-svn-bot commented 4 years ago

@pete@heistp.net commented


Adding to this issue (since it is FQ related): hash collisions in FQ AQMs do occur, and the frequency with which they cause problems is governed by a few things, including the size of the hash table (1024 buckets by default for fq_codel), the number of flows in the queue, and the percentage mix of flow types. A calculation of the likelihood of problems is here:

https://docs.google.com/spreadsheets/d/1hOgTTZCKwR8f05Jjb3otJukh4gT6CKv-pnAJRqXEHLI/edit#gid=0
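For a rough sense of the numbers, the collision likelihood for a uniform hash is the classic birthday problem. The sketch below estimates the chance that at least two of n concurrent flows land in the same bucket of an m-bucket table (1024 by default for fq_codel); it is a simplification of the linked spreadsheet, assuming a uniform hash and not modelling the percentage mix of flow types that the spreadsheet also considers.

```python
# Rough birthday-problem estimate of hash collisions among concurrent flows.
# Assumes a uniform hash over the bucket space; illustrative only.
import math

def p_any_collision(n_flows: int, buckets: int = 1024) -> float:
    """Probability that at least one pair of flows shares a bucket."""
    p_none = math.prod((buckets - i) / buckets for i in range(n_flows))
    return 1.0 - p_none

for n in (2, 10, 30, 50, 100):
    print(f"{n:3d} flows -> P(collision) = {p_any_collision(n):.3f}")
```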