litespeedtech / lsquic

LiteSpeed QUIC and HTTP/3 Library

The purpose of ACK aggregation and how to disable it #223

Open guolf3 opened 3 years ago

guolf3 commented 3 years ago

Hi, I have been testing lsquic and found that it performs ACK aggregation by default. With the standard kernel TCP stack, one ACK normally acknowledges one or two data packets, but with ACK aggregation one ACK acknowledges several data packets; in my case, more than 10. Since ACK aggregation affects bandwidth estimation, for instance, I am wondering whether ACK aggregation is enabled on purpose, and whether there is a switch to disable it.

Also, I found that Cubic does not have a Hystart implementation. Is there any reason for excluding it? Thanks~

dtikhonov commented 3 years ago

Since ACK aggregation will affect the bandwidth estimation for instance

Not in QUIC, it won't. Have you observed performance deterioration due to ACK aggregation?

I am wondering if ACK aggregation is enabled on purpose?

This is not done on purpose. It is mostly due to the way lsquic is used. After incoming packets are processed and lsquic_engine_process_conns() is called, only one ACK is generated.

And if there is a switch to disable it.

One way to achieve what you want (whether it's wise is another question) is to call lsquic_engine_process_conns() after reading at most two incoming packets.
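
For illustration, a minimal sketch of such a read loop, assuming the application drives the engine itself; only lsquic_engine_packet_in() and lsquic_engine_process_conns() are actual lsquic calls, and the function name, buffer size, and batch limit are made up for this example:

```c
/* Sketch: feed at most two packets to the engine, then let it process
 * connections (and thus schedule ACKs).  Error handling, ECN, and
 * local-address retrieval are omitted for brevity. */
#include <sys/socket.h>
#include <lsquic.h>

#define MAX_PKTS_PER_BATCH 2    /* illustrative batch limit */

static void
read_small_batch (lsquic_engine_t *engine, int fd,
                  const struct sockaddr *local_sa, void *peer_ctx)
{
    unsigned char buf[1500];
    struct sockaddr_storage peer_sas;
    socklen_t peer_len;
    ssize_t nread;
    unsigned n;

    for (n = 0; n < MAX_PKTS_PER_BATCH; ++n)
    {
        peer_len = sizeof(peer_sas);
        nread = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT,
                            (struct sockaddr *) &peer_sas, &peer_len);
        if (nread < 0)
            break;      /* nothing more to read right now */
        (void) lsquic_engine_packet_in(engine, buf, (size_t) nread,
                    local_sa, (struct sockaddr *) &peer_sas,
                    peer_ctx, 0 /* ECN not wired up in this sketch */);
    }

    /* Processing after every batch of at most two packets means no more
     * than two ACK-eliciting packets accumulate before an ACK can go out. */
    lsquic_engine_process_conns(engine);
}
```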

I found Cubic does not have Hystart implementation, any reason for excluding it?

A more correct way to put it is "why have you not implemented it?" The answer is: we haven't gotten around to it yet, as the performance is good enough. We do like to take PRs! :slightly_smiling_face:

guolf3 commented 3 years ago

THX dtikhonov.

I was working on TCP congestion control. Based on what I know, ACK aggregation affects the way both Cubic and BBR probe for bandwidth. For instance, in Slow-Start (or BBR's Startup phase), the sending rate is doubled every RTT. Cubic relies on Hystart to exit Slow-Start safely, and BBR leaves the Startup phase when it sees no bandwidth increase over three consecutive round trips. However, with ACK aggregation, Cubic's Hystart does not perform correctly because the measured ACK train is much shorter; and by the time BBR has seen no bandwidth increase for three consecutive rounds, it has already overshot the link several times over. So I suspect excessive congestion losses will be an issue here, especially in shallow-buffer scenarios.
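
Roughly, the Startup-exit check I am referring to looks like this (a minimal sketch following the BBR draft; the names are illustrative and this is not lsquic's lsquic_bbr.c code):

```c
/* Illustrative sketch of BBR's "full pipe" detection: Startup is left once
 * the bandwidth estimate has failed to grow by at least 25% for three
 * consecutive round trips. */
#include <stdint.h>
#include <stdbool.h>

struct full_bw_check {
    uint64_t full_bw;       /* best bandwidth sample seen so far (bytes/sec) */
    unsigned full_bw_cnt;   /* rounds without sufficient growth              */
    bool     filled_pipe;   /* set when Startup should be exited             */
};

/* Call once per round trip with the latest bandwidth estimate. */
static void
check_full_bw_reached (struct full_bw_check *s, uint64_t bw_estimate)
{
    if (s->filled_pipe)
        return;
    if (bw_estimate >= s->full_bw + s->full_bw / 4)     /* grew by >= 25% */
    {
        s->full_bw = bw_estimate;
        s->full_bw_cnt = 0;
    }
    else if (++s->full_bw_cnt >= 3)     /* no growth for three rounds */
        s->filled_pipe = true;          /* exit Startup, enter Drain */
}
```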

But I recall that in Wi-Fi networks, where upstream and downstream share the same bandwidth, fewer ACKs can slightly increase data throughput, as the data packets can occupy more of the bandwidth.

To disable ACK aggregation, I looked into the source code and made two modifications. First, I modified the function "read_handler" so that it calls "prog_process_conns" in every iteration of its for loop; the original implementation calls "prog_process_conns" only once. Second, "read_handler" is a callback function whose trigger condition is "EV_READ|EV_PERSIST", so I added another timer so that "read_handler" is also invoked every 0.01 ms. The first modification seems to be effective, but still cannot achieve "one ACK acking two data packets", while the second one does not help. I guess "EV_READ" would need to fire on the reception of each data packet in order to achieve the goal, but I am not sure whether that can be done.

dtikhonov commented 3 years ago

Thanks for the explanation. We have not observed worse results due to ACK aggregation. We'll make a note to check it out next time we do performance testing.

Are you using IETF QUIC? This is the code that controls ACK queuing:

```c
static void
try_queueing_ack_app (struct ietf_full_conn *conn,
                    enum was_missing was_missing, int ecn, lsquic_time_t now)
{   
    lsquic_time_t srtt, ack_timeout;

    if (conn->ifc_n_slack_akbl[PNS_APP] >= conn->ifc_max_retx_since_last_ack
/* From [draft-ietf-quic-transport-29] Section 13.2.1:
 " Similarly, packets marked with the ECN Congestion Experienced (CE)
 " codepoint in the IP header SHOULD be acknowledged immediately, to
 " reduce the peer's response time to congestion events.
 */
            || (ecn == ECN_CE
                    && lsquic_send_ctl_ecn_turned_on(&conn->ifc_send_ctl))
            || (was_missing == WM_MAX_GAP)
            || ((conn->ifc_flags & IFC_ACK_HAD_MISS)
                    && was_missing == WM_SMALLER
                    && conn->ifc_n_slack_akbl[PNS_APP] > 0)
            || many_in_and_will_write(conn))
    {   
        lsquic_alarmset_unset(&conn->ifc_alset, AL_ACK_APP);
        lsquic_send_ctl_sanity_check(&conn->ifc_send_ctl);
        conn->ifc_flags |= IFC_ACK_QUED_APP;
        LSQ_DEBUG("%s ACK queued: ackable: %u; all: %u; had_miss: %d; "
            "was_missing: %d",
            lsquic_pns2str[PNS_APP], conn->ifc_n_slack_akbl[PNS_APP],
            conn->ifc_n_slack_all,
            !!(conn->ifc_flags & IFC_ACK_HAD_MISS), (int) was_missing);
    }
    else if (conn->ifc_n_slack_akbl[PNS_APP] > 0)
    {   
        if (!lsquic_alarmset_is_set(&conn->ifc_alset, AL_ACK_APP))
        {   
            /* See https://github.com/quicwg/base-drafts/issues/3304 for more */
            srtt = lsquic_rtt_stats_get_srtt(&conn->ifc_pub.rtt_stats);
            if (srtt)
                ack_timeout = MAX(1000, MIN(conn->ifc_max_ack_delay, srtt / 4));
            else
                ack_timeout = conn->ifc_max_ack_delay;
            lsquic_alarmset_set(&conn->ifc_alset, AL_ACK_APP,
                                                            now + ack_timeout);
            LSQ_DEBUG("%s ACK alarm set to %"PRIu64, lsquic_pns2str[PNS_APP],
                                                            now + ack_timeout);
        }
        else
            LSQ_DEBUG("%s ACK alarm already set to %"PRIu64" usec from now",
                lsquic_pns2str[PNS_APP],
                conn->ifc_alset.as_expiry[AL_ACK_APP] - now);
    }
}
```

The value of ifc_max_retx_since_last_ack is 2 by default, so as long as at least that many ACK-eliciting packets (that is, packets that carry more than just ACKs and padding) were read, an ACK would be scheduled. If you pass -l conn=debug you will see all the debug messages shown above. You can limit the number of packets read at a time to two to see what happens.

I don't believe calling read_handler() on a timer like that will do anything. The event loop has very few events on it, so there should be no competing events when a socket is readable.

guolf3 commented 3 years ago

Thanks, I got it now. Two steps to achieve one ACK per data packet:

  1. Change ifc_max_retx_since_last_ack from 2 to 1.
  2. Have prog_process_conns invoked in every iteration of the for loop inside read_handler (in the original implementation, prog_process_conns is invoked only once).

In addition, I also saw some valuable discussion of ACK aggregation at https://github.com/quicwg/base-drafts/issues/3304. I think it is still an open question at this point. I agree that once a flow has stabilized we may not need so many ACKs, so ACK aggregation can be a good choice for us, especially in high-speed networks, and the exit from Slow-Start or Startup can possibly serve as the turning point for that stabilization. But before that we can still consider more ACKs, say one ACK acking two data pkts.

dtikhonov commented 3 years ago

But before that we can still consider more ACKs, say one ACK acking two data pkts.

There is just no good way to do that, as reading of packets is not driven by the library at all. The code that reads from the socket does not know what state the corresponding connection is in.

You did not answer before:

Have you observed performance deterioration due to ACK aggregation?

Apropos, lsquic also implements the Delayed ACKs extension, which results in even fewer ACKs (it has to be negotiated, though).
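
For completeness, the extension can be turned off through the engine settings. A minimal sketch, assuming a build whose struct lsquic_engine_settings exposes the es_delayed_acks field (the wrapper function here is made up for illustration):

```c
/* Sketch: create a client-mode engine with the Delayed ACKs extension
 * disabled.  Assumes struct lsquic_engine_settings has es_delayed_acks. */
#include <lsquic.h>

static lsquic_engine_t *
make_engine_without_delayed_acks (const struct lsquic_engine_api *base_api)
{
    struct lsquic_engine_settings settings;
    struct lsquic_engine_api api;

    /* Start from the defaults for a client-mode engine. */
    lsquic_engine_init_settings(&settings, 0);
    settings.es_delayed_acks = 0;   /* do not negotiate the extension */

    api = *base_api;                /* keep the caller's callbacks */
    api.ea_settings = &settings;    /* settings are copied at engine creation */

    return lsquic_engine_new(0 /* client mode */, &api);
}
```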

guolf3 commented 3 years ago

In terms of performance deterioration, I did not comprehensively test lsquic in an emulated environment. However, in my previous evaluation of Linux kernel TCP Cubic, ACK aggregation did affect overall performance when the router's buffer was shallow. This is because Cubic cannot leave the Slow-Start phase (where the sending rate is doubled every RTT) as desired; Hystart was proposed to address this issue, but its design principle conflicts with ACK aggregation. This eventually leads to excessive congestion losses.
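
To make the conflict concrete, here is an illustrative sketch of Hystart's ACK-train heuristic (in the spirit of the Linux tcp_cubic implementation, but not the kernel's exact code or constants): the sender measures how long the train of closely spaced ACKs lasts in the current round and exits Slow-Start once that train spans a large fraction of the minimum RTT. With aggregated ACKs the measured train is much shorter, so the exit condition is rarely reached.

```c
/* Illustrative sketch of Hystart's ACK-train detection (not the kernel's
 * exact code).  Times are in microseconds. */
#include <stdint.h>
#include <stdbool.h>

#define ACK_DELTA_US    2000u   /* ACKs closer than this extend the train */

struct hystart_state {
    uint64_t round_start;   /* time the current ACK train started          */
    uint64_t last_ack;      /* arrival time of the previous ACK            */
    uint64_t min_rtt;       /* smallest RTT observed on the connection     */
    bool     found;         /* set when Slow-Start should be exited        */
};

/* Call on every incoming ACK while in Slow-Start. */
static void
hystart_on_ack (struct hystart_state *hs, uint64_t now)
{
    if (now - hs->last_ack <= ACK_DELTA_US)
    {
        /* The ACK train is still going; check how long it has lasted.
         * Exiting when the train covers about half the minimum RTT is the
         * classic heuristic; with aggregated ACKs the train is cut short
         * and this condition is rarely reached. */
        if (now - hs->round_start >= hs->min_rtt / 2)
            hs->found = true;
    }
    else
        hs->round_start = now;  /* gap too large: start a new train */
    hs->last_ack = now;
}
```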

BBR faces the same issue even without ACK aggregation, but ACK aggregation makes it worse in my opinion. Several papers have reported this issue (see https://www.ietf.org/proceedings/100/slides/slides-100-iccrg-a-quick-bbr-update-bbr-in-shallow-buffers-00.pdf), and BBRv2 in fact addresses this problem by adding a loss-rate threshold (2%) as a criterion for stopping the exponential increase of the sending rate (in BBR this phase is Startup; in other TCP variants it is Slow-Start).

Hence, I prefer not to do ACK aggregation during the initial bandwidth probing.

dtikhonov commented 3 years ago

We will look into this. Thank you for bringing it up.

MPK1 commented 2 years ago

I am currently evaluating the ACK frequency of lsquic by transmitting a single 8 GB file from server to client (over a 10G link) and counting the packets (mostly ACKs) sent by the client. It turned out that in this case, delayed ACKs do not influence the total number of packets sent by the client very much. However, the number of packets sent by the client fluctuates a lot when BBR is used. I performed 40 measurements, of which the first 20 used BBR for both client and server and the remaining 20 used CUBIC instead. Delayed ACKs were not turned off.

Measurement results:

![fig](https://user-images.githubusercontent.com/12881914/198640071-a57aef02-a509-4392-89a6-ab95f17516f6.png)

| Metric | CCA | Mean | Std |
|-------------------------|-------|-----------:|----------:|
| #packets client->server | BBR | 1252527.37 | 114012.53 |
| #packets client->server | CUBIC | 229630.81 | 10429.45 |
| #packets server->client | BBR | 5983411.89 | 234.20 |
| #packets server->client | CUBIC | 5996707.05 | 1708.80 |

While the lsquic client seems to send approximately one ACK for every 26 packets that arrive when the server uses CUBIC (5996707 / 229631 ≈ 26), it sends far more packets, roughly one per 5 arriving packets, when the server uses BBR (5983412 / 1252527 ≈ 4.8).

These results brought me here to investigate further how lsquic determines the ACK frequency and why there is such a big difference in ACK frequency between the two congestion control algorithms.

I would be very happy if someone knows more about how the ACK frequency is determined or influenced in lsquic, or even knows the reason for the observed difference in ACK frequency and the resulting performance difference.

Thank you!