inet-framework / inet

INET Framework for the OMNeT++ discrete event simulator
https://inet.omnetpp.org
Other
431 stars 483 forks source link

INET 4.2.0: TCP Nagle's Algorithm #558

Closed mpoke closed 4 years ago

mpoke commented 4 years ago

I have noticed a weird behavior of the INET TCP implementation when the Nagle algorithm is enabled. As expected, if there is unacknowledged data in the buffer, TCP will not allow segments smaller than MSS to be sent out. However, if the data to be sent is larger than MSS, then TCP will send out all of the data even if the last segment is smaller than MSS, i.e., in the TcpConnection::sendData() method of the TCP implementation of INET there is a comment that states that the "last segment could be less than state->snd_mss". This behavior is unexpected, since it means that by sending large messages the effects of the Nagle algorithm are circumvented.

I also evaluated the simulation of a simple application against a deployment on a real system. When sending messages smaller than MSS, the results match. However, for messages larger than MSS, the results of the real deployment stay the same, while in the simulation the message exchange is considerable faster (since TCP is not waiting for ACKs before sending segments that are not full). This behavior can also be visible in the following portion of the logs:

[DETAIL]    Connection 10.0.0.1:1028 to 14.0.0.1:9000  on socketId=13  in ESTABLISHED
[INFO]    App command: SEND
[DETAIL]    4096 bytes in queue, plus 2048 bytes unacknowledged
[INFO]    Nagle is enabled and there's unacked data: only full segments will be sent
[INFO]    Tcp Header Option TS(TSval=3081, TSecr=3071) sent
[INFO]    Will send 2048 bytes (effectiveWindow 6712, in buffer 2048 bytes)
[INFO]    Tcp Header Option TS(TSval=3081, TSecr=3071) sent
[INFO]    rexmitQ: [254098..256146) enqueueSentData [256146..257594)
[INFO]    Sending: .1028 > .9000: [256146..257562) (l=1416) ack 258647 win 65535 options NOP NOP TS
[INFO]    Tcp Header Option TS(TSval=3081, TSecr=3071) sent
[INFO]    rexmitQ: [254098..257594) enqueueSentData [257594..258194)
[INFO]    Sending: .1028 > .9000: [257594..258162) (l=568) ack 258647 win 65535 options NOP NOP TS
[DETAIL]    Staying in state: ESTABLISHED (event was: SEND)

Is this a bug or am I missing something?

mpoke commented 4 years ago

In case the above behavior is indeed a bug, a simple fix could consist of the modification of the following lines in TcpConnection::sendData():

  // send < MSS segments only if it's the only segment we can send now
  // Note: if (bytesToSend == 1010 && MSS == 1012 && ts_enabled == true) => we may send
  // 2 segments (1000 payload + 12 optionsHeader and 10 payload + 12 optionsHeader)
  // FIXME this should probably obey Nagle's alg -- to be checked
// UPDATE: the following check should be against effectiveMaxBytesSend instead of state->snd_mss
if (bytesToSend <= effectiveMaxBytesSend) {
    sendSegment(bytesToSend);
    ASSERT(bytesToSend >= state->sentBytes);
    bytesToSend -= state->sentBytes;
}
else {    // send whole segments only (nagle_enabled)
    while (bytesToSend >= effectiveMaxBytesSend) {
        sendSegment(state->snd_mss);
        ASSERT(bytesToSend >= state->sentBytes);
        bytesToSend -= state->sentBytes;
    }
}

// check how many bytes we have - last segment could be less than state->snd_mss
buffered = sendQueue->getBytesAvailable(state->snd_nxt);

// UPDATE: the last non-full segment should not be sent if Nagle's algorithm is enabled
if (!fullSegmentsOnly && bytesToSend == buffered && buffered != 0) // last segment?
    sendSegment(bytesToSend);
else if (fullSegmentsOnly) 
    EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
else if (bytesToSend > 0)
    EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
rhornig commented 4 years ago

Thanks a lot for the report and submission. @ZoltanBojthe could you take a look and add the suggested fix? I

avarga commented 4 years ago

Currently looking at the issue. The working branch is topic/nagle-fix, your code suggestion being the first commit.

I added a test case (tests/module/tcp_nagle_3.test) for sending large messages, and the output still isn't convincing. In the message sequence starting at 2.001, the 3rd segment (7049:8001(952)) is fractional and should not have been sent without waiting for an ACK / merged with subsequent data. There must be still some issue with the code.

avarga commented 4 years ago

The issue seems to be that for the 3rd segment, sendData() is called with fullSegmentsOnly=false due to the following line in TcpBaseAlg::sendData(..):

bool fullSegmentsOnly = sendCommandInvoked && state->nagle_enabled && state->snd_una != state->snd_max;

The 3rd term (state->snd_una != state->snd_max), which is supposed to mean "we have unacknowledged data", is false. The values of state->snd_una and state->snd_max are both 5001, while we just previously sent seqNo 7049. Apparently, state->snd_max hasn't been updated at this point?

mpoke commented 4 years ago

The issue actually appears earlier when A is sending 1928 bytes to B:

[1.207 B003] A.1000 < B.2000: A ack 3073 win 16384 <---- receive ACK for all already sent data
[1.209 A006] A.1000 > B.2000: A 3073:4097(1024) ack 501 win 16384 <---- send 1024 bytes that are not yet ACKed
[1.209 A007] A.1000 > B.2000: A 4097:5001(904) ack 501 win 16384 <---- FIXME not good!!! should have waited

The problem is that fullSegmentsOnly is set to false when there is no unacknowledged data or when sendCommandInvoked is false , i.e.,

bool TcpBaseAlg::sendData(bool sendCommandInvoked)
{
    //
    // Nagle's algorithm: when a TCP connection has outstanding data that has not
    // yet been acknowledged, small segments cannot be sent until the outstanding
    // data is acknowledged. (In this case, small amounts of data are collected
    // by TCP and sent in a single segment.)
    //
    // FIXME there's also something like this: can still send if
    // "b) a segment that can be sent is at least half the size of
    // the largest window ever advertised by the receiver"

    bool fullSegmentsOnly = sendCommandInvoked && state->nagle_enabled && state->snd_una != state->snd_max;

    if (fullSegmentsOnly)
        EV_INFO << "Nagle is enabled and there's unacked data: only full segments will be sent\n";
    // ...
}

The fix should include the following modifications:

bool TcpConnection::sendData(bool fullSegmentsOnly, uint32 congestionWindow) {
    // ...
    uint32 effectiveMss = state->snd_mss - options_len;

    // Nagle's algorithm: when a TCP connection has outstanding data that has not
    // yet been acknowledged, small segments cannot be sent until the outstanding
    // data is acknowledged.
    bool unacknowledgedData = (state->snd_una != state->snd_max);
    if (state->nagle_enabled && unacknowledgedData && bytesToSend < effectiveMss) {
        EV_WARN << "Cannot send, not enough data for a full segment (SMSS=" << state->snd_mss
                << ", effectiveWindow=" << effectiveWin << ", bytesToSend=" << bytesToSend << ", in buffer " << buffered << ")\n";
        return false;
    }

    // start sending 'bytesToSend' bytes
    EV_INFO << "Will send " << bytesToSend << " bytes (effectiveWindow " << effectiveWin
            << ", in buffer " << buffered << " bytes)\n";

    // ...
    // send < MSS segments only if it's the only segment we can send now
    // Note: if (bytesToSend == 1010 && MSS == 1012 && ts_enabled == true) => we may send
    // 2 segments (1000 payload + 12 optionsHeader and 10 payload + 12 optionsHeader)
    if (bytesToSend <= effectiveMss) {
        sendSegment(bytesToSend);
        ASSERT(bytesToSend >= state->sentBytes);
        bytesToSend -= state->sentBytes;
    }
    else {    // send whole segments only (nagle_enabled)
        while (bytesToSend >= effectiveMss) {
            sendSegment(state->snd_mss);
            ASSERT(bytesToSend >= state->sentBytes);
            bytesToSend -= state->sentBytes;
        }
    }

    if (state->nagle_enabled) {
        // If Nagle's algorithm is enabled, by this point at least one full segment was sent;
        // these already sent segments are not ACKed yet.
        // Also, cannot form a full segment with the remaining buffered data.
        // Thus, do not send last segment
        EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
    }
    else if (bytesToSend > 0) {
        // Send last segment
        ASSERT(bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt));
        sendSegment(bytesToSend);
    }

    // remember highest seq sent (snd_nxt may be set back on retransmission,
    // ...
}

Clearly this fix makes the fullSegmentsOnly parameter of TcpConnection::sendData() irrelevant. For consistency, the code should be updated in TcpBaseAlg::sendData(bool sendCommandInvoked), TcpNoCongestionControl::sendData(bool sendCommandInvoked) and TcpBaseAlg::receivedDuplicateAck(), e.g.,

// no longer needed
bool fullSegmentsOnly = state->nagle_enabled && state->snd_una != state->snd_max;

A second problem that I found is that the function for printing segment info, i.e., TcpConnection::printSegmentBrief(), is misleading when a segment is sent over IP, i.e.,

bool TcpConnection::sendData(bool fullSegmentsOnly, uint32 congestionWindow) {
  // ...
  uint32 effectiveMss = state->snd_mss - options_len;
  // ...
  while (bytesToSend >= effectiveMss) {
    sendSegment(state->snd_mss);
    // ...
  }
// ...
}

void TcpConnection::sendSegment(uint32 bytes) {
  // ...
  // if needed, remove from MSS the space needed by the header options (i.e., timestamps)
  if (bytes + options_len > state->snd_mss)
    bytes = state->snd_mss - options_len;
  state->sentBytes = bytes;
  // ...
  sendToIP(packet, tcpseg); // tcpseg is not yet added to the packet
}

void TcpConnection::sendToIP(Packet *packet, const Ptr<TcpHeader>& tcpseg) {
  // ...
  EV_INFO << "Sending: ";
  printSegmentBrief(packet, tcpseg); // tcpseg is still not added to the packet
  // ...
  insertTransportProtocolHeader(packet, Protocol::tcp, tcpseg); // now is tcpseg added to the packet
  // ...
}

void TcpConnection::printSegmentBrief(Packet *packet, const Ptr<const TcpHeader>& tcpseg) {
  // ... 
  // when arriving here from sendToIp(), the packet does not contain the TCP header yet
  auto payloadLength = packet->getByteLength() - B(tcpseg->getHeaderLength()).get();
  if (payloadLength > 0 || tcpseg->getSynBit()) {
    EV_INFO << "[" << tcpseg->getSequenceNo() << ".." << (tcpseg->getSequenceNo() + payloadLength) << ") ";
    EV_INFO << "(l=" << payloadLength << ") ";
  }
  // ...
}

As a result, when MSS is 1024, the logging looks something like this:

Will send 2048 bytes (effectiveWindow 2048, in buffer 3976 bytes)
Sending: .1000 > .2000: [1025..2029) (l=1004) ack 501 win 16384 
Sending: .1000 > .2000: [2049..3053) (l=1004) ack 501 win 16384 
Starting REXMIT timer
[1.205 A004] A.1000 > B.2000: A 1025:2049(1024) ack 501 win 16384 
[1.205 A005] A.1000 > B.2000: A 2049:3073(1024) ack 501 win 16384

On a related note, how can I set the logging level to DEBUG for the TCP module? The following does not work:

*.*_tcp.cmdenv-log-level=DEBUG
**.cmdenv-log-level=OFF

I only get these lines in the output:

Tcp connection created for (inet::Request)ActiveOPEN
Tcp connection created for (inet::Request)PassiveOPEN

The only way I managed to make it work was by setting the logging level to DEBUG for the entire simulation, i.e.,

**.cmdenv-log-level=DEBUG
avarga commented 4 years ago

Let me answer the logging question first. The Tcp module is a bit tricky, because it dynamically creates submodules to handle TCP connections (one module per connection). Therefore you need one more line with an additional asterisk:

  *.*_tcp.cmdenv-log-level=DETAIL
  *.*_tcp.*.cmdenv-log-level=DETAIL
  **.cmdenv-log-level=OFF

But often I just use Qtenv, which makes it easier to browse the logs.

mpoke commented 4 years ago

Let me answer the logging question first. The Tcp module is a bit tricky, because it dynamically creates submodules to handle TCP connections (one module per connection). Therefore you need one more line with an additional asterisk:

  *.*_tcp.cmdenv-log-level=DETAIL
  *.*_tcp.*.cmdenv-log-level=DETAIL
  **.cmdenv-log-level=OFF

This works indeed. Thanks.

But often I just use Qtenv, which makes it easier to browse the logs.

I also use Qtenv quite a lot, but for larger and longer runs it's quite slow; I find it more convinient to print specific warning messages and grep for them in the log. :)

avarga commented 4 years ago

I also use Qtenv quite a lot, but for larger and longer runs it's quite slow

Would it help to have something like an Express mode (near-cmdenv speed) that keeps the logs for later browsing?

Thanks for the new patch! I pushed the new changes to topic/nagle-fix. The Nagle tests look good now.

Some other TCP tests fail now: tcp_stresstest_1.test tcp_stresstest_2.test tcp_stresstest_3.test tcp_stresstest_msgq_1.test tcp_timestamp_2.test. The first 4 fail because the total transmission time and the number of packets changed -- which is expected, but I'll look into the traces to make sure the changes are justifiable. The last one currently fails with: ASSERT: Condition 'bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt)' does not hold in function 'sendData' at inet/transportlayer/tcp/TcpConnectionUtil.cc:937

mpoke commented 4 years ago

I also use Qtenv quite a lot, but for larger and longer runs it's quite slow

Would it help to have something like an Express mode (near-cmdenv speed) that keeps the logs for later browsing?

That could be a nice feature, although I got quite used to using the command line.

Some other TCP tests fail now: tcp_stresstest_1.test tcp_stresstest_2.test tcp_stresstest_3.test tcp_stresstest_msgq_1.test tcp_timestamp_2.test. The first 4 fail because the total transmission time and the number of packets changed -- which is expected, but I'll look into the traces to make sure the changes are justifiable. The last one currently fails with: ASSERT: Condition 'bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt)' does not hold in function 'sendData' at inet/transportlayer/tcp/TcpConnectionUtil.cc:937

The last test fails due to an assert that I added without thinking it through, i.e.,

// in TcpConnection::sendData()

    else if (bytesToSend > 0) {
        // Send last segment
        // [UPDATE] the following assertion is not needed:
        //  - bytesToSend is the amount of bytes still needed to be sent to fill the effective window
        //  - sendQueue->getBytesAvailable(state->snd_nxt) is the amount of bytes buffered;
        //    not all of them can be sent due to the window restriction
        //
        // ASSERT(bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt));
        sendSegment(bytesToSend);
    }

In tcp_timestamp_2.test, before the ASSERT not holding, the amount of available bytes to be sent is 2500. However, the effective window is 512, which sets bytesToSend=512. After sending an entire segment (note that MSS=512), due to the timestamp option the TCP header increases with 12B, thus only 500B are sent. As a result, bytesToSend=12 and there are still 2000B buffered sendQueue->getBytesAvailable(state->snd_nxt)=2000. Clearly, the ASSERT fails.

I would have guessed the expected behavior is to send the remaining 12B in another segment (since Nagle is disabled), regardless of the amount of bytes still buffered. Yet, in the original code, i.e.,

    // check how many bytes we have - last segment could be less than state->snd_mss
    buffered = sendQueue->getBytesAvailable(state->snd_nxt);

    if (bytesToSend == buffered && buffered != 0) // last segment?
        sendSegment(bytesToSend);
    else if (bytesToSend > 0)
        EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";

the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window). There probably is a reason for this decision; I don't really know since I'm not a TCP expert. The following changes will keep this behavior:

    buffered = sendQueue->getBytesAvailable(state->snd_nxt);

    if (state->nagle_enabled) {
        // If Nagle's algorithm is enabled, by this point at least one full segment was sent;
        // these already sent segments are not ACKed yet.
        // Also, cannot form a full segment with the remaining buffered data.
        // Thus, do not send last segment
        EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
    }
    else if (bytesToSend == buffered && buffered != 0) // last segment?
        // Send last segment
        sendSegment(bytesToSend);
    }
    else if (bytesToSend > 0)
        EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
avarga commented 4 years ago

Thank you, I came to a similar conclusion in the meantime. Your change fixes it.

Btw, I find the logic in sendData() a little hard to follow and perhaps unnecessarily complicated at places, I'll see if I can do some refactoring to simplify it.

avarga commented 4 years ago

I cleaned up the topic branch a little, and added some refactoring commits (they don't break any of the module/tcp_* tests). Please have a look -- if it looks good to you as well, I'll put it into master.

mpoke commented 4 years ago

I cleaned up the topic branch a little, and added some refactoring commits (they don't break any of the module/tcp_* tests). Please have a look -- if it looks good to you as well, I'll put it into master.

Thank you. It looks good.

The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)? i.e.,

    else if (bytesToSend == buffered && buffered != 0) // last segment?
        // Send last segment
        sendSegment(bytesToSend);
    }
    else if (bytesToSend > 0)
        EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
avarga commented 4 years ago

Thanks for checking the code!

The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)?

If there's several segments' worth of data buffered but effectiveWindow is smaller than MSS, sending a fractional segment immediately would result in sending the total data using possibly one more segment than necessary. So it sacrifices a little latency in favor of bandwidth usage. However, I'm not sure where this rule comes from, I'll try to find it in the RFCs...

avarga commented 4 years ago

The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)?

So far I could not find any definitive statement in the RFCs about this. Actually, until recently, the INET code contained (a slightly broken variant of) the "send all" version too, using conditional compilation (#ifdef TCP_SENDFRAGMENTS), disabled by default.

However, I found pcap traces of TCP file transfers (e.g. tcp-ethereal-file1.trace on https://wiki.wireshark.org/SampleCaptures), which contain several instances of a burst of full-size segments followed by a fractional one. This points to the direction that you are right, and sending these fractional segments should be allowed.

avarga commented 4 years ago

Found one more bug: Nagle's alg should not be used if the segment contains FIN. (Rule also confirmed by the code in tcp_output.c in Linux.)

mpoke commented 4 years ago

Andras, thanks for you help.