Closed mpoke closed 4 years ago
In case the above behavior is indeed a bug, a simple fix could consist of the modification of the following lines in TcpConnection::sendData():
// send < MSS segments only if it's the only segment we can send now
// Note: if (bytesToSend == 1010 && MSS == 1012 && ts_enabled == true) => we may send
// 2 segments (1000 payload + 12 optionsHeader and 10 payload + 12 optionsHeader)
// FIXME this should probably obey Nagle's alg -- to be checked
// UPDATE: the following check should be against effectiveMaxBytesSend instead of state->snd_mss
if (bytesToSend <= effectiveMaxBytesSend) {
sendSegment(bytesToSend);
ASSERT(bytesToSend >= state->sentBytes);
bytesToSend -= state->sentBytes;
}
else { // send whole segments only (nagle_enabled)
while (bytesToSend >= effectiveMaxBytesSend) {
sendSegment(state->snd_mss);
ASSERT(bytesToSend >= state->sentBytes);
bytesToSend -= state->sentBytes;
}
}
// check how many bytes we have - last segment could be less than state->snd_mss
buffered = sendQueue->getBytesAvailable(state->snd_nxt);
// UPDATE: the last non-full segment should not be sent if Nagle's algorithm is enabled
if (!fullSegmentsOnly && bytesToSend == buffered && buffered != 0) // last segment?
sendSegment(bytesToSend);
else if (fullSegmentsOnly)
EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
else if (bytesToSend > 0)
EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
Thanks a lot for the report and submission. @ZoltanBojthe could you take a look and add the suggested fix? I
Currently looking at the issue. The working branch is topic/nagle-fix, your code suggestion being the first commit.
I added a test case (tests/module/tcp_nagle_3.test) for sending large messages, and the output still isn't convincing. In the message sequence starting at 2.001, the 3rd segment (7049:8001(952)
) is fractional and should not have been sent without waiting for an ACK / merged with subsequent data. There must be still some issue with the code.
The issue seems to be that for the 3rd segment, sendData()
is called with fullSegmentsOnly=false due to the following line in TcpBaseAlg::sendData(..)
:
bool fullSegmentsOnly = sendCommandInvoked && state->nagle_enabled && state->snd_una != state->snd_max;
The 3rd term (state->snd_una != state->snd_max
), which is supposed to mean "we have unacknowledged data", is false
.
The values of state->snd_una
and state->snd_max
are both 5001, while we just previously sent seqNo 7049. Apparently, state->snd_max
hasn't been updated at this point?
The issue actually appears earlier when A is sending 1928 bytes to B:
[1.207 B003] A.1000 < B.2000: A ack 3073 win 16384 <---- receive ACK for all already sent data
[1.209 A006] A.1000 > B.2000: A 3073:4097(1024) ack 501 win 16384 <---- send 1024 bytes that are not yet ACKed
[1.209 A007] A.1000 > B.2000: A 4097:5001(904) ack 501 win 16384 <---- FIXME not good!!! should have waited
The problem is that fullSegmentsOnly is set to false when there is no unacknowledged data or when sendCommandInvoked is false , i.e.,
bool TcpBaseAlg::sendData(bool sendCommandInvoked)
{
//
// Nagle's algorithm: when a TCP connection has outstanding data that has not
// yet been acknowledged, small segments cannot be sent until the outstanding
// data is acknowledged. (In this case, small amounts of data are collected
// by TCP and sent in a single segment.)
//
// FIXME there's also something like this: can still send if
// "b) a segment that can be sent is at least half the size of
// the largest window ever advertised by the receiver"
bool fullSegmentsOnly = sendCommandInvoked && state->nagle_enabled && state->snd_una != state->snd_max;
if (fullSegmentsOnly)
EV_INFO << "Nagle is enabled and there's unacked data: only full segments will be sent\n";
// ...
}
The fix should include the following modifications:
bool TcpConnection::sendData(bool fullSegmentsOnly, uint32 congestionWindow) {
// ...
uint32 effectiveMss = state->snd_mss - options_len;
// Nagle's algorithm: when a TCP connection has outstanding data that has not
// yet been acknowledged, small segments cannot be sent until the outstanding
// data is acknowledged.
bool unacknowledgedData = (state->snd_una != state->snd_max);
if (state->nagle_enabled && unacknowledgedData && bytesToSend < effectiveMss) {
EV_WARN << "Cannot send, not enough data for a full segment (SMSS=" << state->snd_mss
<< ", effectiveWindow=" << effectiveWin << ", bytesToSend=" << bytesToSend << ", in buffer " << buffered << ")\n";
return false;
}
// start sending 'bytesToSend' bytes
EV_INFO << "Will send " << bytesToSend << " bytes (effectiveWindow " << effectiveWin
<< ", in buffer " << buffered << " bytes)\n";
// ...
// send < MSS segments only if it's the only segment we can send now
// Note: if (bytesToSend == 1010 && MSS == 1012 && ts_enabled == true) => we may send
// 2 segments (1000 payload + 12 optionsHeader and 10 payload + 12 optionsHeader)
if (bytesToSend <= effectiveMss) {
sendSegment(bytesToSend);
ASSERT(bytesToSend >= state->sentBytes);
bytesToSend -= state->sentBytes;
}
else { // send whole segments only (nagle_enabled)
while (bytesToSend >= effectiveMss) {
sendSegment(state->snd_mss);
ASSERT(bytesToSend >= state->sentBytes);
bytesToSend -= state->sentBytes;
}
}
if (state->nagle_enabled) {
// If Nagle's algorithm is enabled, by this point at least one full segment was sent;
// these already sent segments are not ACKed yet.
// Also, cannot form a full segment with the remaining buffered data.
// Thus, do not send last segment
EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
}
else if (bytesToSend > 0) {
// Send last segment
ASSERT(bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt));
sendSegment(bytesToSend);
}
// remember highest seq sent (snd_nxt may be set back on retransmission,
// ...
}
Clearly this fix makes the fullSegmentsOnly parameter of TcpConnection::sendData() irrelevant. For consistency, the code should be updated in TcpBaseAlg::sendData(bool sendCommandInvoked), TcpNoCongestionControl::sendData(bool sendCommandInvoked) and TcpBaseAlg::receivedDuplicateAck(), e.g.,
// no longer needed
bool fullSegmentsOnly = state->nagle_enabled && state->snd_una != state->snd_max;
A second problem that I found is that the function for printing segment info, i.e., TcpConnection::printSegmentBrief(), is misleading when a segment is sent over IP, i.e.,
bool TcpConnection::sendData(bool fullSegmentsOnly, uint32 congestionWindow) {
// ...
uint32 effectiveMss = state->snd_mss - options_len;
// ...
while (bytesToSend >= effectiveMss) {
sendSegment(state->snd_mss);
// ...
}
// ...
}
void TcpConnection::sendSegment(uint32 bytes) {
// ...
// if needed, remove from MSS the space needed by the header options (i.e., timestamps)
if (bytes + options_len > state->snd_mss)
bytes = state->snd_mss - options_len;
state->sentBytes = bytes;
// ...
sendToIP(packet, tcpseg); // tcpseg is not yet added to the packet
}
void TcpConnection::sendToIP(Packet *packet, const Ptr<TcpHeader>& tcpseg) {
// ...
EV_INFO << "Sending: ";
printSegmentBrief(packet, tcpseg); // tcpseg is still not added to the packet
// ...
insertTransportProtocolHeader(packet, Protocol::tcp, tcpseg); // now is tcpseg added to the packet
// ...
}
void TcpConnection::printSegmentBrief(Packet *packet, const Ptr<const TcpHeader>& tcpseg) {
// ...
// when arriving here from sendToIp(), the packet does not contain the TCP header yet
auto payloadLength = packet->getByteLength() - B(tcpseg->getHeaderLength()).get();
if (payloadLength > 0 || tcpseg->getSynBit()) {
EV_INFO << "[" << tcpseg->getSequenceNo() << ".." << (tcpseg->getSequenceNo() + payloadLength) << ") ";
EV_INFO << "(l=" << payloadLength << ") ";
}
// ...
}
As a result, when MSS is 1024, the logging looks something like this:
Will send 2048 bytes (effectiveWindow 2048, in buffer 3976 bytes)
Sending: .1000 > .2000: [1025..2029) (l=1004) ack 501 win 16384
Sending: .1000 > .2000: [2049..3053) (l=1004) ack 501 win 16384
Starting REXMIT timer
[1.205 A004] A.1000 > B.2000: A 1025:2049(1024) ack 501 win 16384
[1.205 A005] A.1000 > B.2000: A 2049:3073(1024) ack 501 win 16384
On a related note, how can I set the logging level to DEBUG for the TCP module? The following does not work:
*.*_tcp.cmdenv-log-level=DEBUG
**.cmdenv-log-level=OFF
I only get these lines in the output:
Tcp connection created for (inet::Request)ActiveOPEN
Tcp connection created for (inet::Request)PassiveOPEN
The only way I managed to make it work was by setting the logging level to DEBUG for the entire simulation, i.e.,
**.cmdenv-log-level=DEBUG
Let me answer the logging question first. The Tcp
module is a bit tricky, because it dynamically creates submodules to handle TCP connections (one module per connection). Therefore you need one more line with an additional asterisk:
*.*_tcp.cmdenv-log-level=DETAIL
*.*_tcp.*.cmdenv-log-level=DETAIL
**.cmdenv-log-level=OFF
But often I just use Qtenv, which makes it easier to browse the logs.
Let me answer the logging question first. The
Tcp
module is a bit tricky, because it dynamically creates submodules to handle TCP connections (one module per connection). Therefore you need one more line with an additional asterisk:*.*_tcp.cmdenv-log-level=DETAIL *.*_tcp.*.cmdenv-log-level=DETAIL **.cmdenv-log-level=OFF
This works indeed. Thanks.
But often I just use Qtenv, which makes it easier to browse the logs.
I also use Qtenv quite a lot, but for larger and longer runs it's quite slow; I find it more convinient to print specific warning messages and grep for them in the log. :)
I also use Qtenv quite a lot, but for larger and longer runs it's quite slow
Would it help to have something like an Express mode (near-cmdenv speed) that keeps the logs for later browsing?
Thanks for the new patch! I pushed the new changes to topic/nagle-fix. The Nagle tests look good now.
Some other TCP tests fail now: tcp_stresstest_1.test tcp_stresstest_2.test tcp_stresstest_3.test tcp_stresstest_msgq_1.test tcp_timestamp_2.test. The first 4 fail because the total transmission time and the number of packets changed -- which is expected, but I'll look into the traces to make sure the changes are justifiable. The last one currently fails with:
ASSERT: Condition 'bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt)' does not hold in function 'sendData' at inet/transportlayer/tcp/TcpConnectionUtil.cc:937
I also use Qtenv quite a lot, but for larger and longer runs it's quite slow
Would it help to have something like an Express mode (near-cmdenv speed) that keeps the logs for later browsing?
That could be a nice feature, although I got quite used to using the command line.
Some other TCP tests fail now: tcp_stresstest_1.test tcp_stresstest_2.test tcp_stresstest_3.test tcp_stresstest_msgq_1.test tcp_timestamp_2.test. The first 4 fail because the total transmission time and the number of packets changed -- which is expected, but I'll look into the traces to make sure the changes are justifiable. The last one currently fails with:
ASSERT: Condition 'bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt)' does not hold in function 'sendData' at inet/transportlayer/tcp/TcpConnectionUtil.cc:937
The last test fails due to an assert that I added without thinking it through, i.e.,
// in TcpConnection::sendData()
else if (bytesToSend > 0) {
// Send last segment
// [UPDATE] the following assertion is not needed:
// - bytesToSend is the amount of bytes still needed to be sent to fill the effective window
// - sendQueue->getBytesAvailable(state->snd_nxt) is the amount of bytes buffered;
// not all of them can be sent due to the window restriction
//
// ASSERT(bytesToSend == sendQueue->getBytesAvailable(state->snd_nxt));
sendSegment(bytesToSend);
}
In tcp_timestamp_2.test, before the ASSERT not holding, the amount of available bytes to be sent is 2500. However, the effective window is 512, which sets bytesToSend=512
. After sending an entire segment (note that MSS=512
), due to the timestamp option the TCP header increases with 12B, thus only 500B are sent. As a result, bytesToSend=12
and there are still 2000B buffered sendQueue->getBytesAvailable(state->snd_nxt)=2000
. Clearly, the ASSERT fails.
I would have guessed the expected behavior is to send the remaining 12B in another segment (since Nagle is disabled), regardless of the amount of bytes still buffered. Yet, in the original code, i.e.,
// check how many bytes we have - last segment could be less than state->snd_mss
buffered = sendQueue->getBytesAvailable(state->snd_nxt);
if (bytesToSend == buffered && buffered != 0) // last segment?
sendSegment(bytesToSend);
else if (bytesToSend > 0)
EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window). There probably is a reason for this decision; I don't really know since I'm not a TCP expert. The following changes will keep this behavior:
buffered = sendQueue->getBytesAvailable(state->snd_nxt);
if (state->nagle_enabled) {
// If Nagle's algorithm is enabled, by this point at least one full segment was sent;
// these already sent segments are not ACKed yet.
// Also, cannot form a full segment with the remaining buffered data.
// Thus, do not send last segment
EV_DETAIL << "Cannot send last segment, not enough data for a full segment\n";
}
else if (bytesToSend == buffered && buffered != 0) // last segment?
// Send last segment
sendSegment(bytesToSend);
}
else if (bytesToSend > 0)
EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
Thank you, I came to a similar conclusion in the meantime. Your change fixes it.
Btw, I find the logic in sendData() a little hard to follow and perhaps unnecessarily complicated at places, I'll see if I can do some refactoring to simplify it.
I cleaned up the topic branch a little, and added some refactoring commits (they don't break any of the module/tcp_* tests). Please have a look -- if it looks good to you as well, I'll put it into master.
I cleaned up the topic branch a little, and added some refactoring commits (they don't break any of the module/tcp_* tests). Please have a look -- if it looks good to you as well, I'll put it into master.
Thank you. It looks good.
The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)? i.e.,
else if (bytesToSend == buffered && buffered != 0) // last segment?
// Send last segment
sendSegment(bytesToSend);
}
else if (bytesToSend > 0)
EV_DETAIL << bytesToSend << " bytes of space left in effectiveWindow\n";
Thanks for checking the code!
The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)?
If there's several segments' worth of data buffered but effectiveWindow is smaller than MSS, sending a fractional segment immediately would result in sending the total data using possibly one more segment than necessary. So it sacrifices a little latency in favor of bandwidth usage. However, I'm not sure where this rule comes from, I'll try to find it in the RFCs...
The only question that I still have is why the last segment is not sent if the amount of bytes still buffered is larger (even if there is space in the window)?
So far I could not find any definitive statement in the RFCs about this. Actually, until recently, the INET code contained (a slightly broken variant of) the "send all" version too, using conditional compilation (#ifdef TCP_SENDFRAGMENTS), disabled by default.
However, I found pcap traces of TCP file transfers (e.g. tcp-ethereal-file1.trace on https://wiki.wireshark.org/SampleCaptures), which contain several instances of a burst of full-size segments followed by a fractional one. This points to the direction that you are right, and sending these fractional segments should be allowed.
Found one more bug: Nagle's alg should not be used if the segment contains FIN. (Rule also confirmed by the code in tcp_output.c in Linux.)
Andras, thanks for you help.
I have noticed a weird behavior of the INET TCP implementation when the Nagle algorithm is enabled. As expected, if there is unacknowledged data in the buffer, TCP will not allow segments smaller than MSS to be sent out. However, if the data to be sent is larger than MSS, then TCP will send out all of the data even if the last segment is smaller than MSS, i.e., in the TcpConnection::sendData() method of the TCP implementation of INET there is a comment that states that the "last segment could be less than state->snd_mss". This behavior is unexpected, since it means that by sending large messages the effects of the Nagle algorithm are circumvented.
I also evaluated the simulation of a simple application against a deployment on a real system. When sending messages smaller than MSS, the results match. However, for messages larger than MSS, the results of the real deployment stay the same, while in the simulation the message exchange is considerable faster (since TCP is not waiting for ACKs before sending segments that are not full). This behavior can also be visible in the following portion of the logs:
Is this a bug or am I missing something?