gvanem / Watt-32

Watt-32 TCP/IP library and samples.
https://www.watt-32.net/

Optimize window update logic #108

Open jwt27 opened 9 months ago

jwt27 commented 9 months ago

I'm experimenting with the delayed-ACK / window update code. Some radical changes here, so this needs proper testing to ensure that it helps in all scenarios. And there may be bugs still.

RFC 1122 says:

A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment.

But in our case, the packet driver's "send" function is a major bottleneck. We should avoid sending small packets at all costs. The solution: remove fast-ACK altogether, delay as long as possible.

In addition, we should never advertise a receive window smaller than MSS. Ideally, keep it wide open. So, the window should be updated only from tcp_read(). It makes no sense to do so from FSM or retransmitter - they can't solve the problem.

In this patch I also allocate a receive buffer that is 2x the window size. Not sure yet if it's necessary.
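Roughly the policy I'm aiming for, in C (all struct, field and helper names here are made up for illustration; they are not the actual Watt-32 internals):

#include <stdint.h>

#define MSS 1460

struct sock_sketch {
    uint32_t recv_win;      /* configured receive window size        */
    uint32_t buf_free;      /* free space in the receive buffer      */
    uint32_t last_adv_win;  /* window we last advertised to the peer */
};

/* Called only from tcp_read(), after the application has drained data.
 * Never advertise a window smaller than one MSS: if we can't offer at
 * least that much, advertise zero and announce the window again once
 * the application frees buffer space.
 */
static uint32_t advertised_window (const struct sock_sketch *s)
{
    uint32_t win = (s->buf_free < s->recv_win) ? s->buf_free : s->recv_win;
    return (win >= MSS) ? win : 0;
}

/* ACK policy with fast-ACK removed: hold the ACK until either the
 * RFC 1122 0.5-second limit is near, or the window has grown by at
 * least one MSS and is worth announcing.
 */
static int should_send_ack (const struct sock_sketch *s, uint32_t ack_age_ms)
{
    if (ack_age_ms >= 400)   /* stay safely under the 500 ms MUST */
        return 1;
    return advertised_window(s) >= s->last_adv_win + MSS;
}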

Things to check:


Test procedure with the bundled wget:

$ python3 -m http.server 8000 --bind 10.0.0.10
> wget-dj.exe 10.0.0.10:8000/hugefile.dat -T 0 -O /dev/null

And in wattcp.cfg: tcp.recv_win = 65536

Results so far look promising:

Before: ~4.0 MB/s
After:  ~6.4 MB/s

jwt27 commented 9 months ago

What this looks like in Wireshark, before and after: [screenshots]

gvanem commented 9 months ago

Results so far look promising:

Before: ~4.0 MB/s
After:  ~6.4 MB/s

AFAICS, the lack of receive-speed in Watt-32 programs (vs. Winsock or Linux) is due to the lack of a SACK (Selective Acknowledgment) algorithm.

And FACK (Forward Acknowledgement), but I'm not sure how common that is nowadays.

There is an option USE_TCP_SACK in config.h, but it's tricky to implement. There are only some dummy tcp_opt_sack*() functions now. But I've scraped together an old SACK implementation and added it here.
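For reference, the wire format from RFC 2018 is simple; the hard part is the scoreboard bookkeeping on the send side. A minimal, self-contained parsing sketch (not the code from that old implementation):

#include <stdint.h>

struct sack_block {
    uint32_t left;    /* first sequence number of the block   */
    uint32_t right;   /* sequence number just past the block  */
};

/* Sequence numbers on the wire are big-endian. */
static uint32_t get_be32 (const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse a SACK option (kind 5, length 2 + 8*n) into at most 'max'
 * blocks. Returns the number of blocks, or -1 if malformed.
 */
static int parse_sack (const uint8_t *opt, struct sack_block *blk, int max)
{
    int len, n, i;

    if (opt[0] != 5)                      /* TCPOPT_SACK */
        return -1;
    len = opt[1];
    if (len < 10 || (len - 2) % 8 != 0)   /* 1..4 blocks */
        return -1;
    n = (len - 2) / 8;
    if (n > max)
        n = max;
    for (i = 0; i < n; i++)
    {
        blk[i].left  = get_be32(opt + 2 + i*8);
        blk[i].right = get_be32(opt + 2 + i*8 + 4);
    }
    return n;
}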

jwt27 commented 9 months ago

I read a bit about SACK before. Doesn't it only help for retransmitted segments (which should be rare)?

On that topic, I noticed something. If you let wget write to a file, the bottleneck then becomes DOS and the hard drive. A 64K receive window is then unreliable since we only have 40 packet buffers (57K), and we end up dropping some. That looks like this:

[screenshot]

Now we send a dup-ACK for every packet received after the gap. Limiting this to 3 would already save some time spent in TCP_SEND(). Then the peer starts retransmitting one segment at a time (and somehow we have a few more dup-ACKs here). At this point we can do fast-ACK to recover from retransmission more quickly. I added a commit to implement this:

[screenshot]
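The logic of that commit boils down to something like this (names invented for the sketch; real code must of course compare sequence numbers modulo 2^32):

#include <stdint.h>

#define DUP_ACK_LIMIT 3   /* enough to trigger the peer's fast retransmit */

struct rx_state {
    uint32_t rcv_next;       /* next in-order sequence number expected */
    int      dup_acks_sent;  /* dup-ACKs sent for the current hole     */
    int      recovering;     /* a hole exists in the receive queue     */
};

/* Decide whether to ACK immediately when a segment with sequence
 * number seg_seq arrives. Returns 1 to ACK now, 0 to delay.
 */
static int ack_now (struct rx_state *s, uint32_t seg_seq)
{
    if (seg_seq > s->rcv_next)              /* out of order: hole ahead */
    {
        s->recovering = 1;
        if (s->dup_acks_sent >= DUP_ACK_LIMIT)
            return 0;                       /* don't flood TCP_SEND()   */
        s->dup_acks_sent++;
        return 1;
    }
    if (s->recovering)                      /* hole is being filled:    */
    {                                       /* fast-ACK each retransmit */
        s->dup_acks_sent = 0;               /* so recovery ends quickly */
        return 1;                           /* ('recovering' is cleared */
    }                                       /* once the queue is in     */
    return 0;                               /* order again)             */
}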

jwt27 commented 9 months ago

BTW, if we use a giant buffer size in wget:

diff --git a/bin/WGET.182/retr.c b/bin/WGET.182/retr.c
index ebe1fe5..f78aad0 100644
--- a/bin/WGET.182/retr.c
+++ b/bin/WGET.182/retr.c
@@ -141,7 +141,7 @@ get_contents (int fd, FILE *fp, long *len, long restval, long expected,
 #ifdef __LARGE__
   static char c[2048];
 #else
-  static char c[8192];
+  static char c[131072];
 #endif
   void *progress = NULL;
   struct wget_timer *timer = wtimer_allocate ();

We can get a better estimate of maximum possible throughput. The difference then becomes:

Before: ~5.3 MB/s (current master branch)
After:  ~8.8 MB/s (this PR)

Which is pretty close to maximum link speed!

gvanem commented 9 months ago

Which is pretty close to maximum link speed!

Pretty impressive work there! 🥇

But all this is with a djgpp version of Wget on your LAN? Have you tested a large-model or Windows build (with Watcom etc., if you have that) against a remote server? Like between Japan <--> Belgium.

jwt27 commented 9 months ago

Yes, this needs to be tested in different scenarios. I feel like I might be optimizing for one specific use case. On high-latency connections I expect it will fall apart, but haven't really tried yet.

I did try building wget with Watcom, but it complains about the makefile syntax. I'm really not familiar with the Watcom toolset at all, so I'd appreciate some help here (if anyone else happens to be listening in on this).

jwt27 commented 9 months ago

Found a server on the other side of the planet (ping 290ms).

$ wget http://mirror.linux.org.au/debian/dists/unstable/main/Contents-source.gz -T 0 -O /dev/null

Windows (MSYS2): constant 850 KB/s
Linux: starts slow, but increases gradually to 10 MB/s (wow!)
Watt32 on djgpp (current master branch): constant 65 KB/s
Watt32 on djgpp (this PR): constant 40 KB/s

Oops :) But not entirely unexpected. Let's think about this some more...

gvanem commented 9 months ago

Windows (MSYS2): Constant 850 KB/s

What's your max WAN link-speed? I'm getting 5.09MB/s with wget/Windows (MSVC) on my 100 MBit/s WAN.

Linux: Starts slow, but increases gradually to 10 MB/s (wow!)

Nice. Are there some tricks inside Linux that do this?

gvanem commented 9 months ago

On a 2nd attempt, Windows is similar to Linux: [screenshot: wget-speed]

jwt27 commented 9 months ago

I'm on a gigabit link, so no bottleneck there. Also, I tested on Windows 7; maybe 10/11 has a better TCP implementation (I should upgrade at some point).

For Linux I guess they have accurate round-trip timing to know when to send earlier window updates. But that's just a guess.

jwt27 commented 9 months ago

Looking at Wireshark on the Linux side:

[screenshot]

It uses the TCP timestamp option and a window size of 3 MB (which implies window scaling, since the raw header field is only 16 bits). The window size always stays constant. ACKs are sent fairly late.

Also note the effect of TCP offloading, the network card bundles up multiple packets and presents them to the OS as one large block. We don't have this luxury here :)

gvanem commented 9 months ago

Also note the effect of TCP offloading,

If that includes checksum offloading, I've disabled it to see the calculated checksums in Wireshark etc. So the Windows speed is comparable to Linux AFAICS; it starts slow but increases gradually to 8 MB/s here.

I'm really not familiar with the watcom toolset at all,

Wmake is really weird and ugly compared to GNU make. I'll make a USE_WATCOM=1 target for Makefile.Windows.

jwt27 commented 9 months ago

Thinking out loud:

The problem is the delay between receiving the last packet and sending a window update. We can receive packets pretty quickly, but then it takes a while before we get around to processing them. In theory, sending ACKs faster with a reduced window size would allow the sender to adapt and send at a more constant rate. But that's not happening, so something on our end must be causing a fairly constant delay.

If I were to redesign TCP, I would add some sort of window pre-announce mechanism. So a sender could say with the first packet "I will be transmitting N full segments now". And the receiver immediately replies "after those N segments, my window size will be at least X". Then the sender can always keep going without ever having to wait for a window update.
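On the wire it could look something like this (entirely hypothetical; no such TCP option exists):

#include <stdint.h>

/* Sender: "I will be transmitting N full segments now." */
struct tcp_opt_preannounce {
    uint8_t  kind;        /* hypothetical, unassigned option number */
    uint8_t  len;         /* 6                                      */
    uint32_t segments;    /* N: full-sized segments about to follow */
};

/* Receiver: "After those N segments, my window will be at least X." */
struct tcp_opt_prewindow {
    uint8_t  kind;        /* hypothetical reply option              */
    uint8_t  len;         /* 6                                      */
    uint32_t min_window;  /* X: guaranteed window after N segments  */
};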

gvanem commented 9 months ago

But that's not happening, so something on our end must be causing a fairly constant delay.

Does this have something to do with Nagle's Algorithm?

jwt27 commented 9 months ago

Hm, not likely. I do think the Nagle implementation is slightly wrong, but that is another topic.

I've done some more experimenting. If I remove the window update threshold, the high-latency case improves to ~75 KB/s without hurting LAN too much (6.3 MB/s). Doing fast-ACK from the FSM, I can get 100 KB/s (or even 200 with a 64K window). But LAN really suffers.

What I think is happening then: By the time we get to tcp_tick(), there may already be a backlog of 20+ packets. If you do fast-ACK there, then by the time you're done with the first set of packets, 20 more may have arrived (since TCP_SEND() is so slow, and round-trip is very short). So you get stuck in tcp_tick() for a long time and are never able to clear the buffer.
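A toy model of that feedback loop (the costs are made-up numbers, purely for illustration):

#include <stdio.h>

#define ARRIVAL_US 120  /* a full-size frame every ~120 us at 100 Mbit/s */
#define SEND_US    500  /* assumed cost of one TCP_SEND() call           */

int main (void)
{
    long backlog = 20;  /* queue depth when tcp_tick() finally runs */
    int  i;

    for (i = 1; i <= 10 && backlog > 0; i++)
    {
        backlog--;                         /* process one packet: cheap */
        backlog += SEND_US / ARRIVAL_US;   /* fast-ACK: packets arriving
                                              while TCP_SEND() blocks   */
        printf("after packet %d: backlog %ld\n", i, backlog);
    }
    /* The backlog grows by 3 every iteration: the loop never drains. */
    return 0;
}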

jwt27 commented 9 months ago

To elaborate on the "slightly wrong" Nagle mode: while there is still unacked data, it's supposed to send only whole segments. But if we have a whole segment plus a few bytes, Watt32 will also send the partial one. A small difference, but it is "slightly wrong".
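In code, the difference looks roughly like this (MSS fixed and names simplified for the sketch):

#include <stdint.h>

#define MSS 1460

/* Correct Nagle: with unacked data outstanding, send whole segments
 * only; hold the partial tail back until everything is acked or it
 * grows into a full MSS.
 */
static uint32_t nagle_send_len (uint32_t queued, int unacked)
{
    if (!unacked)
        return queued;                /* idle connection: send it all */
    return queued - (queued % MSS);   /* whole segments only          */
}

/* What Watt32 currently does, per the above: once at least one full
 * segment is queued, the partial tail is sent along with it.
 */
static uint32_t watt32_send_len (uint32_t queued, int unacked)
{
    if (!unacked || queued >= MSS)
        return queued;                /* full segment(s) + partial    */
    return 0;                         /* only a partial: hold it      */
}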

There is also a modification to Nagle's algorithm described here. Could take a look at that sometime.

gvanem commented 9 months ago

Did you use the RDTSC or the 8254 timer as you mentioned in https://github.com/gvanem/Watt-32/issues/99?

jwt27 commented 8 months ago

Oh, now you're on to something. I had been using the default, which I thought was rdtsc. I see you can toggle it with the environment variable USE_RDTSC.

So I set USE_RDTSC=0. That doesn't seem to make much of a difference. Then I enabled it again with USE_RDTSC=1, and now I see 11.3 MB/s on LAN somehow.

I'm very confused now. Need to figure out how this works.

jwt27 commented 8 months ago

This would all make sense if the timer code were using gettimeofday(), since that is a very slow function. So that could be the "fairly constant delay" I was talking about earlier.

jwt27 commented 8 months ago

I did some benchmarking:

#include <stdio.h>
#include <stdint.h>
#include <tcp.h>

/* Read the CPU timestamp counter ("=A" loads EDX:EAX on 32-bit x86). */
static uint64_t rdtsc (void)
{
  uint64_t count;
  asm volatile ("rdtsc" : "=A" (count));
  return count;
}

int main (void)
{
  uint64_t begin, end;

  init_misc();
  //hires_timer(0);

  /* Time 100000 calls of set_timeout() in raw CPU cycles. */
  begin = rdtsc();
  for (int i = 0; i < 100000; ++i)
  {
    set_timeout(0);
  }
  end = rdtsc();

  printf("%llu\n", end - begin);

  return 0;
}

hires_timer(0):  3387089559 cycles (avg 45161ns per call)
hires_timer(1):  3387605858 cycles (avg 45168ns per call)
USE_RDTSC=1   :    27900545 cycles (avg   372ns per call)

This shows that the timer code is a serious bottleneck - set_timeout() etc. is used everywhere.

The default option, at least for djgpp, is equivalent to hires_timer(1), i.e. with the 8254 enabled. It doesn't appear to have much effect here, because the 8254 code still calls libc time(), which is just as slow as gettimeofday().

RDTSC is decently fast on average, but it still calls gettimeofday() every now and then, so it will cause random delays.

Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program startup, and avoid all the slow time.h stuff from libc.
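A minimal sketch of that idea for djgpp: calibrate RDTSC against clock() exactly once at startup, then never touch the slow libc time functions again (the usable granularity of clock() on the target is an assumption here):

#include <stdio.h>
#include <stdint.h>
#include <time.h>    /* used only once, during calibration */

static uint64_t tsc_start;
static uint64_t tsc_per_ms;

static uint64_t rdtsc64 (void)
{
    uint64_t count;
    asm volatile ("rdtsc" : "=A" (count));
    return count;
}

/* Calibrate: count CPU cycles across one clock() tick. */
static void timer_init (void)
{
    clock_t  t;
    uint64_t c0, c1;

    t = clock();
    while (clock() == t)   /* sync to a tick edge */
        ;
    c0 = rdtsc64();
    t = clock();
    while (clock() == t)   /* wait one full tick  */
        ;
    c1 = rdtsc64();

    /* One tick is 1000/CLOCKS_PER_SEC milliseconds. */
    tsc_per_ms = (c1 - c0) * CLOCKS_PER_SEC / 1000;
    tsc_start  = c1;
}

/* Cheap monotonic timestamp: milliseconds since timer_init(). */
static uint32_t millisec_now (void)
{
    return (uint32_t)((rdtsc64() - tsc_start) / tsc_per_ms);
}

int main (void)
{
    timer_init();
    printf("calibrated: %llu cycles/ms, now: %lu ms\n",
           (unsigned long long)tsc_per_ms, (unsigned long)millisec_now());
    return 0;
}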

gvanem commented 8 months ago

Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program startup, and avoid all the slow time.h stuff from libc.

Agreed. I'll try to make a modified version of src/tests/ttime.c that compares the CPU-clocks between various timer-functions.