jwt27 opened this pull request 9 months ago
What this looks like in Wireshark, before/after (screenshots):
AFAICS, the poor receive speed of Watt-32 programs (vs. Winsock or Linux) is due to the lack of the SACK (Selective Acknowledgment) algorithm. And FACK (Forward Acknowledgment), but I'm not sure how common that is nowadays.
There is an option USE_TCP_SACK in config.h, but it's tricky to implement. There are only some dummy tcp_opt_sack*() functions now. But I've scraped together an old SACK implementation and added it here.
I read a bit about SACK before. Doesn't it only help for retransmitted segments (which should be rare)?
On that topic, I noticed something. If you let wget write to a file, the bottleneck then becomes DOS and the hard drive. A 64K receive window is then unreliable, since we only have 40 packet buffers (57K), and we end up dropping some. That looks like this:
Now we send a dup-ACK for every packet received after the gap. Limiting this to 3 would already save some time spent in TCP_SEND(). Then the peer starts retransmitting one segment at a time (and somehow we have a few more dup-ACKs here). At this point we can do fast-ACK to recover from retransmission more quickly. I added a commit to implement this:
BTW, if we use a giant buffer size in wget:
diff --git a/bin/WGET.182/retr.c b/bin/WGET.182/retr.c
index ebe1fe5..f78aad0 100644
--- a/bin/WGET.182/retr.c
+++ b/bin/WGET.182/retr.c
@@ -141,7 +141,7 @@ get_contents (int fd, FILE *fp, long *len, long restval, long expected,
#ifdef __LARGE__
static char c[2048];
#else
- static char c[8192];
+ static char c[131072];
#endif
void *progress = NULL;
struct wget_timer *timer = wtimer_allocate ();
We can get a better estimate of maximum possible throughput. The difference then becomes:
Before: ~5.3 MB/s (current master branch)
After: ~8.8 MB/s (this PR)
Which is pretty close to maximum link speed!
> Which is pretty close to maximum link speed!
Pretty impressive work there! 🥇
But all this is with a djgpp version of Wget on your LAN? Have you tested with a large-model build, or on Windows (with Watcom etc., if you have that), against a remote server? Like between Japan <--> Belgium.
Yes, this needs to be tested in different scenarios. I feel like I might be optimizing for one specific use case. On high-latency connections I expect it will fall apart, but haven't really tried yet.
I did try building wget with Watcom, but it complains about the makefile syntax. I'm really not familiar with the Watcom toolset at all; I'd appreciate some help here (if anyone else happens to be listening in on this).
Found a server on the other side of the planet (ping 290ms).
$ wget http://mirror.linux.org.au/debian/dists/unstable/main/Contents-source.gz -T 0 -O /dev/null
Windows (MSYS2): Constant 850 KB/s
Linux: Starts slow, but increases gradually to 10 MB/s (wow!)
Watt32 on djgpp (current master branch): Constant 65 KB/s
Watt32 on djgpp (this PR): Constant 40 KB/s
Oops :) But not entirely unexpected. Let's think about this some more...
> Windows (MSYS2): Constant 850 KB/s
What's your max WAN link-speed? I'm getting 5.09MB/s with wget/Windows (MSVC) on my 100 MBit/s WAN.
> Linux: Starts slow, but increases gradually to 10 MB/s (wow!)
Nice. Some tricks inside Linux that do this?
On 2nd attempt, Windows is similar to Linux:
I'm on a gigabit link, so no bottleneck there. Also I tested on Windows 7, maybe 10/11 has better TCP implementation (I should upgrade at some point).
For Linux I guess they have accurate round-trip timing to know when to send earlier window updates. But that's just a guess.
Looking at Wireshark on the Linux side:
It uses the TCP timestamp option and a window size of 3 MB. The window size always stays constant. ACKs are sent fairly late.
Also note the effect of TCP offloading, the network card bundles up multiple packets and presents them to the OS as one large block. We don't have this luxury here :)
> Also note the effect of TCP offloading,
If that includes checksum offloading: I've disabled that to see the calculated checksums in Wireshark etc. So the Windows speed is comparable to Linux AFAICS; it starts slow but increases gradually to 8 MB/s here.
> I'm really not familiar with the watcom toolset at all,
Wmake is really weird and ugly compared to GNU-make. I'll make a USE_WATCOM=1 target for Makefile.Windows.
Thinking out loud:
Problem is the delay between receiving the last packet and sending a window update. We can receive packets pretty quickly, but then it takes a while before we get to process them. In theory, sending ACKs faster with a reduced window size would allow the sender to adapt and send at a more constant rate. But that's not happening, so something on our end must be causing a fairly constant delay.
If I were to redesign TCP, I would add some sort of window pre-announce mechanism. So a sender could say with the first packet "I will be transmitting N full segments now". And the receiver immediately replies "after those N segments, my window size will be at least X". Then the sender can always keep going without ever having to wait for a window update.
> But that's not happening, so something on our end must be causing a fairly constant delay.
Does this have something to do with Nagle's Algorithm?
Hm, not likely. I do think the Nagle implementation is slightly wrong, but that is another topic.
I've done some more experimenting. If I remove the window-update threshold, the high-latency case improves to ~75 KB/s without hurting LAN too much (6.3 MB/s). Doing fast-ACK from the FSM, I can get 100 KB/s (or even 200 with a 64K window). But LAN really suffers.
What I think is happening then:
By the time we get to tcp_tick(), there may already be a backlog of 20+ packets. If you do fast-ACK there, then by the time you're done with the first set of packets, 20 more may have arrived (since TCP_SEND() is so slow, and the round-trip is very short). So you get stuck in tcp_tick() for a long time and are never able to clear the buffer.
To elaborate on the "slightly wrong" Nagle mode: while there is still unacked data, it's supposed to send only whole segments. But if we have a whole segment plus a few extra bytes, Watt32 will also send the partial one. A small difference, but it is "slightly wrong".
There is also a modification to Nagle's algorithm described here. I could take a look at that sometime.
Did you use the RDTSC or the 8254 timer as you mentioned in https://github.com/gvanem/Watt-32/issues/99?
Oh, now you're on to something. I had been using the default, which I thought should be rdtsc. I saw you can toggle it with the environment variable USE_RDTSC.
So I set USE_RDTSC=0. Doesn't seem to make much of a difference.
Then I enabled it again with USE_RDTSC=1. Now I see 11.3 MB/s on LAN somehow.
I'm very confused now. Need to figure out how this works.
This would all make sense if the timer code was using gettimeofday(), since that is a very slow function. So that could be the "fairly constant delay" I was talking about earlier.
I did some benchmarking:
#include <stdio.h>
#include <tcp.h>   /* Watt-32 */

/* Read the CPU time-stamp counter ("=A" = edx:eax on 32-bit djgpp). */
static uint64_t rdtsc (void)
{
  uint64_t count;
  asm volatile ("rdtsc" : "=A" (count));
  return count;
}

int main (void)
{
  uint64_t begin, end;

  init_misc();
  //hires_timer (0);

  begin = rdtsc();
  for (int i = 0; i < 100000; ++i)
      set_timeout (0);             /* the function under test */
  end = rdtsc();

  printf ("%llu\n", end - begin);
  return 0;
}
hires_timer(0): 3387089559 cycles (avg 45161ns per call)
hires_timer(1): 3387605858 cycles (avg 45168ns per call)
USE_RDTSC=1 : 27900545 cycles (avg 372ns per call)
This shows that the timer code is a serious bottleneck; set_timeout() etc. is used everywhere.
The default option, at least for djgpp, is equivalent to hires_timer(1), so 8254 enabled. It doesn't appear to have much effect here, because the 8254 code still calls libc time(), which is equally slow as gettimeofday().
RDTSC is decently fast on average, but it still calls gettimeofday() every now and then, so it will cause random delays.
Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program startup, and avoid all the slow time.h stuff from libc.
> Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program startup, and avoid all the slow time.h stuff from libc.
Agreed. I'll try to make a modified version of src/tests/ttime.c that compares the CPU-clocks between the various timer functions.
I'm experimenting with the delayed-ACK / window update code. Some radical changes here, so this needs proper testing to ensure that it helps in all scenarios. And there may be bugs still.
RFC 1122 says:
But in our case, the packet driver's "send" function is a major bottleneck. We should avoid sending small packets at all costs. The solution: remove fast-ACK altogether, delay as long as possible.
In addition, we should never advertise a receive window smaller than MSS. Ideally, keep it wide open. So the window should be updated only from tcp_read(). It makes no sense to do so from the FSM or the retransmitter; they can't solve the problem.
In this patch I also allocate a receive buffer that is 2x the window size. Not sure yet if it's necessary.
Things to check:
tcp_Socket.recv_win is only updated for in-order data. That is likely not correct.
Test procedure with the bundled wget:
And in wattcp.cfg:
tcp.recv_win = 65536
Results so far look promising:
Before: ~4.0 MB/s
After: ~6.4 MB/s