Closed Mellvik closed 4 years ago
Here is my take on the status of TCP/IP. I'm not sure I'd use "Stable", but instead "Buggy":
Serial IP (SLIP/CSLIP): Working @ moderate baud rates, needs serial HW flow control to get to the next level. SLIP - Not buggy; CSLIP - Not working at all. Both dependent on serial driver, not TCP/IP for higher baud rates. I can run at 38400 on all systems using fast serial driver, and 115200 in my faster box using regular driver. Agreed serial HW flow control likely to help, but that's really quite separate from TCPIP and unclear as to how much throughput will actually increase, given max serial speed on old boxes even with flow control.
Ethernet Link level NE2k: Stable - AGREE, needs more testing since you're the only one running it.
ICMP, ARP, TCP: Stable - ICMP, ARP: Not buggy. TCP - Stable, but extremely buggy on lost packets and closing sockets. No sliding window implemented, and unable to accept large window (or packets) since currently the only real flow control TCP has is to drop a packet when the upstream isn't ready, or doesn't have buffer space to accept data.
Telnet/PTYs: Not stable - No, both stable and working well. ELKS PTYs operate differently than Linux PTYs: they need a process on either side of them to pump data in and out. On Linux, only one process is needed and the kernel pumps the data across, so that slows things down on ELKS.
The ELKS httpserver can deliver any file and file size reliably. - Have only tested receiving data from ELKS.
urlget is reliable for incoming data. - Haven't tested it.
urlget is also ftpget, which currently does not work but 'in the works'. - Unknown, can you give a bug report?
Stable TCP-level communication enables file transfers (#697) to/from ELKS, and it's time to get transfer speeds from 'working' to 'useable'. Currently, transfer speed from ELKS (IDE harddisk) via HTTP is 25.5KB/s, packet size 140 bytes (Compaq Portable 386/20, ISA16 NE2k card). For incoming file transfer (to /dev/null) the advertised window size is 255, effective packet size is 295, speed 36KB/s. We may be able to improve both by an order of magnitude, but half of that is enough to make for a very useable system.
Sadly, I see no easy way to increase incoming packet size without a lot of work. The TCPDEV (not TCP/IP, but the /dev/tcpdev that transfers data between kernel and ktcp) currently only allows 100 bytes per transfer into the kernel. We have very limited buffer space per connection (currently 4096 bytes in each direction in ktcp) without running out of space in ktcp when lots of connections are open. Adding more buffer space just prolongs the crash/resync issue when ktcp finally has to drop a packet.
The problem is that the data has to go through the network, through ktcp, into buffers, into the kernel, back out of the kernel from the first half of a PTY to a daemon process, then (if interactive) back into the kernel from there to another PTY, then to a receiving application; if that chain gets backed up, ktcp has to drop a packet. The whole chain has to keep up with the incoming flow or degradation occurs. Sliding window just adds more received packets into more buffers, but the chain still has to process the received data. Without fixing the TCP/IP packet loss issues, then writing lots of code to implement sliding window, then looking into TCPDEV, I can't see how to get either the packet size up or the kernel throughput increased.
Currently, there's another system design problem with PTYs: a PTY buffer needs to be bigger than any received data, or the system will hang waiting for I/O in the opposite direction (this could be fixed, it's on my long list). I don't even know whether the smaller packet size is the bottleneck: it could easily be the numerous memcpy's all over the place, or simply that slower system speed is the basic problem.
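To make the flow-control behavior described above concrete, here is a minimal sketch in C. The names (struct conn, rx_segment, RX_BUF_SIZE) are hypothetical and this is not the actual ktcp source; it only illustrates the one mechanism mentioned - drop the packet when the per-connection buffer can't take it, and let the peer retransmit.

/* Hypothetical sketch, not ktcp code: the only flow control described
 * above is "drop the packet when there is no room for it". */
#include <string.h>

#define RX_BUF_SIZE 4096            /* per-direction buffer size mentioned above */

struct conn {
    unsigned char rx_buf[RX_BUF_SIZE];
    unsigned int  rx_used;          /* bytes waiting to be pushed through /dev/tcpdev */
};

/* Returns 1 if the segment was queued, 0 if it had to be dropped. */
static int rx_segment(struct conn *c, const unsigned char *data, unsigned int len)
{
    if (c->rx_used + len > RX_BUF_SIZE)
        return 0;                   /* no buffer space: drop, forcing a retransmit */
    memcpy(c->rx_buf + c->rx_used, data, len);
    c->rx_used += len;
    return 1;
}

Everything downstream (tcpdev, the PTY, the daemon process) has to drain that buffer fast enough; otherwise this is the point where data gets discarded.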
@ghaerr - I'm assuming there are some low threshold improvements that can be made immediately, any hints as to where to start?
You probably won't like this answer, but I can't think of any immediate improvements, or believe me, I would have done it already!!!
Perhaps now would be a time to port other file transfer programs and/or start using the (slower) networking we have working in more ways. We also need to fix the close issues, as they crash ktcp on occasion.
I was going to port rlogin, rlogind and rcp to ELKS but realized they're not present on any hosts so not sure that's worth the trouble. Then thought about porting tftp but ELKS TCP/IP doesn't support UDP.
We might want to describe some file transfer scenarios (both automated and manual) that would be big improvements to what we now have, and write or port that code.
OK @ghaerr, It looks like what I call stable, you call 'not buggy' - that's fine with me. And thanks for the details about the 'inner challenges'. A few more questions.
Agreed serial HW flow control likely to help, but that's really quite separate from TCP/IP and unclear as to how much throughput will actually increase, given max serial speed on old boxes even with flow control.
OK; I'm unsure how important this is (i.e. what priority it should get): implementing HW flow control in ELKS is not a big deal, but would it help, since most of the stuff we interface with these days has no such thing, just RX, TX & GND? HW flow control will help between old systems - how likely is that? I guess the one who has the itch drives the work...
Ethernet Link level NE2k: Stable - AGREE, needs more testing since you're the only one running it.
Indeed, more testers, more testing.
ICMP, ARP, TCP: Stable - ICMP, ARP: Not buggy. TCP - Stable, but extremely buggy on lost packets and closing sockets.
So it's the same as ETHER: needs more testing. That's what I'm doing - and it's going surprisingly well. More below.
Telnet/PTYs: Not stable - No, both stable and working well.
OK; I misunderstood one of your comments in the discussion with MFLD (#744).
ELKS PTYs operate differently than Linux PTYs: they need a process on either side of them to pump data in and out. On Linux, only one process is needed and the kernel pumps the data across, so that slows things down on ELKS.
Is this something that should be fixed (the single vs dual process architecture)?
The ELKS httpserver can deliver any file and file size reliably. - Have only tested receiving data from ELKS.
That's the only functionality it has, so that's what I'm saying - 'can deliver any file of any size reliably'. Which means we have a reliable (and scriptable) outgoing transport. Just last weekend I transferred an entire HD image out of the ELKS box (512MB) in order to repartition (Partition Magic won't work w/o VGA graphics). I will move it back using urlget.
urlget is reliable for incoming data. - Haven't tested it.
As pointed out before, I have tested it extensively; it's fine. I'm adding a progress indicator and some other minor necessities.
urlget is also ftpget, which currently does not work but 'in the works'. - Unknown, can you give a bug report?
I'll get back to you on that. Right now it just hangs, but ktcp is fine when the process is killed, so the first hunch is that it is not network related. Having FTP available in a command line tool would be great.
The TCPDEV (not TCP/IP, but the /dev/tcpdev that transfers data between kernel and ktcp) only allows 100 bytes transferred into the kernel, currently.
Is this something that can be improved using - on demand - your heap_alloc() routine?
We have very limited buffer space per connection (current 4096 bytes each direction in ktcp) without running out of space in ktcp when having lots of connections open.
How many is 'lots of connections'? Do we need that now?
The whole chain has to keep up with the incoming flow or degradation occurs.
I guess the complexity of this brings up the user process vs kernel stack question again. Is the balance between the two changing with the ongoing work on kernel far text segments?
You probably won't like this answer, but I can't think of any immediate improvements, or believe me, I would have done it already!!!
My thinking is that now that we have some stability (ahem, bug-freeness) and physical ethernet support, the debugging and tuning should be much easier than before, since we can see the traffic down to the final bit if we like. It's my take that what we have is an impressive feat given the environment in which it was developed.
Perhaps now would be a time to port other file transfer programs and/or start using the (slower) networking we have working in more ways. We also need to fix the close issues, as they crash ktcp on occasion.
My suggestion is to start with what we have - which is working (http), add ftp (incoming) soon and take it from there.
I was going to port rlogin, rlogind and rcp to ELKS but realized they're not present on any hosts so not sure that's worth the trouble. Then thought about porting tftp but ELKS TCP/IP doesn't support UDP.
If we're sure this is the IP stack we're going for, I'd say adding UDP is a priority.
We might want to describe some file transfer scenarios (both automated and manual) that would be big improvements to what we now have, and write or port that code.
That was the purpose of my note in #697.
—Mellvik
Is this something that should be fixed (the single vs dual process architecture)?
No, the PTY code is fine, just has a few issues we need to think about. There are many other priorities in TCP that should come first.
Just last weekend transferred an entire HD image out of the ELKS box (512MB)
Wow, how long did that take?!?!
Is this something that can be improved using - on demand - your heap_alloc() routine?
No, too complicated. But I've added increasing that size to my todo list as something to consider. I'm not sure yet it's the bottleneck.
I guess the complexity of this brings up the user process vs kernel stack again.
You mean user vs kernel process? I don't think there's much chance of rewriting all the networking stuff and putting it in the kernel - way too much work at this point anyway.
Is the balance between the two changing with the ongoing work on kernel far text segments?
I am hoping so, not for networking, but for making space for all our other improvements. Last night I worked on that. We are currently at ONLY 6k bytes free kernel code space. And moving all the kernel init routines into a far text segment only gave us 3.4k bytes. And that kernel won't boot, likely due to compiler and other issues. So it's going to be a long road... and every procedure has to be individually labeled at this point. No easy answers yet, so tighten our kernel code belts!
Just last weekend transferred an entire HD image out of the ELKS box (512MB)
Wow, how long did that take?!?!
I did report the speed - 27.5KB/s. Somewhat better for incoming, but then that was w/o any local I/O (/dev/null). Anyway, 512MB at that rate adds up to a BIG number (well over five hours) - and a BIG incentive to look for speedups.
OK, I have a real incentive now to put some effort into speedups, so if you can think of something to test let me know.
In the meantime I'll fix ftpget, improve urlget and take a look at implementing PUT in the httpd.
—Mellvik
We have very limited buffer space per connection (current 4096 bytes each direction in ktcp) without running out of space in ktcp when having lots of connections open.
How many is 'lots of connections'? Do we need that now?
Well, actually only two TCP connections... but when running just two simultaneous ELKS telnet connections to localhost, we pretty much max out ELKS: 15 processes (1 left to run ps), 4 PTYs, 2 shells, 2 telnets (shared code), 2 telnetd's (forked), maxed out heap (prior to lowering the kernel stack size to 512), plus ktcp and the regular processes. This is the test setup I want to use in testing the improved heap allocator in #744, by ending and restarting the 2nd telnet localhost connection, which will stress the near heap with the large and small PTY allocations when nearly full.
— Point taken.
I'd love to replace the localhost part with physical - when you're ready.
-- M
Since the previous post on this thread, there have been too many PRs, improvements, test reports and discussions to count. The 'slow but working' status has changed to 'stable and quite fast' - in particular on the TCP level. Here are some test results using the latest build (4512f614) with minor changes to the ne2k driver:
[386/20]
In order to get a metric on system capability vs. network speed, the expansion chassis containing the network card was moved to a 286/12.5MHz machine and the tests repeated (modified/fast driver). Both the 286 and the 386 have a cycle-to-instruction ratio of approx. 4.5, so the performance difference in real mode should roughly track the clock ratio, about 1.6 (20MHz / 12.5MHz). Adding the effects of other HW improvements, the expected difference should be about 1:2.
The 286 test is interesting beyond the numbers because the slow speed pushes robustness real hard. Overruns and retransmits all the time, and here's another cork-popper: it is now really hard to crash ktcp. Concurrent file transfers, several telnets active both ways, retransmits & overruns en masse, and we're still running. Flood-pinging the 286 system with large packets eventually sent ktcp into a loop that hung the system. I'm working on creating a repeatable scenario for this; it may be hard to reproduce on a faster system.
There is one other situation that produced errors, this time on the 386 with the faster version of the driver: having a curl file transfer going, then telnetting into ELKS and back out (so we have a double telnet in effect) causes occasional character loss. Again, I'm working on creating a repeatable scenario. Given the stability of the other testing, it seems unlikely that this is TCP-level... ?
All in all, fantastic improvement - and IMHO time to close this issue.
--mellvik
Hi @Mellvik,
Thanks for your report. Agreed, all-in-all, great improvements seen. I wasn't able to work on most of these issues until I got my real hardware NE2K card working, thanks to you.
I have seen very few system crashes, none repeatable. I did see one outbound ELKS-to-Linux curl transfer end up creating a 23Mb file whose size was proper but contents not correct. Unfortunately I deleted that and don't have it for analysis. So there seems to be a possibility where TCP fails without error, which is concerning. I would suggest further high-error transfers and diff or cmp comparison to see if we can get that to repeat.
There are a couple other high-priority problems with ktcp which still occur regularly:
Lower priority:
The ELKS issues with disk I/O and subsequent loss of TCP/IP speed are not fixed easily, and are the result of using the synchronous BIOS calls rather than interrupt-driven I/O, as well as just slow floppies. This problem is already discussed in #521.
All in all, fantastic improvement - and IMHO time to close this issue.
Agreed, lets close this and report new issues when they arise.
Thank you!
I have seen very few system crashes, none repeatable. I did see one outbound ELKS-to-Linux curl transfer end up creating a 23Mb file whose size was proper but contents not correct. Unfortunately I deleted that and don't have it for analysis. So there seems to be a possibility where TCP fails without error, which is concerning. I would suggest further high-error transfers and diff or cmp comparison to see if we can get that to repeat.
Good point - I'll put that at the top of my list, along with the telnet losing data.
There are a couple other high-priority problems with ktcp which still occur regularly: Connecting telnet quickly again after disconnect gives a "SYN sent, wrong ACK" error. This is because one half of the TCP connection is still open (timing out after 4 seconds), as shown by netstat. The disconnecting telnet needs to shut down the whole connection properly to fix this. I will work on this next.
Yes, I noticed one while testing ^P with DEBUG_TCP this morning. It seems like there is a 10 second timeout (or maybe 10 rounds of the .5 sec loop); maybe this is useful:
tcpdev: got close from ELKS process
TCP: send src:23 dst:44230 flags:11 seq:f856bb44 ack:a4047abd win:255 urg:0 chk:0 len:20 rtt 0 RTT 5 RTO 10
ktcp: update 0,1 expire state 5 expire state 1 expire state 1
TCP: recv src:44230 dst:23 flags:11 seq:a4047abd ack:f856bb45 win:29200 urg:0 chk:0 len:20 TS_FIN_WAIT_1
TCP: send src:23 dst:44230 flags:10 seq:f856bb45 ack:a4047abe win:255 urg:0 chk:0 len:20
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 tcp: REMOVING control block expire state 1 expire state 1
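(Side note for reading the trace: the exchange above looks like a standard TCP active close - our FIN+ACK (flags 0x11), the peer's FIN+ACK back, our final bare ACK (flags 0x10), then a linger until the control block is removed. The sketch below uses hypothetical state and helper names and is not the ktcp implementation; it only maps those steps onto the usual state machine.)

/* Hypothetical sketch of a standard active close, not ktcp code. */
enum tcp_state { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSED };

/* Local side closes: send FIN (seen above as the flags:11 segment). */
static enum tcp_state active_close(void (*send_fin)(void))
{
    send_fin();
    return FIN_WAIT_1;
}

/* Handle a segment received while closing. */
static enum tcp_state close_input(enum tcp_state s, int fin, int ack,
                                  void (*send_ack)(void))
{
    switch (s) {
    case FIN_WAIT_1:
        if (ack && fin) {       /* peer ACKed our FIN and sent its own FIN */
            send_ack();         /* the final flags:10 segment in the trace */
            return TIME_WAIT;   /* linger, then the control block is removed */
        }
        if (ack)
            return FIN_WAIT_2;  /* our FIN ACKed, still waiting for peer's FIN */
        return s;
    case FIN_WAIT_2:
        if (fin) {
            send_ack();
            return TIME_WAIT;
        }
        return s;
    default:
        return s;
    }
}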
The ELKS issues with disk I/O and subsequent loss of TCP/IP speed are not fixed easily, and are the result of using the synchronous BIOS calls rather than interrupt-driven I/O, as well as just slow floppies. This problem is already discussed in #521 https://github.com/jbruchon/elks/issues/521.
Yes, I do realize that - and while on the subject, I propose we keep the hd major devices for a while - I hope to do some testing with the direct drivers to compare, not so much the speed but the interrupt blocking. Eventually.
BTW - I did run across a new problem this morning: I opened a tcp connection from ELKS (ftpget) with the wrong IP address, using 20.0.2.2 instead of 10.0.2.2. This hangs the process and blocks ktcp until reboot. ICMP works, incoming telnet works partly but does not complete a connection. Trying another ftpget using the correct address hangs that process too. The process is not interruptible: neither SIGINT nor an external kill -9 works.
—Mellvik
Final numbers as we close this issue: Still on 386/20MHz compaq, and now with DEBUG_TCP turned off as suggested by @ghaerr:
A new set of records to beat.
-M
I propose we keep the hd major devices for a while - I hope to do some testing with the direct drivers to compare, not so much the speed but the interrupt blocking. Eventually.
The devices will all be there, with the same numbers. Instead of using /dev/hda, you'll use another name for testing. I think it's important that ELKS be easy to understand, which is the reason for using /dev/hd* rather than /dev/bd* (shortly). The unused /dev/hd* devices were commented out this last week. Should you start development on the direct driver, you can uncomment one of the /dev/hd* names, which will likely be renamed /dev/dhd* (for direct hd). I think getting the system running under the direct driver will be a major undertaking - none of the interrupt-driven buffer management code, in addition to the block driver itself, has ever been tested, and it is very complicated. In addition, the floppy code will have to use the old driver. All-in-all, a big project for v0.5 or later.
(ftpget) with the wrong IP address, using 20.0.2.2 instead of 10.0.2.2. This hangs the process and blocks ktcp until reboot.
I'll add connection timeout to the list.
Some research on two of these issues, first the good news:
Running ls -l bigdir or cat file (where file is larger than a couple of kilobytes) shows data lost. The pattern is roughly 504-512 bytes dropped every other 512 bytes.
The telnet problem is at least partly reproducible in qemu, but the number of bytes between losses and the size of each loss are less predictable.
➜ ~ ls -l
total 41184
drwx------@ 4 helge staff 128 Jan 19 2018 Applications
drwx------@ 295 helge staff 9440 Oct 2 15:40 Desktop
drwx------@ 82 helge staff 2624 Sep 15 19:40 Documents
drwx------@ 607 helge staff 19424 Oct 12 14:09 Downloads
drwx------@ 61 helge staff 1952 Feb 24 2020 Dropbox
drwx------@ 20 helge staff 640 Oct 13 09:17 Google Drive
drwx------@ 89 helge staff 2848 Nov 25 2019 Library
-rwxr-xr-x 1 helge staff 32204 Mar 10 2020 Menuconfig
drwxrwxr-x 161 root wheel 5152 Sep 28 2017 Microsoft
drwx------+ 14 helge staff 448 Jan 11 2020 Movies
drwx------+ 8 helge staff 256 Jan 14 2020 Music
drwx------+ 6 helge staff 192 Feb 12 2020 Pictures
drwxr-xr-x+ 4 helgestaff 128 Jan 18 2018 Public
drwxr-xr-x 33 helge staff 1056 May 4 10:42 VirtualBox VMs
drwxr-xr-x 129 helge staff 4128 Jul 12 2018 adresselister-madmimi
-rwxr-xr-x 1 helge staff 1096147 Oct 7 00:16 configure
-rw-r--r-- 1 helge staff 742364 Sep 18 08:57 elks-transfer.log
drwxr-xr-x 23 helge staff 736 Apr 26 2019 monodevelop
-rw-r--r-- 1 helge sta 24967 Aug 14 11:52 ne2k-mac.S-pre-karma
-rw-r--r-- 1 helge staff 0 Apr 23 15:34 new.diff
drwxr-xr-x 36 helge staff 1152 Dec 10 2019 pcjs
-rw-r--r-- 1 helge staff 18333696 Mar 21 2020 rq0-ra81.dsk
drwxr-xr-x 36 helge staff 1152 Sep 14 12:15 src
drwxr-xr-x 50 helge staff 1600 Oct 6 22:23 tmp
➜ ~ rm configure
It also seems - although harder to verify - that a possibly related problem applies to telnet input: If exposed to a paste operation, most of the input gets lost. Not 100% consistent, but the first ~400 (readings: 404-420) chars are accepted, the rest gets dropped.
--Mellvik
Entirely different question - same topic: I've been wondering and forgotten to ask - does the difference between sent packets and received packets (1:2) in netstat have any specific significance or is it a bug?
# netstat
----- Received --------- ----- Sent -------------
TCP Packets 735116 TCP Packets 1465367
TCP Dropped 0 TCP Retransmits 31
TCP Bad Checksum 0 TCP Retrans Memory 0
IP Packets 742578 IP Packets 1472860
IP Bad Checksum 0 IP Bad Headers 0
ICMP Packets 7462 ICMP Packets 7462
SLIP Packets 0 SLIP Packets 0
ETH Packets 742860 ETH Packets 1473142
ARP Reqs Sent 0 ARP Replies Rcvd 0
ARP Reqs Rcvd 363 ARP Replies Sent 363
ARP Cache Adds 2
No State RTT lport raddress rport
-----------------------------------------------------
1 ESTABLISHED 1000ms 1024 0.0.0.0 2
2 ESTABLISHED 62ms 23 10.0.2.59 48742
3 LISTEN 1000ms 80 0.0.0.0 0
4 LISTEN 1000ms 23 0.0.0.0 0
TCP transfer speed: For the record - and for reference
The 386/20 + Eagle ne2k pnp card delivers these numbers on MTCP (packet driver)/DOS 6.22:
That's probably as fast as this HW can be pushed. If we eventually get half of that, we're doing really well.
Hello @Mellvik,
Thanks for your reports! And do the bugs never end? Ugh. I will look into the telnet losing data problem. I hadn't seen it before, but will try duplication on QEMU. I am glad to hear we can't duplicate the curl issue, but I believe the possibility is still there, given enough packet losses or retransmits, since the core of the ktcp code wasn't changed. We only eliminated a driver packet overrun problem and tuned the system to slow down on large send windows.
does the difference between sent packets and received packets (1:2) in netstat have any specific significance ?
If you try ^P just before connecting using telnet (from ELKS to outside), it shows that ktcp always ACKs a received packet separately before sending yet another packet with data. Thus there seems to be twice as many sent packets as received. I originally noticed this inefficient behavior when debugging the NE2K ne2k_pack_put routine, since this is the "back-to-back" sending that caused that routine to not write the second packet on the wire.
I had forgotten about this inefficiency. I'll have to look more deeply to see why it occurs, and what could be done to combine the first ACK into the subsequent TCP data packet. That will probably be a project for v0.5!
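For what it's worth, a sketch of the kind of fix being discussed (the names and header layout here are hypothetical, not ktcp's): rather than emitting a bare ACK and then a separate data segment, the ACK is piggybacked on the outgoing data segment by setting the ACK flag and acknowledgment number on it.

/* Hypothetical sketch, not ktcp code: piggyback the ACK on the data segment. */
#include <stdint.h>

#define TF_ACK 0x10
#define TF_PSH 0x08

struct tcp_hdr_sketch {
    uint32_t seq;       /* first sequence number of our payload */
    uint32_t ack;       /* next byte we expect from the peer */
    uint8_t  flags;
};

/* Build one segment that both acknowledges received data and carries our own
 * payload, so the receive path no longer triggers two back-to-back sends. */
static void build_data_with_ack(struct tcp_hdr_sketch *h, uint32_t snd_nxt,
                                uint32_t rcv_nxt, int have_payload)
{
    h->seq   = snd_nxt;
    h->ack   = rcv_nxt;
    h->flags = TF_ACK | (have_payload ? TF_PSH : 0);
}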
Receive (FTP) to disk 240 kbps
Send (FTP) to disk 294 kbps
Wow. And here I thought my 386 was fast with ktcp. Seeing that ktcp is sending two packets when it could be sending one, fixing that will surely provide a big speed increase. And of course, we're running a timesharing OS with TCP half in the kernel and half in user land, which definitely slows things down.
Thank you!
@Mellvik,
I can't get the telnet data loss to repeat on QEMU. I'm running an outside telnetd, then starting QEMU and telnetting from ELKS to macOS, all on serial console. Are you running multiple telnet sessions to get this to repeat? Can you send more specific info, thanks!
does the difference between sent packets and received packets (1:2) in netstat have any specific significance ?
If you try ^P just before connecting using telnet (from ELKS to outside), it shows that ktcp always ACKs a received packet separately before sending yet another packet with data. Thus there seems to be twice as many sent packets as received. I originally noticed this inefficient behavior when debugging the NE2K ne2k_pack_put routine, since this is the "back-to-back" sending that caused that routine to not write the second packet on the wire.
I had forgotten about this inefficiency. I'll have to look more deeply to see why it occurs, and what could be done to combine the first ACK into the subsequent TCP data packet. That will probably be a project for v0.5!
Interesting indeed @ghaerr. And yes, I agree, merged ACKs sound like something for 0.5. Still - this deficiency will, like you say, primarily affect telnet sessions, since file transfers in general work this way anyway. So your reply got me thinking: most of the traffic in my sample netstat was curl - files coming out of ELKS - so maybe we're just seeing the recipient optimizing ACKs, acking only every other packet. I'll have to tcptrace that. We know from previous traces that ktcp is fine with that. Knowing that it works is good too.
Receive (FTP) to disk 240 kbps
Send (FTP) to disk 294 kbps
Wow. And here I thought my 386 was fast with ktcp. Seeing that ktcp is sending two packets when it could be sending one, fixing that will surely provide a big speed increase. And of course, we're running a timesharing OS with TCP half in the kernel and half in user land, which definitely slows things down.
Indeed we have some upward potential, in particular incoming traffic, which is now about 20% of this. Then again - if improvement continues at the speed it has recently, … :-)
—Mellvik
I can't get the telnet data loss to repeat on QEMU. I'm running an outside telnetd, then starting QEMU and telnetting from ELKS to macOS, all on serial console. Are you running multiple telnet sessions to get this to repeat? Can you send more specific info, thanks!
My exact setup is
—Mellvik
So your reply got me thinking: Most of the traffic in my sample netstat was curl - files coming out of ELKS, so maybe we're just seeing the recipient optimizing acks, acking only every other packet. I'll have to tcptrace that. We know from previous traces that ktcp is fine with that. Knowing that it works is good too.
Turns out that is indeed the case. After receiving about 25k bytes, the recipient (linux/raspbian) starts acking every other packet.
So it makes sense after all…
—Mellvik
- qemu on Macos
- telnet from the mac to elks
- start a telnetd on the mac
- telnetting back to macOS from ELKS
- ls -l
Hmmm, works fine over here. I am running Homebrew telnet and telnetd, installed via "brew install telnet telnetd".
Telnet from macOS to ELKS via "telnet localhost 2323". Telnetd is started via "/usr/local/sbin/telnetd -debug 23 &". Telnet from ELKS to macOS via "telnet 192.168.0.10". Running "ls -l" on either side works without loss. Also tried "ls -lR /" on ELKS, and longer directories on macOS.
Using FWD="hostfwd=tcp:127.0.0.1:8080-10.0.2.15:80,hostfwd=tcp:127.0.0.1:2323-10.0.2.15:23" in qemu.sh.
Very strange indeed. It's exactly the same - except I'm telnetting to 10.0.2.2 from elks.
Will look closer in the morning.
-M
I am glad to hear we can't duplicate the curl issue, but I believe the possibility is still there, given enough packet losses or retransmits, since the core of the ktcp code wasn't changed. We only eliminated a driver packet overrun problem and tuned the system to slow down on large send windows.
OK, I was thinking some of the many adjustments you made, particularly at the buffer level, may have inadvertently (!) fixed the issue. I've been beating the system up pretty hard while testing in order to get the issue to repeat (lots of retransmits, overruns etc.), so it's indeed surprising that the error didn't show.
So - I ran sum() on the full disk image (512MB) transferred from ELKS, and sum() on the raw device on ELKS where the data came from (which took more than an hour) - indeed they are different. So the search continues.
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
—Mellvik
Hi @Mellvik,
I haven't made any changes to the system at the buffer level. There could be a number of causes for this; given the number of issues we've seen in ELKS, it could be unrelated to TCP and instead be a file I/O issue.
I ran sum() on the fill disk (512MB) transferred from elks, and sum() on the raw device on elks where the data cam from (which took more than an hour) - indeed they are different.
For debugging this issue, I would rather this be tested using known good tools, for instance transferring a file out of ELKS to Linux, and using the Linux cmp or diff command to compare binaries from different transfers. If it is a TCP issue, it will likely be far easier to debug the telnet issue first (which doesn't currently repeat on my system), which may be the root cause. It isn't clear running sum on a char device works properly on ELKS, for instance.
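The check being asked for amounts to finding the first offset where two copies of the transferred file differ, which Linux cmp and diff already report. Purely as an illustration (a hypothetical helper, not part of ELKS or the suggested workflow), a byte-by-byte comparison could look like this:

/* Minimal cmp-style comparison: report the first differing byte offset. */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *a, *b;
    long off = 0;
    int ca, cb;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }
    a = fopen(argv[1], "rb");
    b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 2;
    }
    for (;;) {
        ca = getc(a);
        cb = getc(b);
        if (ca != cb) {
            printf("files differ at byte %ld\n", off);   /* includes length mismatch */
            return 1;
        }
        if (ca == EOF) {
            printf("files identical (%ld bytes)\n", off);
            return 0;
        }
        off++;
    }
}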
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
LOL, not surprised - you're using a couple new ELKS programs!!
I ran sum() on the fill disk (512MB) transferred from elks, and sum() on the raw device on elks where the data cam from (which took more than an hour) - indeed they are different.
For debugging this issue, I would rather this be tested using known good tools, for instance transferring a file out of ELKS to Linux, and using the Linux cmp or diff command to compare binaries from different transfers.
Yes, that's essentially what I've been doing: using partial transfers from the raw drive and comparing them on Linux. The cmp on elks was a one-timer, and it's a good point that this may not be entirely predictable. (Then there is another source of errors - if I happen to mount the drive in between, I'm guaranteed to have differences.) So I'll switch to a big static file instead. Thing is - when I found everything to be OK yesterday, I had transferred some 50MB 3-4 times via curl with no differences, no errors.
If it is a TCP issue, it will likely be far easier to debug the telnet issue first (which doesn't currently repeat on my system), which may be the root cause. It isn't clear running sum on a char device works properly on ELKS, for instance.
That's a good point. I have more or less assumed it had to be a PTY issue, since data is actually lost, whereas in the file transfer case (curl) the data is corrupted, not lost.
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
LOL, not surprised - you're using a couple new ELKS programs!!
Indeed - and I guess your list is longer than mine: all the small things and some bigger ones… BTW - while continuing to beat up ktcp, I'm getting closer to a predictable hang situation under heavy load. We're getting there.
—Mellvik
I have more or less assumed it had to be a PTY issue, since data is actually lost,
For telnet, that seems to be the case. I would like to know how you confirmed it is losing 504-512 bytes every other time.
which in the file transfer case (curl), data is corrupted, not lost.
Knowing what the corrupted data has actually been replaced with would be interesting. Perhaps sending a very large text file that can be visually looked at would help. Is the data replaced by another disk block/tcp packet, or garbage? Stuff like that.
I'm getting closer to a predictable hang situation under heavy load.
That's going to be another hard-to-debug issue, not looking forward to that one.
I am thinking it might be good to get a 0.4 version out before continuing to tackle the endless supply of bugs...
@ghaerr,
More testing has exonerated TCP from the list of suspects in this issue. Incoming (to ELKS) was never in question; outgoing looked suspicious for a while, but tests this weekend - transfers totalling more than 4 million outgoing packets, partly on an idle system, partly shared with other traffic, retransmits etc. - have been consistently 100% correct.
BTW - the outgoing speed is consistent - 75.6k bytes per sec.
We're back to investigating lost data in telnet and/or ptys.
—Mellvik
Hello @Mellvik,
Thanks for your continued testing on this bug.
We're back to investigating lost data in telnet and/or ptys.
I've been so deep in other matters, I'm finding it hard to remember the details of this: Is the error only occurring when telnetting in to ELKS and using our telnetd?
Something to try: since you're saying we're losing approx 504 bytes every other 512 bytes, there is the chance that the PTY character queue is overflowing, which would drop 512 characters. It is the only queue that is 512 bytes long, so a likely suspect. In order to test this theory, change the following line in elks/include/linuxmt/ntty.h:
#define PTYOUTQ_SIZE 512 /* pty output queue size (=TDB_WRITE_MAX and telnetd buffer)*/
to another number, like perhaps 800 or 400, and make clean. The size doesn't have to be a power of two anymore. If the new "dropped characters" amount changes to near the new number, then we've confirmed that it's the PTY driver dropping the characters. I don't yet know why that would be; let's try this first.
If this doesn't change anything, the other culprit will be telnetd itself, which could possibly be losing an entire block of incoming or outgoing data somehow.
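As a rough illustration of the overflow theory (hypothetical names below - this is not the ELKS ntty code): a fixed-size character queue that silently discards writes when full loses a burst close to its size whenever the reader falls behind, which is why resizing PTYOUTQ_SIZE should shift the amount of data lost if this is the culprit.

/* Hypothetical sketch, not ELKS ntty code: a full ring buffer drops bytes. */
#define PTYOUTQ_SIZE 512

struct charq {
    unsigned char buf[PTYOUTQ_SIZE];
    unsigned int  head, len;        /* head = read index, len = bytes queued */
    unsigned long dropped;          /* a counter worth instrumenting for this test */
};

/* Returns 1 if the character was queued, 0 if it was dropped. */
static int charq_put(struct charq *q, unsigned char ch)
{
    if (q->len == PTYOUTQ_SIZE) {
        q->dropped++;               /* queue full: every byte arriving now is lost */
        return 0;
    }
    q->buf[(q->head + q->len) % PTYOUTQ_SIZE] = ch;
    q->len++;
    return 1;
}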
Thank you!
Thank you @ghaerr, That's a head start as I dive back into this tomorrow.
BTW - the reliability of elks data transfers is really encouraging. We're moving gigabytes back and forth without problems at reasonable speeds on old clunkers. Compared to where we were just a few months back, well, I guess I have mentioned it before: worth a 'skaal'!
--Mellvik
I guess I have mentioned it before: worth a 'skaal'!
Is that a Norwegian beer... or a toast?
You guessed it! Cheers!
M