Closed Mellvik closed 4 years ago
Here is my take on the status of TCP/IP. I'm not sure I'd use "Stable", but instead "Buggy":
Serial IP (SLIP/CSLIP): Working @ moderate baud rates, needs serial HW flow control to get to the next level. SLIP - Not buggy; CSLIP - Not working at all. Both dependent on serial driver, not TCP/IP for higher baud rates. I can run at 38400 on all systems using fast serial driver, and 115200 in my faster box using regular driver. Agreed serial HW flow control likely to help, but that's really quite separate from TCPIP and unclear as to how much throughput will actually increase, given max serial speed on old boxes even with flow control.
Ethernet Link level NE2k: Stable - AGREE, needs more testing since you're the only one running it.
ICMP, ARP, TCP: Stable - ICMP, ARP: Not buggy. TCP - Stable, but extremely buggy on lost packets and closing sockets. No sliding window implemented, and unable to accept large window (or packets) since currently the only real flow control TCP has is to drop a packet when the upstream isn't ready, or doesn't have buffer space to accept data.
Telnet/PTYs: Not stable - No, both stable and working well. ELKS PTYs operate differently than Linux PTYs: they need a process on either side of them to pump data in and out. On Linux, only one process is needed and the kernel pumps the data across, so that slows things down on ELKS.
The ELKS httpserver can deliver any file and file size reliably. - Have only tested receiving data from ELKS.
urlget is reliable for incoming data. - Haven't tested it.
urlget is also ftpget, which currently does not work but 'in the works'. - Unknown, can you give a bug report?
Stable TCP-level communication enables file transfers (#697) to/from ELKS, and it's time to get transfer speeds from 'working' to 'useable'. Currently, transfer speed from ELKS (IDE harddisk) via HTTP is 25.5KB/s, packet size 140 bytes (Compaq Portable 386/20, ISA16 NE2k card). For incoming file transfer (to /dev/null) the advertised window size is 255, effective packet size is 295, speed 36KB/s. We may be able to improve both by an order of magnitude, but half of that is enough to make for a very useable system.
Sadly, I see no easy way to increase incoming packet size without a lot of work. The TCPDEV (not TCP/IP, but the /dev/tcpdev that transfers data between kernel and ktcp) currently only allows 100 bytes per transfer into the kernel. We have very limited buffer space per connection (currently 4096 bytes in each direction in ktcp) without running out of space in ktcp when lots of connections are open. Adding more buffer space just prolongs the crash/resync issue when ktcp finally has to drop a packet.
The problem is that the data has to go through the network, through ktcp, into buffers, into the kernel, back out of the kernel from the first half of a PTY to a daemon process, then (if interactive) back into the kernel from there to another PTY, then to a receiving application; if that chain gets backed up, ktcp has to drop a packet. The whole chain has to keep up with the incoming flow or degradation occurs. Sliding window just adds more received packets into more buffers, but the chain still has to process the received data. Without fixing the TCP/IP packet loss issues, then writing lots of code to implement sliding window, then looking into TCPDEV, I can't see how to get either the packet size up or the kernel throughput increased.
Currently, there's another system design problem with PTYs: a PTY buffer needs to be bigger than any received data, or the system will hang waiting for I/O in the opposite direction (this could be fixed, it's on my long list). I don't even know whether the smaller packet size is the bottleneck: it could easily be the numerous memcpy's all over the place, or simply that slower system speed is the basic problem.
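To make the flow-control behavior described above concrete, here is a minimal sketch in C. The names (struct conn, rx_segment, RX_BUF_SIZE) are hypothetical and this is not the actual ktcp source; it only illustrates the one mechanism mentioned - drop the packet when the per-connection buffer can't take it, and let the peer retransmit.

/* Hypothetical sketch, not ktcp code: the only flow control described
 * above is "drop the packet when there is no room for it". */
#include <string.h>

#define RX_BUF_SIZE 4096            /* per-direction buffer size mentioned above */

struct conn {
    unsigned char rx_buf[RX_BUF_SIZE];
    unsigned int  rx_used;          /* bytes waiting to be pushed through /dev/tcpdev */
};

/* Returns 1 if the segment was queued, 0 if it had to be dropped. */
static int rx_segment(struct conn *c, const unsigned char *data, unsigned int len)
{
    if (c->rx_used + len > RX_BUF_SIZE)
        return 0;                   /* no buffer space: drop, forcing a retransmit */
    memcpy(c->rx_buf + c->rx_used, data, len);
    c->rx_used += len;
    return 1;
}

Everything downstream (tcpdev, the PTY, the daemon process) has to drain that buffer fast enough; otherwise this is the point where data gets discarded.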
@ghaerr - I'm assuming there are some low threshold improvements that can be made immediately, any hints as to where to start?
You probably won't like this answer, but I can't think of any immediate improvements, or believe me, I would have done it already!!!
Perhaps now would be a time to port other file transfer programs and/or start using the (slower) networking we have working in more ways. We also need to fix the close issues, as they crash ktcp on occasion.
I was going to port rlogin, rlogind and rcp to ELKS but realized they're not present on any hosts so not sure that's worth the trouble. Then thought about porting tftp but ELKS TCP/IP doesn't support UDP.
We might want to describe some file transfer scenarios (both automated and manual) that would be big improvements to what we now have, and write or port that code.
OK @ghaerr, It looks like what I call stable, you call 'not buggy' - that's fine with me. And thanks for the details about the 'inner challenges'. A few more questions.
Agreed serial HW flow control likely to help, but that's really quite separate from TCP/IP and unclear as to how much throughput will actually increase, given max serial speed on old boxes even with flow control.
OK; I'm unsure how important this is (i.e. what priority it should get): implementing HW flow control in ELKS is not a big deal, but would it help, since most of the stuff we interface with these days has no such thing, just RX, TX & GND? HW flow control will help between old systems - how likely is that? I guess the one who has the itch drives the work...
Ethernet Link level NE2k: Stable - AGREE, needs more testing since you're the only one running it.
Indeed, more testers, more testing.
ICMP, ARP, TCP: Stable - ICMP, ARP: Not buggy. TCP - Stable, but extremely buggy on lost packets and closing sockets.
So it's the same as ETHER: needs more testing. That's what I'm doing - and it's going surprisingly well. More below.
Telnet/PTYs: Not stable - No, both stable and working well.
OK; I misunderstood one of your comments in the discussion with MFLD (#744).
ELKS PTYs operate differently than Linux PTYs: they need a process on either side of them to pump data in and out. On Linux, only one process is needed and the kernel pumps the data across, so that slows things down on ELKS.
Is this something that should be fixed (the single vs dual process architecture)?
The ELKS httpserver can deliver any file and file size reliably. - Have only tested receiving data from ELKS.
That's the only functionality it has, so that's what I'm saying - 'can deliver any file of any size reliably'. Which means we have a reliable (and scriptable) outgoing transport. Just last weekend I transferred an entire HD image out of the ELKS box (512MB) in order to repartition (Partition Magic won't work w/o VGA graphics). I will move it back using urlget.
urlget is reliable for incoming data. - Haven't tested it.
As pointed out before, I have tested it extensively; it's fine. I'm adding a progress indicator and some other minor necessities.
urlget is also ftpget, which currently does not work but 'in the works'. - Unknown, can you give a bug report?
I'll get back to you on that. Right now it just hangs, but ktcp is fine when the process is killed, so the first hunch is that it is not network related. Having FTP available in a command line tool would be great.
The TCPDEV (not TCP/IP, but the /dev/tcpdev that transfers data between kernel and ktcp) only allows 100 bytes transferred into the kernel, currently.
Is this something that can be improved using - on demand - your heap_alloc() routine?
We have very limited buffer space per connection (current 4096 bytes each direction in ktcp) without running out of space in ktcp when having lots of connections open.
How many is 'lots of connections'? Do we need that now?
The whole chain has to keep up with the incoming flow or degradation occurs.
I guess the complexity of this brings up the user process vs kernel stack question again. Is the balance between the two changing with the ongoing work on kernel far text segments?
You probably won't like this answer, but I can't think of any immediate improvements, or believe me, I would have done it already!!!
My thinking is that now that we have some stability (ahem, bug-freeness) and physical ethernet support, the debugging and tuning should be much easier than before, since we can see the traffic down to the final bit if we like. It's my take that what we have is an impressive feat given the environment in which it was developed.
Perhaps now would be a time to port other file transfer programs and/or start using the (slower) networking we have working in more ways. We also need to fix the close issues, as they crash ktcp on occasion.
My suggestion is to start with what we have - which is working (http), add ftp (incoming) soon and take it from there.
I was going to port rlogin, rlogind and rcp to ELKS but realized they're not present on any hosts so not sure that's worth the trouble. Then thought about porting tftp but ELKS TCP/IP doesn't support UDP.
If we're sure this is the IP stack we're going for, I'd say adding UDP is a priority.
We might want to describe some file transfer scenarios (both automated and manual) that would be big improvements to what we now have, and write or port that code.
That was the purpose of my note in #697.
—Mellvik
Is this something that should be fixed (the single vs dual process architecture)?
No, the PTY code is fine, just has a few issues we need to think about. There are many other priorities in TCP that should come first.
Just last weekend transferred an entire HD image out of the ELKS box (512MB)
Wow, how long did that take?!?!
Is this something that can be improved using - on demand - your heap_alloc() routine?
No, too complicated. But I've added increasing that size to my todo list as something to consider. I'm not sure yet it's the bottleneck.
I guess the complexity of this brings up the user process vs kernel stack again.
You mean user vs kernel process? I don't think there's much chance of rewriting all the networking stuff and putting it in the kernel - way too much work at this point anyway.
Is the balance between the two changing with the ongoing work on kernel far text segments?
I am hoping so, not for networking, but for making space for all our other improvements. Last night I worked on that. We are currently at ONLY 6k bytes free kernel code space. And moving all the kernel init routines into a far text segment only gave us 3.4k bytes. And that kernel won't boot, likely due to compiler and other issues. So it's going to be a long road... and every procedure has to be individually labeled at this point. No easy answers yet, so tighten our kernel code belts!
Just last weekend transferred an entire HD image out of the ELKS box (512MB)
Wow, how long did that take?!?!
I did report the speed - 27.5KB/s. Somewhat better for incoming, but then that was w/o any local I/O (/dev/null). Anyway, 512MB at that rate adds up to a BIG number (well over five hours) - and a BIG incentive to look for speedups.
OK, I have a real incentive now to put some effort into speedups, so if you can think of something to test let me know.
In the meantime I'll fix ftpget, improve urlget and take a look at implementing PUT in the httpd.
—Mellvik
We have very limited buffer space per connection (current 4096 bytes each direction in ktcp) without running out of space in ktcp when having lots of connections open.
How many is 'lots of connections'? Do we need that now?
Well, actually only two TCP connections... but when running just two simultaneous ELKS telnet connections to localhost, we pretty much max out ELKS: 15 processes (1 left to run ps), 4 PTYs, 2 shells, 2 telnets (shared code), 2 telnetd's (forked), maxed out heap (prior to lowering the kernel stack size to 512), plus ktcp and the regular processes. This is the test setup I want to use in testing the improved heap allocator in #744, by ending and restarting the 2nd telnet localhost connection, which will stress the near heap with the large and small PTY allocations when nearly full.
— Point taken.
I'd love to replace the localhost part with physical - when you're ready.
-- M
Since the previous post on this thread, there have been too many PRs, improvements, test reports and discussions to count. The 'slow but working' status has changed to 'stable and quite fast' - in particular on the TCP level. Here are some test results using the latest build (4512f614) with minor changes to the ne2k driver:
[386/20]
In order to get a metric on system capability vs. network speed, the expansion chassis containing the network card was moved to a 286/12.5MHz machine and the tests repeated (modified/fast driver). Both the 286 and the 386 have a cycle-to-instruction ratio of approx. 4.5, so the performance difference in real mode should roughly track the clock ratio, about 1.6 (20MHz / 12.5MHz). Adding the effects of other HW improvements, the expected difference should be about 1:2.
The 286 test is interesting beyond the numbers because the slow speed pushes robustness real hard. Overruns and retransmits all the time, and here's another cork-popper: it is now really hard to crash ktcp. Concurrent file transfers, several telnets active both ways, retransmits & overruns en masse, and we're still running. Flood-pinging the 286 system with large packets eventually sent ktcp into a loop that hung the system. I'm working on creating a repeatable scenario for this; it may be hard to reproduce on a faster system.
There is one other situation that produced errors, this time on the 386 with the faster version of the driver: having a curl file transfer going, then telnetting into ELKS and back out (so we have a double telnet in effect) causes occasional character loss. Again, I'm working on creating a repeatable scenario. Given the stability of the other testing, it seems unlikely that this is TCP-level... ?
All in all, fantastic improvement - and IMHO time to close this issue.
--mellvik
Hi @Mellvik,
Thanks for your report. Agreed, all-in-all, great improvements seen. I wasn't able to work on most of these issues until I got my real hardware NE2K card working, thanks to you.
I have seen very few system crashes, none repeatable. I did see one outbound ELKS-to-Linux curl transfer end up creating a 23Mb file whose size was proper but contents not correct. Unfortunately I deleted that and don't have it for analysis. So there seems to be a possibility where TCP fails without error, which is concerning. I would suggest further high-error transfers and diff or cmp comparison to see if we can get that to repeat.
There are a couple other high-priority problems with ktcp which still occur regularly:
Lower priority:
The ELKS issues with disk I/O and subsequent loss of TCP/IP speed are not fixed easily, and are the result of using the synchronous BIOS calls rather than interrupt-driven I/O, as well as just slow floppies. This problem is already discussed in #521.
All in all, fantastic improvement - and IMHO time to close this issue.
Agreed, lets close this and report new issues when they arise.
Thank you!
I have seen very few system crashes, none repeatable. I did see one outbound ELKS-to-Linux curl transfer end up creating a 23Mb file whose size was proper but contents not correct. Unfortunately I deleted that and don't have it for analysis. So there seems to be a possibility where TCP fails without error, which is concerning. I would suggest further high-error transfers and diff or cmp comparison to see if we can get that to repeat.
Good point - I'll put that at the top of my list, along with the telnet losing data.
There are a couple other high-priority problems with ktcp which still occur regularly: Connecting telnet quickly again after disconnect gives a "SYN sent, wrong ACK" error. This is because one half of the TCP connection is still open (timing out after 4 seconds), as shown by netstat. The disconnecting telnet needs to shut down the whole connection properly to fix this. I will work on this next.
Yes, I noticed one while testing ^P with DEBUG_TCP this morning. It seems like there is a 10 second timeout (or maybe 10 rounds of the .5 sec loop); maybe this is useful:
tcpdev: got close from ELKS process
TCP: send src:23 dst:44230 flags:11 seq:f856bb44 ack:a4047abd win:255 urg:0 chk:0 len:20 rtt 0 RTT 5 RTO 10
ktcp: update 0,1 expire state 5 expire state 1 expire state 1
TCP: recv src:44230 dst:23 flags:11 seq:a4047abd ack:f856bb45 win:29200 urg:0 chk:0 len:20 TS_FIN_WAIT_1
TCP: send src:23 dst:44230 flags:10 seq:f856bb45 ack:a4047abe win:255 urg:0 chk:0 len:20
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 expire state 1 expire state 1
ktcp: update 0,1 expire state 6 tcp: REMOVING control block expire state 1 expire state 1
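(Side note for reading the trace: the exchange above looks like a standard TCP active close - our FIN+ACK (flags 0x11), the peer's FIN+ACK back, our final bare ACK (flags 0x10), then a linger until the control block is removed. The sketch below uses hypothetical state and helper names and is not the ktcp implementation; it only maps those steps onto the usual state machine.)

/* Hypothetical sketch of a standard active close, not ktcp code. */
enum tcp_state { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSED };

/* Local side closes: send FIN (seen above as the flags:11 segment). */
static enum tcp_state active_close(void (*send_fin)(void))
{
    send_fin();
    return FIN_WAIT_1;
}

/* Handle a segment received while closing. */
static enum tcp_state close_input(enum tcp_state s, int fin, int ack,
                                  void (*send_ack)(void))
{
    switch (s) {
    case FIN_WAIT_1:
        if (ack && fin) {       /* peer ACKed our FIN and sent its own FIN */
            send_ack();         /* the final flags:10 segment in the trace */
            return TIME_WAIT;   /* linger, then the control block is removed */
        }
        if (ack)
            return FIN_WAIT_2;  /* our FIN ACKed, still waiting for peer's FIN */
        return s;
    case FIN_WAIT_2:
        if (fin) {
            send_ack();
            return TIME_WAIT;
        }
        return s;
    default:
        return s;
    }
}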
The ELKS issues with disk I/O and subsequent loss of TCP/IP speed are not fixed easily, and are the result of using the synchronous BIOS calls rather than interrupt-driven I/O, as well as just slow floppies. This problem is already discussed in #521 https://github.com/jbruchon/elks/issues/521.
Yes, I do realize that - and while on the subject, I propose we keep the hd major devices for a while - I hope to do some testing with the direct drivers to compare, not so much the speed but the interrupt blocking. Eventually.
BTW - I did run across a new problem this morning: I opened a tcp connection from ELKS (ftpget) with the wrong IP address, using 20.0.2.2 instead of 10.0.2.2. This hangs the process and blocks ktcp until reboot. ICMP works, incoming telnet works partly but does not complete a connection. Trying another ftpget using the correct address hangs that process too. The process is not interruptible: neither SIGINT nor an external kill -9 works.
—Mellvik
Final numbers as we close this issue: Still on 386/20MHz compaq, and now with DEBUG_TCP turned off as suggested by @ghaerr:
A new set of records to beat.
-M
I propose we keep the hd major devices for a while - I hope to do some testing with the direct drivers to compare, not so much the speed but the interrupt blocking. Eventually.
The devices will all be there, with the same numbers. Instead of using /dev/hda, you'll use another name for testing. I think it's important that ELKS be easy to understand, which is the reason for using /dev/hd* rather than /dev/bd* (shortly). The unused /dev/hd* devices were commented out this last week. Should you start development on the direct driver, you can uncomment one of the /dev/hd* names, which will likely be renamed /dev/dhd* (for direct hd). I think getting the system running under the direct driver will be a major undertaking - none of the interrupt-driven buffer management code, in addition to the block driver itself, has ever been tested, and it is very complicated. In addition, the floppy code will have to use the old driver. All-in-all, a big project for v0.5 or later.
(ftpget) with the wrong IP address, using 20.0.2.2 instead of 10.0.2.2. This hangs the process and blocks ktcp until reboot.
I'll add connection timeout to the list.
Some research on two of these issues, first the good news:
Running ls -l bigdir or cat file (where file is larger than a couple of kilobytes) shows data lost. The pattern is roughly 504-512 bytes dropped every other 512 bytes.
The telnet problem is at least partly reproducible in qemu, but the number of bytes between losses and the size of each loss are less predictable.
➜ ~ ls -l
total 41184
drwx------@ 4 helge staff 128 Jan 19 2018 Applications
drwx------@ 295 helge staff 9440 Oct 2 15:40 Desktop
drwx------@ 82 helge staff 2624 Sep 15 19:40 Documents
drwx------@ 607 helge staff 19424 Oct 12 14:09 Downloads
drwx------@ 61 helge staff 1952 Feb 24 2020 Dropbox
drwx------@ 20 helge staff 640 Oct 13 09:17 Google Drive
drwx------@ 89 helge staff 2848 Nov 25 2019 Library
-rwxr-xr-x 1 helge staff 32204 Mar 10 2020 Menuconfig
drwxrwxr-x 161 root wheel 5152 Sep 28 2017 Microsoft
drwx------+ 14 helge staff 448 Jan 11 2020 Movies
drwx------+ 8 helge staff 256 Jan 14 2020 Music
drwx------+ 6 helge staff 192 Feb 12 2020 Pictures
drwxr-xr-x+ 4 helgestaff 128 Jan 18 2018 Public
drwxr-xr-x 33 helge staff 1056 May 4 10:42 VirtualBox VMs
drwxr-xr-x 129 helge staff 4128 Jul 12 2018 adresselister-madmimi
-rwxr-xr-x 1 helge staff 1096147 Oct 7 00:16 configure
-rw-r--r-- 1 helge staff 742364 Sep 18 08:57 elks-transfer.log
drwxr-xr-x 23 helge staff 736 Apr 26 2019 monodevelop
-rw-r--r-- 1 helge sta 24967 Aug 14 11:52 ne2k-mac.S-pre-karma
-rw-r--r-- 1 helge staff 0 Apr 23 15:34 new.diff
drwxr-xr-x 36 helge staff 1152 Dec 10 2019 pcjs
-rw-r--r-- 1 helge staff 18333696 Mar 21 2020 rq0-ra81.dsk
drwxr-xr-x 36 helge staff 1152 Sep 14 12:15 src
drwxr-xr-x 50 helge staff 1600 Oct 6 22:23 tmp
➜ ~ rm configure
It also seems - although harder to verify - that a possibly related problem applies to telnet input: If exposed to a paste operation, most of the input gets lost. Not 100% consistent, but the first ~400 (readings: 404-420) chars are accepted, the rest gets dropped.
--Mellvik
Entirely different question - same topic: I've been wondering and forgotten to ask - does the difference between sent packets and received packets (1:2) in netstat have any specific significance or is it a bug?
# netstat
----- Received --------- ----- Sent -------------
TCP Packets 735116 TCP Packets 1465367
TCP Dropped 0 TCP Retransmits 31
TCP Bad Checksum 0 TCP Retrans Memory 0
IP Packets 742578 IP Packets 1472860
IP Bad Checksum 0 IP Bad Headers 0
ICMP Packets 7462 ICMP Packets 7462
SLIP Packets 0 SLIP Packets 0
ETH Packets 742860 ETH Packets 1473142
ARP Reqs Sent 0 ARP Replies Rcvd 0
ARP Reqs Rcvd 363 ARP Replies Sent 363
ARP Cache Adds 2
No State RTT lport raddress rport
-----------------------------------------------------
1 ESTABLISHED 1000ms 1024 0.0.0.0 2
2 ESTABLISHED 62ms 23 10.0.2.59 48742
3 LISTEN 1000ms 80 0.0.0.0 0
4 LISTEN 1000ms 23 0.0.0.0 0
TCP transfer speed: For the record - and for reference
The 386/20 + Eagle ne2k pnp card delivers these numbers on MTCP (packet driver)/DOS 6.22:
That's probably as fast as this HW can be pushed. If we eventually get half of that, we're doing really well.
Hello @Mellvik,
Thanks for your reports! And do the bugs never end? Ugh. I will look into the telnet losing data problem. I hadn't seen it before, but will try duplication on QEMU. I am glad to hear we can't duplicate the curl issue, but I believe the possibility is still there, given enough packet losses or retransmits, since the core of the ktcp code wasn't changed. We only eliminated a driver packet overrun problem and tuned the system to slow down on large send windows.
does the difference between sent packets and received packets (1:2) in netstat have any specific significance ?
If you try ^P just before connecting using telnet (from ELKS to outside), it shows that ktcp always ACKs a received packet separately before sending yet another packet with data. Thus there seems to be twice as many sent packets as received. I originally noticed this inefficient behavior when debugging the NE2K ne2k_pack_put routine, since this is the "back-to-back" sending that caused that routine to not write the second packet on the wire.
I had forgotten about this inefficiency. I'll have to look more deeply to see why it occurs, and what could be done to combine the first ACK into the subsequent TCP data packet. That will probably be a project for v0.5!
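For what it's worth, a sketch of the kind of fix being discussed (the names and header layout here are hypothetical, not ktcp's): rather than emitting a bare ACK and then a separate data segment, the ACK is piggybacked on the outgoing data segment by setting the ACK flag and acknowledgment number on it.

/* Hypothetical sketch, not ktcp code: piggyback the ACK on the data segment. */
#include <stdint.h>

#define TF_ACK 0x10
#define TF_PSH 0x08

struct tcp_hdr_sketch {
    uint32_t seq;       /* first sequence number of our payload */
    uint32_t ack;       /* next byte we expect from the peer */
    uint8_t  flags;
};

/* Build one segment that both acknowledges received data and carries our own
 * payload, so the receive path no longer triggers two back-to-back sends. */
static void build_data_with_ack(struct tcp_hdr_sketch *h, uint32_t snd_nxt,
                                uint32_t rcv_nxt, int have_payload)
{
    h->seq   = snd_nxt;
    h->ack   = rcv_nxt;
    h->flags = TF_ACK | (have_payload ? TF_PSH : 0);
}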
Receive (FTP) to disk 240 kbps
Send (FTP) to disk 294 kbps
Wow. And here I thought my 386 was fast with ktcp. Seeing that ktcp is sending two packets when it could be sending one, fixing that will surely provide a big speed increase. And of course, we're running a timesharing OS with TCP half in the kernel and half in user land, which definitely slows things down.
Thank you!
@Mellvik,
I can't get the telnet data loss to repeat on QEMU. I'm running an outside telnetd, then starting QEMU and telnetting from ELKS to macOS, all on serial console. Are you running multiple telnet sessions to get this to repeat? Can you send more specific info, thanks!
does the difference between sent packets and received packets (1:2) in netstat have any specific significance ?
If you try ^P just before connecting using telnet (from ELKS to outside), it shows that ktcp always ACKs a received packet separately before sending yet another packet with data. Thus there seems to be twice as many sent packets as received. I originally noticed this inefficient behavior when debugging the NE2K ne2k_pack_put routine, since this is the "back-to-back" sending that caused that routine to not write the second packet on the wire.
I had forgotten about this inefficiency. I'll have to look more deeply to see why it occurs, and what could be done to combine the first ACK into the subsequent TCP data packet. That will probably be a project for v0.5!
Interesting indeed @ghaerr. And yes, I agree, merged ACKs sound like something for 0.5. Still - this deficiency will, like you say, primarily affect telnet sessions, since file transfers in general work this way anyway. So your reply got me thinking: most of the traffic in my sample netstat was curl - files coming out of ELKS - so maybe we're just seeing the recipient optimizing ACKs, acking only every other packet. I'll have to tcptrace that. We know from previous traces that ktcp is fine with that. Knowing that it works is good too.
Receive (FTP) to disk 240 kbps
Send (FTP) to disk 294 kbps
Wow. And here I thought my 386 was fast with ktcp. Seeing that ktcp is sending two packets when it could be sending one, fixing that will surely provide a big speed increase. And of course, we're running a timesharing OS with TCP half in the kernel and half in user land, which definitely slows things down.
Indeed we have some upward potential, in particular incoming traffic, which is now about 20% of this. Then again - if improvement continues at the speed it has recently, … :-)
—Mellvik
I can't get the telnet data loss to repeat on QEMU. I'm running an outside telnetd, then starting QEMU and telnetting from ELKS to macOS, all on serial console. Are you running multiple telnet sessions to get this to repeat? Can you send more specific info, thanks!
My exact setup is
—Mellvik
So your reply got me thinking: Most of the traffic in my sample netstat was curl - files coming out of ELKS, so maybe we're just seeing the recipient optimizing acks, acking only every other packet. I'll have to tcptrace that. We know from previous traces that ktcp is fine with that. Knowing that it works is good too.
Turns out that is indeed the case. After receiving about 25k bytes, the recipient (linux/raspbian) starts acking every other packet.
So it makes sense after all…
—Mellvik
- qemu on Macos
- telnet from the mac to elks
- start a telnetd on the mac
- telnetting back to macOS from ELKS
- ls -l
Hmmm, works fine over here. I am running Homebrew telnet and telnetd, installed via "brew install telnet telnetd".
Telnet from macOS to ELKS via "telnet localhost 2323". Telnetd is started via "/usr/local/sbin/telnetd -debug 23 &". Telnet from ELKS to macOS via "telnet 192.168.0.10". Running "ls -l" on either side works without loss. Also tried "ls -lR /" on ELKS, and longer directories on macOS.
Using FWD="hostfwd=tcp:127.0.0.1:8080-10.0.2.15:80,hostfwd=tcp:127.0.0.1:2323-10.0.2.15:23" in qemu.sh.
Very strange indeed. It's exactly the same - except I'm telnetting to 10.0.2.2 from elks.
Will look closer in the morning.
-M
I am glad to hear we can't duplicate the curl issue, but I believe the possibility is still there, given enough packet losses or retransmits, since the core of the ktcp code wasn't changed. We only eliminated a driver packet overrun problem and tuned the system to slow down on large send windows.
OK, I was thinking some of the many adjustments you made, particularly at the buffer level, may have inadvertently (!) fixed the issue. I've been beating the system up pretty hard while testing in order to get the issue to repeat (lots of retransmits, overruns etc.), so it's indeed surprising that the error didn't show.
So - I ran sum() on the full disk image (512MB) transferred from ELKS, and sum() on the raw device on ELKS where the data came from (which took more than an hour) - indeed they are different. So the search continues.
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
—Mellvik
Hi @Mellvik,
I haven't made any changes to the system at the buffer level. There could be a number of causes for this; given the number of issues we've seen in ELKS, it could be unrelated to TCP and instead be a file I/O issue.
I ran sum() on the fill disk (512MB) transferred from elks, and sum() on the raw device on elks where the data cam from (which took more than an hour) - indeed they are different.
For debugging this issue, I would rather this be tested using known good tools, for instance transferring a file out of ELKS to Linux, and using the Linux cmp or diff command to compare binaries from different transfers. If it is a TCP issue, it will likely be far easier to debug the telnet issue first (which doesn't currently repeat on my system), which may be the root cause. It isn't clear running sum on a char device works properly on ELKS, for instance.
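The check being asked for amounts to finding the first offset where two copies of the transferred file differ, which Linux cmp and diff already report. Purely as an illustration (a hypothetical helper, not part of ELKS or the suggested workflow), a byte-by-byte comparison could look like this:

/* Minimal cmp-style comparison: report the first differing byte offset. */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *a, *b;
    long off = 0;
    int ca, cb;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }
    a = fopen(argv[1], "rb");
    b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 2;
    }
    for (;;) {
        ca = getc(a);
        cb = getc(b);
        if (ca != cb) {
            printf("files differ at byte %ld\n", off);   /* includes length mismatch */
            return 1;
        }
        if (ca == EOF) {
            printf("files identical (%ld bytes)\n", off);
            return 0;
        }
        off++;
    }
}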
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
LOL, not surprised - you're using a couple new ELKS programs!!
I ran sum() on the fill disk (512MB) transferred from elks, and sum() on the raw device on elks where the data cam from (which took more than an hour) - indeed they are different.
For debugging this issue, I would rather this be tested using known good tools, for instance transferring a file out of ELKS to Linux, and using the Linux cmp or diff command to compare binaries from different transfers.
Yes, that's essentially what I've been doing: using partial transfers from the raw drive and comparing them on Linux. The cmp on elks was a one-timer, and it's a good point that this may not be entirely predictable. (Then there is another source of errors - if I happen to mount the drive in between, I'm guaranteed to have differences.) So I'll switch to a big static file instead. Thing is - when I found everything to be OK yesterday, I had transferred some 50MB 3-4 times via curl with no differences, no errors.
If it is a TCP issue, it will likely be far easier to debug the telnet issue first (which doesn't currently repeat on my system), which may be the root cause. It isn't clear running sum on a char device works properly on ELKS, for instance.
That's a good point. I have more or less assumed it had to be a PTY issue, since data is actually lost, whereas in the file transfer case (curl) the data is corrupted, not lost.
In the process I fixed a bug in sum(), which has been reporting a wrong block count, and an inconvenience in dd(), which did not accept standard input and output. PR coming for those two.
LOL, not surprised - you're using a couple new ELKS programs!!
Indeed - and I guess your list is longer than mine: all the small things and some bigger ones… BTW - while continuing to beat up ktcp, I'm getting closer to a predictable hang situation under heavy load. We're getting there.
—Mellvik
I have more or less assumed it had to be a PTY issue, since data is actually lost,
For telnet, that seems to be the case. I would like to know how you confirmed it is losing 504-512 bytes every other time.
which in the file transfer case (curl), data is corrupted, not lost.
Knowing what the corrupted data has actually been replaced with would be interesting. Perhaps sending a very large text file that can be visually looked at would help. Is the data replaced by another disk block/tcp packet, or garbage? Stuff like that.
I'm getting closer to a predictable hang situation under heavy load.
That's going to be another hard-to-debug issue, not looking forward to that one.
I am thinking it might be good to get a 0.4 version out before continuing to tackle the endless supply of bugs...
@ghaerr,
More testing has exonerated TCP from the list of suspects in this issue. Incoming (to ELKS) was never in question; outgoing looked suspicious for a while, but tests this weekend - transfers totalling more than 4 million outgoing packets, partly on an idle system, partly shared with other traffic, retransmits etc. - have been consistently 100% correct.
BTW - the outgoing speed is consistent - 75.6k bytes per sec.
We're back to investigating lost data in telnet and/or ptys.
—Mellvik
Hello @Mellvik,
Thanks for your continued testing on this bug.
We're back to investigating lost data in telnet and/or ptys.
I've been so deep in other matters, I'm finding it hard to remember the details of this: Is the error only occurring when telnetting in to ELKS and using our telnetd?
Something to try: since you're saying we're losing approx 504 bytes every other 512 bytes, there is the chance that the PTY character queue is overflowing, which would drop 512 characters. It is the only queue that is 512 bytes long, so a likely suspect. In order to test this theory, change the following line in elks/include/linuxmt/ntty.h:
#define PTYOUTQ_SIZE 512 /* pty output queue size (=TDB_WRITE_MAX and telnetd buffer)*/
to another number, like perhaps 800 or 400, and make clean. The size doesn't have to be a power of two anymore. If the new "dropped characters" amount changes to near the new number, then we've confirmed that it's the PTY driver dropping the characters. I don't yet know why that would be; let's try this first.
If this doesn't change anything, the other culprit will be telnetd itself, which could possibly be losing an entire block of incoming or outgoing data somehow.
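As a rough illustration of the overflow theory (hypothetical names below - this is not the ELKS ntty code): a fixed-size character queue that silently discards writes when full loses a burst close to its size whenever the reader falls behind, which is why resizing PTYOUTQ_SIZE should shift the amount of data lost if this is the culprit.

/* Hypothetical sketch, not ELKS ntty code: a full ring buffer drops bytes. */
#define PTYOUTQ_SIZE 512

struct charq {
    unsigned char buf[PTYOUTQ_SIZE];
    unsigned int  head, len;        /* head = read index, len = bytes queued */
    unsigned long dropped;          /* a counter worth instrumenting for this test */
};

/* Returns 1 if the character was queued, 0 if it was dropped. */
static int charq_put(struct charq *q, unsigned char ch)
{
    if (q->len == PTYOUTQ_SIZE) {
        q->dropped++;               /* queue full: every byte arriving now is lost */
        return 0;
    }
    q->buf[(q->head + q->len) % PTYOUTQ_SIZE] = ch;
    q->len++;
    return 1;
}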
Thank you!
Thank you @ghaerr, That's a head start as I dive back into this tomorrow.
BTW - the reliability of elks data transfers is really encouraging. We're moving gigabytes back and forth without problems at reasonable speeds on old clunkers. Compared to where we were just a few months back, well, I guess I have mentioned it before: worth a 'skaal'!
--Mellvik
I guess I have mentioned it before: worth a 'skaal'!
Is that a Norwegian beer... or a toast?
You guessed it! Cheers!
M