ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086


ktcp status #610

Closed Mellvik closed 4 years ago

Mellvik commented 4 years ago

I believe it makes sense to open a new issue on ktcp: It has not been tested in a while, and much has changed in and around the kernel since.

I fired up KTCP on a physical machine this morning, and the overall status is: it does not work. Then the good news:

- With DEBUG enabled, all the initialization debug messages from ktcp print fine.
- There is (initially) no response when poking ELKS from the outside. Not telnet, not ping, not lynx.
- Killing telnetd gives one successful echo reply (ping). Killing httpd gives 3 successful echo replies.
- Starting httpd again gives another 3 replies.
- A new kill gives 7 replies, and so forth.
- So, when kicked one way or the other, KTCP works somewhat, indicating packet flow, and possibly something broken in the interrupt chain.

Possibly interesting observations:
- After a series of ping responses, KTCP eats 40 ping packets, then the client (from which the ping is coming) reports Host Unreachable again. This repeats itself consistently, regardless of how many pings were successfully echoed.
- Further, when the next batch of successful pings comes, the sequence number continues from the previous successful batch (3-4 replies), then jumps about 30, followed by another 3-4.

Finally, when I telnet out from ELKS, I get a Connecting to 10.0.2.1 message and maybe 20 successful echoes. Once I even had something coming back from the other end. Garbled, but in terms of byte count, something like a login prompt.
The telnet client on ELKS seems to have issues when run from a serial line: it triggers a lot of `Bad file number` messages on the console.

This is from ELKS:

# ps
  PID   GRP  TTY USER STAT CSEG DSEG  HEAP   FREE   SIZE COMMAND
    1     0      root    S 3e46 3ff0  3072  13252  26800 /bin/init 
    8     8    1 root    S 4e93 5030     0  16275  27344 /bin/getty /dev/tty1 
    9     9   S0 root    S 5540 611d  2193  13914  75648 -/bin/sh 
   10     9   S0 root    S 67b8 6bca  6144  10045  44832 ktcp 10.0.2.15 /dev/eth 
   23    23      root    S 49ca 72aa     0  16253  26864 httpd 
   21    21      root    S 4b6c 44d1     0  16251  25408 telnetd 
   24     9   S0 root    R 4ca7 7ec0  1025  14054  26672 ps 
# telnet 10.0.2.1
Connecting to 10.0.2.1 port 23

This is from the pinging host:

From 10.0.2.1 icmp_seq=67 Destination Host Unreachable
From 10.0.2.1 icmp_seq=68 Destination Host Unreachable
From 10.0.2.1 icmp_seq=69 Destination Host Unreachable
From 10.0.2.1 icmp_seq=70 Destination Host Unreachable
64 bytes from 10.0.2.15: icmp_seq=41 ttl=64 time=34171 ms
64 bytes from 10.0.2.15: icmp_seq=42 ttl=64 time=33151 ms
64 bytes from 10.0.2.15: icmp_seq=43 ttl=64 time=32128 ms
64 bytes from 10.0.2.15: icmp_seq=71 ttl=64 time=3096 ms
64 bytes from 10.0.2.15: icmp_seq=72 ttl=64 time=2069 ms
64 bytes from 10.0.2.15: icmp_seq=73 ttl=64 time=1042 ms
64 bytes from 10.0.2.15: icmp_seq=74 ttl=64 time=54.2 ms
From 10.0.2.1 icmp_seq=109 Destination Host Unreachable
From 10.0.2.1 icmp_seq=110 Destination Host Unreachable
From 10.0.2.1 icmp_seq=111 Destination Host Unreachable
From 10.0.2.1 icmp_seq=112 Destination Host Unreachable
From 10.0.2.1 icmp_seq=113 Destination Host Unreachable
From 10.0.2.1 icmp_seq=114 Destination Host Unreachable
From 10.0.2.1 icmp_seq=115 Destination Host Unreachable
From 10.0.2.1 icmp_seq=116 Destination Host Unreachable
From 10.0.2.1 icmp_seq=117 Destination Host Unreachable
From 10.0.2.1 icmp_seq=118 Destination Host Unreachable
From 10.0.2.1 icmp_seq=119 Destination Host Unreachable
From 10.0.2.1 icmp_seq=120 Destination Host Unreachable
From 10.0.2.1 icmp_seq=121 Destination Host Unreachable
From 10.0.2.1 icmp_seq=122 Destination Host Unreachable
From 10.0.2.1 icmp_seq=123 Destination Host Unreachable
From 10.0.2.1 icmp_seq=124 Destination Host Unreachable
From 10.0.2.1 icmp_seq=125 Destination Host Unreachable
From 10.0.2.1 icmp_seq=126 Destination Host Unreachable
64 bytes from 10.0.2.15: icmp_seq=75 ttl=64 time=54498 ms
64 bytes from 10.0.2.15: icmp_seq=76 ttl=64 time=53465 ms
64 bytes from 10.0.2.15: icmp_seq=77 ttl=64 time=52430 ms
64 bytes from 10.0.2.15: icmp_seq=78 ttl=64 time=51394 ms
64 bytes from 10.0.2.15: icmp_seq=79 ttl=64 time=50359 ms
64 bytes from 10.0.2.15: icmp_seq=80 ttl=64 time=49323 ms
64 bytes from 10.0.2.15: icmp_seq=106 ttl=64 time=22287 ms
64 bytes from 10.0.2.15: icmp_seq=107 ttl=64 time=21253 ms
64 bytes from 10.0.2.15: icmp_seq=108 ttl=64 time=20218 ms
64 bytes from 10.0.2.15: icmp_seq=127 ttl=64 time=487 ms
From 10.0.2.1 icmp_seq=162 Destination Host Unreachable
From 10.0.2.1 icmp_seq=163 Destination Host Unreachable
From 10.0.2.1 icmp_seq=164 Destination Host Unreachable
From 10.0.2.1 icmp_seq=165 Destination Host Unreachable

ghaerr commented 4 years ago

Thanks for testing!

However, bad news. It used to work, I'm almost sure Marc was using it years ago, but that could have been slip only.

A couple questions: I assume you're using it on a network card only? What is the startup command line exactly? Which source version are you running, the version before or after Christoph's mods?

@pawosm-arm has tested using slip, and said it works at 300 baud, except for incoming connections to telnetd.

In summary - we need to get to a working version, be it 300 baud slip or ethernet, before or after the waiting PR. If needed, we could go way back to an earlier kernel and/or ktcp version, as it will be much easier to debug from something working than apparently where it is now. And I will have to use an emulator, of which I don't have slip or NE2000 running yet. Ugh!

pawosm-arm commented 4 years ago

Indeed, I managed to establish a telnet connection from ELKS to Linux over SLIP (with ktcp on the ELKS side and slattach on the Linux side), having my own custom telnet service started on the Linux side (see [1]). This telnet server imitation enables interaction similar to a serial line connection using miniterm on the ELKS side and minicom on the Linux side, except that it is much slower at the same baud rate. Terribly slow, I should say. Yet the packets are definitely going both ways. The slowness of the communication is extremely suspicious though.

[1]. https://sourceforge.net/projects/stdiotelnetd

Mellvik commented 4 years ago

Good news. The symptoms were just too suspicious - like you said @ghaerr, it used to work.

A few hours later, via DOS and misc packet drivers - different machine and different NE2K - and we're sort of operational.

Or at least, we have two way traffic. Very unstable - 'select' runaway errors and hangs, so this will take some time. I'm using a commit from 24 hrs ago, the '-s 4800' is on the command line, but doesn't seem to affect anything.

netstat works:

netstat

Retransmition memory : 0 bytes
Number of control blocks : 3

no State       RTT    lport raddress rport
1  ESTABLISHED 4000ms 1024  0.0.0.0  2
2  LISTEN      4000ms 80    0.0.0.0  0
3  LISTEN      4000ms 23    0.0.0.0  0

Ping works 23 times just after reboot, then silence (seems like ktcp hangs). Incoming (to ELKS) telnet connects, but doesn't spawn a login or shell. Incoming lynx is successful - even after the telnet, which has to be terminated at the source. New netstat with lynx active:

netstat

Retransmition memory : 0 bytes
Number of control blocks : 4

no State       RTT    lport raddress rport
1  ESTABLISHED 4000ms 1024  0.0.0.0  2
2  CLOSE_WAIT  4000ms 23    10.0.2.1 -32394
3  LISTEN      4000ms 80    0.0.0.0  0
4  LISTEN      4000ms 23    0.0.0.0  0

Next command (nslookup) causes hang (after reporting 'Nameserver queried: 203……).

Testing continues - hints at specific things to test appreciated.

—Mellvik


Mellvik commented 4 years ago

Sign of life:

ghaerr commented 4 years ago

You're breathing life into it Helge!!

Things sound a bit promising, since it's working for at least a little while. I'm a bit curious as to your switching your DOS packet driver and now ELKS works better, untouched. So there could be issues on the DOS end also?

I'd like to see screenshots of the select printks, etc.

With regards to the older 24-hour commit, that's probably ok. The -s option that was added only affects slip operation, the other code wasn't changed, just cleanup.

However - telnetd.c WAS changed, a one-liner at line 28, where /bin/login is execed, rather than /bin/sh. You can change that back and recompile telnetd to see if that changes anything and you get a shell prompt. It wasn't working for Paul.

If you're running with full ktcp debugging, perhaps a screenshot or two when it's hung... however I can't duplicate anything over here yet on QEMU.

Mellvik commented 4 years ago

Things sound a bit promising, since it's working for at least a little while. I'm a bit curious as to your switching your DOS packet driver and now ELKS works better, untouched. So there could be issues on the DOS end also?

I figured there had to be an interrupt problem, so I booted DOS and couldn't get packets in, only out. So, like I mentioned, I switched machines and Ethernet cards. It turned out there was indeed an interrupt problem with the first NE2K card (it is soft-configurable and I didn't find the tool to change it), so I dug out an older card...

I'd like to see screenshots of the select printks, etc.

With regards to the older 24-hour commit, that's probably ok. The -s option that was added only affects slip operation, the other code wasn't changed, just cleanup.

Commit is af736b67.

However - telnetd.c WAS changed, a one-liner at line 28, where /bin/login is execed, rather than /bin/sh. You can change that back and recompile telnetd to see if that changes anything and you get a shell prompt. It wasn't working for Paul.

If you're running with full ktcp debugging, perhaps a screenshot or two when it's hung... however I can't duplicate anything over here yet on QEMU.

Coming!

-M

Mellvik commented 4 years ago

Boot messages & hang situation - after some activity. Command triggering the hang:

(Two screenshots attached: boot messages and the hang situation.)

ghaerr commented 4 years ago

Ok - so you're running ktcp directly from the /etc/rc.d/rc.sys startup script. That's fine. However, I added the untested "-b" (run in background) option, but looks like the error messages are working, so we're probably ok there. It used to be run from the shell with "&", but that closes output file descriptors so I added the option to make things cleaner.

The 'select: Bad file number' is not a kernel printk. It is coming from the select() call in arp.c. After @cjsthompson's first commit to his outstanding cleanup PR, we realized that the "static int tcpdevfd" was in error in multiple source files, and was cleaned up (that is, all occurrences combined). See discussion in #507 for details. Since there is only one file number in that select(), and that is tcpdevfd, which is static and = 0, that's the problem for sure.

As a result, I suggest adding Christoph's PR #607, and recompiling all of ktcp, I think it will fix that error.

Interestingly, it's possible that my -b option broke the incorrectly coded arp select() call - that is, when run async from the shell with "&", tcpdevfd would be the first file opened, and thus have the value of 0 - which would have matched an uninitialized static variable!!

We're making progress!

Thank you!

ghaerr commented 4 years ago

I got QEMU to run with the network card finally, and turned on CONFIG_ETH in .config, a couple things to report:

The "static int tcpdevfd" is definitely the problem with the "select Bad file number", all static int declarations of it need "static" removed, or use PR #607.

Also, it seems that the -b option I added some time back may be affecting operation. For the time being, I changed some lines in rootfs_template/etc/rc.d/rc.sys to remove it and run it async from the shell:

    echo 'Starting networking: ktcp'
    # run ktcp as background daemon if successful starting networking
    #if ktcp -b -s $ttybaud $localip $interface ; then
    ktcp  $localip $interface &
    if true ; then
        for daemon in telnetd httpd
        do

This seems to allow netstat to interoperate with ktcp. I still can't get telnet to work. I am running everything within ELKS QEMU at this point. I am not sure how to test httpd within ELKS.

Mellvik commented 4 years ago

Ok - so you're running ktcp directly from the /etc/rc.d/rc.sys startup script. That's fine.

Yes and no. During testing, I've started the processes manually most of the time, then killing and restarting as required. Having httpd and telnetd active prevents elvis from running (out of memory). [I really need to figure out the elks ed editor, I expected it to be like unix ed. It's not.]

However, I added the untested "-b" (run in background) option, but looks like the error messages are working, so we're probably ok there. It used to be run from the shell with "&", but that closes output file descriptors so I added the option to make things cleaner.

This does not seem to make any difference, i.e. the -b option seems fine.

The 'select: Bad file number' is not a kernel printk. It is coming from the select() call in arp.c. After @cjsthompson's first commit to his outstanding cleanup PR, we realized that the "static int tcpdevfd" was in error in multiple source files, and was cleaned up (that is, all occurrences combined). See discussion in #507 for details. Since there is only one file number in that select(), and that is tcpdevfd, which is static and = 0, that's the problem for sure.

As a result, I suggest adding Christoph's PR #607, and recompiling all of ktcp, I think it will fix that error.

Thanks - I'll pull it in and check.

Interestingly, it's possible that my -b option broke the incorrectly coded arp select() call - that is, when run async from the shell with "&", tcpdevfd would be the first file opened, and thus have the value of 0 - which would have matched an uninitialized static variable!!

We're making progress!

—Mellvik

ghaerr commented 4 years ago

Having httpd and telnetd active prevents elvis from running (out of memory).

Are you saying that running httpd and telnetd STOP elvis from running out of memory? It should be the other way around.

vi seems to be the application that uses the most memory at this time. Back when configurable L2 buffers were added, the default was increased from 64K to 128K of buffers. In further heavy ELKS use with all gettys running and multiple logins, I have seen ELKS run out of memory when trying to run vi. Probably not worth the tradeoff of 128K buffers when vi won't run, and now networking is active. I'm thinking that decreasing the default L2 buffers back to somewhere between 64K and 96K would be a good idea so vi always runs.

This does not seem to make any difference, i.e. the -b option seems fine.

I'm seeing definite differences in the debug output of ktcp and netstat, so it might be best to always run ktcp from the command line for our initial testing - FYI.

Mellvik commented 4 years ago

Having httpd and telnetd active prevents elvis from running (out of memory).

Are you saying that running httpd and telnetd STOP elvis from running out of memory? It should be the other way around.

No - consider the parenthesized content a possible explanation, not a continuation of the sentence.

vi seems to be the application that uses the most memory at this time. Back when configurable L2 buffers were added, the default was increased from 64K to 128K of buffers. In further heavy ELKS use with all gettys running and multiple logins, I have seen ELKS run out of memory when trying to run vi. Probably not worth the tradeoff of 128K buffers when vi won't run, and now networking is active. I'm thinking that decreasing the default L2 buffers back to somewhere between 64K and 96K would be a good idea so vi always runs.

This does not seem to make any difference, i.e. the -b option seems fine.

I'm seeing definite differences in the debug output of ktcp and netstat, so it might be best to always run ktcp from the command line for our initial testing - FYI.

Noted.

-M

Mellvik commented 4 years ago

More testing today:

added the changes from #607, removing the static declarations etc. Now the system hangs on any and all network accesses.

There is an additional complication that may affect the situation:

Some time over the last week a change has badly affected the XT keyboard driver. The key LEDs (capslock, numlock etc.) no longer work, and the keypad sends wrong characters.
In the same timeframe a change in the serial driver (presumably) has made serial access non-functional: output is 'eating' most of the data, and less than half of the bytes sent get to the terminal window. Input seems ok, but it's hard to tell when output is incomplete.

More testing on this tomorrow - @ghaerr, if you have any immediate guesses, I'd appreciate it. This probably belongs in its own thread.

--Mellvik

ghaerr commented 4 years ago

added the changes from #607, removing the static declarations etc. Now the system hangs on any and all network accesses.

Ok, a bit of a mess, when combined with the XT kbd and serial driver problems (please comment on those in #612).

Since #607 is not committed, we'll keep that on hold. Please do a 'git checkout -f master' and recompile from scratch, and we'll start working on each of the three problems separately.

I will submit a small PR that only changes the 'static int tcpdevfd' fix that is required for the arp.c "select bad file descriptor" problem. That will allow you to continue ktcp testing.

See #612 for serial and kbd fixes.

ghaerr commented 4 years ago

Helge, here's a minimal patch to fix the "select" and "static int tcpdevfd" errors without yet using the full PR #607. This works on my system using the latest git HEAD.

This patch also starts ktcp asynchronously from rc.sys rather than using the -b option, to keep previous changes to a minimum for testing.

I've reviewed #607 and can't see why it causes system hangs, so I'd like to see whether this does before I test #607 on my own machine.

netpatch.txt

Mellvik commented 4 years ago

Here's a serial-style screen dump from a minimal tcp test, which also tests /bootopts for the first time. This is indeed a great improvement for debugging. printk needs some CRs to improve readability, otherwise great. There is a ping running from the other end of the ethernet link, which is where the ICMPs are coming from.

# reboot
ttyS0 at 0x3f8, irq 4 is a 16450
                                ttyS1 at 0x2f8, irq 3 is a 16550A
                                                                 ttyS2 at 0x3e8, irq 5 is a 16550A
                  lp0 at 0x3bc, using polling driver
                                                    [eth] NE2K driver OK
                                                                        bioshd: gethdinfo CHS 979,5,17
                      bioshd: 1 floppy drive and 1 hard drive
                                                             Partitions: bda:(0,83215)  bda1:(1,6374)  bda2:(6375,1275)  bda3:(7650,41565)  bda4:(49215,34000) 
                                                                               device_setup: BIOS drive 0x0, root device 0x380
                                              PC/AT class machine, Intel 80286 CPU, 640K base RAM.
                  ELKS kernel 0.3.0 (56256 text + 6822 data + 53038 bss + 5676 heap)
    Kernel text at c0:0000, data at e7c:0000, 518K for user processes.
                                                                      fd: found valid ELKS disk parameters on /dev/fd0 boot sector
                                                  fd: /dev/fd0 probably has 18 sectors, 2 heads, and 80 cylinders
                                 VFS: Mounted root (minix filesystem).
                                                                      Running /etc/rc.d/rc.sys script
Starting networking: ktcp
KTCP: 1. local_ip

KTCP: 2. init tcpdev
KTCP: 3. init interface
Init /dev/eth
KTCP: 4. ip_init()
KTCP: 5. icmp_init()
 telnetd KTCP: 6. tcp_init()
KTCP: 7. netconf_init()
KTCP: 8. ktcp_run()
tcpdev_process : read 12 bytes
IP : ICMP packet
tcpdev_process : read 6 bytes
 httpd IP : ICMP packet
tcpdev_process : read 8 bytes
IP : ICMP packet
tcpdev_process : read 12 bytes
IP : ICMP packet
tcpdev_process : read 6 bytes
tcpdev_process : read 8 bytes
ELKS built from commit ae1aa713
Fri May 08 12:39:19 2020
IP : ICMP packet
IP : ICMP packet

IP : ICMP packet

EIP : ICMP packet
LKS 0.3.0

login: IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet
IP : ICMP packet

panic: Unable to get timer
                          apparent call stack:
                                              Line: Addr    Parameters
                                                                      ~~~~: ~~~~    ~~~~~~~~~~   
                 0: 87B0 => E9D6 F17A 7200 6F72 0A72 6400 7665   
                                                                 1: F136 => 1001 78E9 55FE E589 06FF 0300 468A   
                                 2: B8FE => 0000 0000 0000 0000 0000 0000 0000 
                                                                                 3: 0000 => 692F 696E 0074 642F 7665 632F 6E6F   
                                                 4: 6E69 => 0000 0000 0000 0000 0000 0000 0000   
                 5: 0000 => 692F 696E 0074 642F 7665 632F 6E6F   
                                                                 6: 6E69 => 0000 0000 0000 0000 0000 0000 0000   
                                 7: 0000 => 692F 696E 0074 642F 7665 632F 6E6F 
                                                                                 8: 6E69 => 0000 0000 0000 0000 0000 0000 0000
                                              SYSTEM HALTED - Press CTRL-ALT-DEL to reboot:

BTW - the halt at this point is hard, not soft - it needs a power cycle.

--M

Mellvik commented 4 years ago

Continuation from the previous comment: rebooting w/o the ongoing PING, and instead telnetting into elks, leaves us with the following. It's not consistent in that it doesn't stop at the same place every time, and sometimes just loops in retransmits. The elks telnetd responds with login: but continues after a few seconds (doesn't wait for newline) and execs /bin/login with whatever was typed, and emits 'Password:', where it actually waits. The tty (pty) echoes control characters like '^M' etc.

Anyway, this is not bad at all.

ELKS built from commit ae1aa713
Fri May 08 13:08:14 2020

ELKS 0.3.0

login: IP : TCP packet
Retrans buffers : 1 retrans : 4294942271 4294942271 remove
IP : TCP packet
IP : TCP packet
Retrans buffers : 1 retrans : 4294942272 4294942272 remove
IP : TCP packet
tcpdev_process : read 108 bytes
Retrans buffers : 2 retrans : 4294942279 4294942272not
retrans : 4294942272 4294942272 remove
IP : TCP packet
tcpdev_process : read 8 bytes
Retrans buffers : 2 retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
IP : TCP packet
Retrans buffers : 3 retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
IP : TCP packet
tcpdev_process : read 108 bytes
Retrans buffers : 5 retrans : 4294942287 4294942272not
retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
retrans : 4294942279 4294942272not
IP : TCP packet
tcpdev_process : read 108 bytes
Retrans buffers : 6 retrans : 4294942288 4294942279not
retrans : 4294942287 4294942279not
retrans : 4294942279 4294942279 remove
retrans : 4294942279 4294942279 remove
retrans : 4294942279 4294942279 remove
retrans : 4294942279 4294942279 remove
IP : TCP packet
tcpdev_process : read 108 bytes
Retrans buffers : 3 retrans : 4294942296 4294942287not
retrans : 4294942288 4294942287not
retrans : 4294942287 4294942287 remove
IP : TCP packet
Retrans buffers : 3 retrans : 4294942296 4294942287not
retrans : 4294942296 4294942287not
retrans : 4294942288 4294942287not
IP : TCP packet
tcpdev_process : read 8 bytes
Retrans buffers : 3 retrans : 4294942296 4294942288not
retrans : 4294942296 4294942288not
retrans : 4294942288 4294942288 remove
IP : TCP packet
Retrans buffers : 2 retrans : 4294942296 4294942296 remove
retrans : 4294942296 4294942296 remove
IP : TCP packet
Retrans buffers : 1 retrans : 4294942296 4294942296 remove
tcpdev_process : read 8 bytes
IP : TCP packet
Retrans buffers : 1 retrans : 4294942296 4294942296 remove
tcpdev_process : read 8 bytes
tcpdev_process : read 108 bytes
Retrans buffers : 1 retrans : 4294942298 4294942296not
IP : TCP packet
Retrans buffers : 1 retrans : 4294942298 4294942298 remove
tcpdev_process : read 108 bytes
Retrans buffers : 1 retrans : 4294942325 4294942298not
IP : TCP packet
Retrans buffers : 1 retrans : 4294942325 4294942325 remove
tcpdev_process : read 108 bytes
Retrans buffers : 1 retrans : 4294942326 4294942325not
IP : TCP packet
Retrans buffers : 1 retrans : 4294942326 4294942326 remove
tcpdev_process : read 108 bytes
Retrans buffers : 1 retrans : 4294942331 4294942326not
IP : TCP packet
Retrans buffers : 1 retrans : 4294942331 4294942331 remove
tcpdev_process : read 108 bytes
Retrans buffers : 1 retrans : 4294942334 4294942331not
IP : TCP packet
Retrans buffers : 1 retrans : 4294942334 4294942334 remove

On the client:

root@raspberrypi:/home/helge# telnet 10.0.2.15
Trying 10.0.2.15...
Connected to 10.0.2.15.
Escape character is '^]'.
login: rootassword:^M
Login incorrect

login: Password:^M
Login incorrect

login: Password:

Using netcat on the client side I at one point managed to login and get a shell prompt. It stopped (nethang, not system hang) after that (ps reporting the correct terminal).

  PID   GRP  TTY USER STAT CSEG DSEG  HEAP   FREE   SIZE COMMAND
    9     9   p0 root    S 5d34 41b0  1167   2708  67456 -/bin/sh

ghaerr commented 4 years ago

printk needs some CRs to improve readability, otherwise great.

What terminal emulator are you using to read the serial console? I'm running on macOS Terminal, which doesn't have this problem. Perhaps an option can be turned on that converts LFs to CRLFs.

If this is not easily possible, then I can look into always converting LF -> CRLF for printk's in the kernel. Let me know so this can be fixed ASAP for you.

ghaerr commented 4 years ago

First, this is using the latest commits, including #614 (basic fixes to enable network testing), correct?

With ongoing PING, sounds like there is nasty memory corruption in either the /dev/eth driver or ktcp. We will need to find out which by running ktcp with slip at some point soon.

rebooting w/o the ongoing PING, and instead telnetting into elks, leaves us with the following. It's not consistent in that it doesn't stop at the same place every time, and sometimes just loops in retransmits.

Sounds like networking code is still buggy in other places.

The elks telnetd responds with login: but continues after a few seconds (doesn't wait for newline) and execs /bin/login with whatever was typed, and emits 'Password:', where it actually waits. The tty (pty) echoes control characters like '^M' etc.

That's exciting - that's the first time we've seen telnetd working on ethernet.

I'm pretty certain the ^M echoing is on your Pi side, check the stty options there.

Anyway, this is not bad at all.

Can you get logged in, or does the system crash afterwards? Suggest looking at ps to see if there's any memory left just before entering the password, the problem may be no more system memory. Run meminfo as well to see.

Mellvik commented 4 years ago

printk needs some CRs to improve readability, otherwise great.

What terminal emulator are you using to read the serial console? I'm running on macOS Terminal, which doesn't have this problem. Perhaps an option can be turned on that converts LFs to CRLFs.

If this is not easily possible, then I can look into always converting LF -> CRLF for printk's in the kernel. Let me know so this can be fixed ASAP for you.

I'm running MacOS terminal too, ssh to raspberrypi -> screen(1) to serial. Will take a look under the hood and follow up.

-M

Mellvik commented 4 years ago

First, this is using the latest commits, including #614 (basic fixes to enable network testing), correct?

Yes.

With ongoing PING, sounds like there is nasty memory corruption in either the /dev/eth driver or ktcp. We will need to find out which by running ktcp with slip at some point soon.

It seems the ping crash is entirely repeatable: it bombs after 20 pings.

rebooting w/o the ongoing PING, and instead telnetting into elks, leaves us with the following. It's not consistent in that it doesn't stop at the same place every time, and sometimes just loops in retransmits.

Sounds like networking code is still buggy in other places.

The elks telnetd responds with login: but continues after a few seconds (doesn't wait for newline) and execs /bin/login with whatever was typed, and emits 'Password:', where it actually waits. The tty (pty) echoes control characters like '^M' etc.

That's exciting - that's the first time we've seen telnetd working on ethernet.

I'm pretty certain the ^M echoing is on your Pi side, check the stty options there.

Quite possible. It's different with nc, but nc (as I'm using it here) is line oriented, so it's not a good comparison. To be investigated.

Anyway, this is not bad at all.

Can you get logged in, or does the system crash afterwards? Suggest looking at ps to see if there's any memory left just before entering the password, the problem may be no more system memory. Run meminfo as well to see.

Like the 2nd report indicated, yes, I can log in via netcat, even get a shell prompt. Nothing further. The shell does not seem to get any of the input.

—Mellvik

ghaerr commented 4 years ago

yes, I can log in via netcat, even get a shell prompt. Nothing further. The shell does not seem to get any of the input.

Run meminfo when you have the shell prompt connected via netcat. That will report how much free memory there is, in case this might be a memory problem. Another thought is to login as toor, and use sash. Unfortunately, sash is very big with all the builtins and is only 15k smaller than ash at this point. But might be interesting to compare.

Unfortunately I can't replicate anything over here - running QEMU when I try "telnet localhost" nothing happens. I can't use an external program for testing!!

Mellvik commented 4 years ago

yes, I can log in via netcat, even get a shell prompt. Nothing further. The shell does not seem to get any of the input.

Run meminfo when you have the shell prompt connected via netcat. That will report how much free memory there is, in case this might be a memory problem. Another thought is to login as toor, and use sash. Unfortunately, sash is very big with all the builtins and is only 15k smaller than ash at this point. But might be interesting to compare.

Will do! With the recent memory improvements this may just be possible.

Unfortunately I can't replicate anything over here - running QEMU, when I try "telnet localhost" nothing happens. I can't use an external program for testing!!

—

It would be extremely beneficial to get you closer into the Debugging loop. There must be a way to get this to work reasonably w qemu. I'm inclined to spend some time figuring that out.

Not knowing anything about the inside of ktcp, would it make sense to locate and fix the icmp echo problem to begin with, assuming it's on the 'outside' softwarewise as it is in the stack?

Btw, outgoing telnet (from elks) connects, then disconnects, leaving the calling shell in a loop, printing prompts. This happens only if started from the console. If started from serial, outgoing telnet just hangs. Seems to me there may be a stdio problem (to be tested) in both telnet situations: when I get a shell from telnetd in nc on the client, that shell seems to expect input from the console, not via the ttyp.

Again, given the improvements over just a few days, inbound & outbound telnet should be operational in a matter of days.

--M

ghaerr commented 4 years ago

It would be extremely beneficial to get you closer into the Debugging loop. There must be a way to get this to work reasonably w qemu. I'm inclined to spend some time figuring that out.

I just got incoming QEMU figured out! Have duplicated httpd running well, telnetd gets a shell and fails with double characters and other problems. Will be figuring this out and submitting PRs shortly. Still don't have outgoing working, so can't test telnet yet.

Not knowing anything about the inside of ktcp, would it make sense to locate and fix the icmp echo problem to begin with, assuming it's on the 'outside' softwarewise as it is in the stack?

QEMU supports host forwarding, I'll be working on seeing if I can ping ELKS. I took a quick look and ktcp is supposed to support ICMP echo.

Btw, outgoing telnet (from elks) connects, then disconnects, leaving the calling shell in a loop, printing prompts. This only if started from the console. If started from serial, outgoing telnet just hangs.

I can't test that yet, but already have a handful of problems I can finally duplicate :)

ghaerr commented 4 years ago

Hello @Mellvik,

An update on networking: I've deep-dived into it, and it's a big mess. There's lots of work to be done, and I'm a bit hesitant to start, as it could take lots of time.

I have incoming-networking only working, so only tested that. Outgoing remains entirely untested, and internal (localhost) connections appear not to work at all.

Brief synopsis:

- telnetd is completely broken. It doesn't implement any telnet sequences; it's basically a stripped-down login-style daemon. As a result, it works terribly with any telnet-protocol compliant client, and reverts telnet clients to line mode, or double-echoing, which makes it unusable.
- httpd appears to work well.
- ktcp works well for a few minutes or connections, then fails with unreported memory allocation failures and subsequent use of NULL pointers. I've got that fixed, but there remains a big memory corruption problem that causes it to run out of memory improperly and stop executing after a few minutes.
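
To make the telnetd point concrete, here is a rough sketch of the least a daemon can do with telnet sequences: strip IAC commands out of the data stream and politely refuse every option the client offers or requests. This is not the ELKS or MINIX telnetd code; `telnet_filter` and its buffers are hypothetical names, it assumes a command is never split across two reads, and it ignores subnegotiation (SB) entirely. The command byte values are the standard telnet ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard telnet command bytes. */
enum { TEL_WILL = 251, TEL_WONT = 252, TEL_DO = 253, TEL_DONT = 254, TEL_IAC = 255 };

/*
 * Hypothetical helper: copy user data from in[0..n) to out, and build
 * refusals for option requests in reply. out and reply must each hold
 * at least n bytes. Returns the number of data bytes; *rlen receives
 * the number of reply bytes to send back to the client.
 */
static size_t telnet_filter(const uint8_t *in, size_t n,
                            uint8_t *out, uint8_t *reply, size_t *rlen)
{
    size_t i = 0, o = 0, r = 0;

    assert(in && out && reply && rlen);
    while (i < n) {
        if (in[i] != TEL_IAC) {          /* ordinary data byte */
            out[o++] = in[i++];
            continue;
        }
        if (i + 1 >= n)
            break;                       /* truncated command, drop it */
        uint8_t cmd = in[i + 1];
        if (cmd == TEL_IAC) {            /* IAC IAC is an escaped 0xFF */
            out[o++] = TEL_IAC;
            i += 2;
            continue;
        }
        if (cmd >= TEL_WILL && cmd <= TEL_DONT) {
            if (i + 2 >= n)
                break;                   /* option byte missing */
            uint8_t opt = in[i + 2];
            if (cmd == TEL_DO) {         /* client asks us to enable: refuse */
                reply[r++] = TEL_IAC; reply[r++] = TEL_WONT; reply[r++] = opt;
            } else if (cmd == TEL_WILL) { /* client offers to enable: refuse */
                reply[r++] = TEL_IAC; reply[r++] = TEL_DONT; reply[r++] = opt;
            }
            i += 3;                      /* WONT/DONT need no answer */
            continue;
        }
        i += 2;                          /* other two-byte commands ignored */
    }
    *rlen = r;
    return o;
}
```

Refusing everything only keeps both ends in the protocol's default mode; a usable telnetd would additionally negotiate echo and character-at-a-time operation, which is where the double-echo and line-mode symptoms come from.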

Overall, ktcp is poorly written, unfinished, has corruption problems, and is married to the kernel /dev/tcp kluge within the socket implementation. We really need something like Phil Karn's well-written TCP/IP suite, but I fear it would take lots of work to understand the kernel handling and subsequent passing to user code and back.

One scenario is to add significant good debugging into ktcp, but it suffers ultimately from never being finished, and writing TCP/IP network code from scratch is not a great idea these days. I fear it can't be made bug-free unless its memory corruption is tracked down. In addition, telnetd needs to be completely replaced with the MINIX 2 version, which I was going to do, but ktcp won't run long enough to be useful. Running outside applications using telnetd and httpd simultaneously will cause ktcp to crash within seconds.

Another possibility is to ditch ktcp and move to userland-only networking. This would be equivalent to running Karn's code or a micro TCP stack as a user program complete within itself, and stay out of the kernel, at least at the start. This could work great, but ultimately fails because we need sockets in the kernel to support arbitrary networking. Since ELKS is real mode, I'm not sure how important this is, versus getting real usable networking running.

Mellvik commented 4 years ago

@ghaerr, thanks for a thorough rundown and summary. Very enlightening indeed. Whatever the status is, now we have a baseline. (Which somewhat ironically reminds me of the BSD 4.1-4.2 transition: Integrate external networking code (from BBN) or develop from scratch - in hindsight they made the right choice - and created sockets (and more) to boot).

In our case that would be a choice between ktcp (somewhat working) and Phil Karn's code (not entirely from scratch, but as I understand it, a clean kernel-integrated approach). Fully userland just doesn't taste good.

I appreciate the hesitation to take on the challenge of the Karn approach, and my choice - in the name of speed (although an assumption) - would be to go for a cleanup of ktcp and get file transfer (either webdav via the current http server or a new ftpd) and a telnetd stable, leaving outgoing clients for now.

From a practical perspective, having two-way file transfer operational would ease debugging a lot, making the make--move-floppy-to-ext-drive--dd-to-floppy--move-floppy-to-PC--boot'n'watch sequence much shorter and faster. Helped further by the newly added serial console flexibility.

My 2¢...

--Mellvik


pawosm-arm commented 4 years ago

In our case that would be a choice between ktcp (somewhat working) and Phil Karn's code (not entirely from scratch, but as I understand it, a clean kernel-integrated approach). Fully userland just doesn't taste good.

For me it's not a matter of taste, it's the practicality. A fully userland solution isn't that bad and isn't that uncommon, especially if you're limited in space for the kernel. Shifting the burden to a daemon running in userspace is the solution then.

ghaerr commented 4 years ago

Update on networking status, and thanks for both your comments.

After careful consideration of the work involved, I decided to stick with ktcp and attempt to corral the memory problems. Also, I retract my statement that ktcp is poorly written. The architecture, now that I'm coming to understand it, is nicely albeit crudely written, though definitely unfinished. Not all memory allocations were checked, and there were no provisions for handling unbounded memory growth. There is no routing, very little IP, no UDP. TCP doesn't handle error cases well.

I've now got ktcp able to run for longer periods of time, have fixed a big memory overwrite bug and have added code to prevent it from running out of memory. I have tracked down the unbounded memory growth problem, which happens for a couple of reasons: 1) for reasons yet unknown, TCP gets out of sync, and unlimited packets are allocated for retransmission, which grows to a very large size until malloc fails, and 2) TCP close wait operations with httpd do a FIN wait forever, and never close the connection. Since httpd thinks the connection is terminated, it starts a new one for the next browser refresh; each LISTEN/ESTABLISH operation requires 2106 bytes, and ktcp runs out of memory. I hacked the above problems into submission by forcing a ktcp TCP reset on all connections when too many packets are awaiting retransmission, and ktcp now at least stays running.
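
For anyone following along, the idea of the workaround can be sketched roughly as below. This is an illustration only, not the actual ktcp code: `rexmit_queue`, `rexmit_add`, `rexmit_flush` and the `MAX_REXMIT_PKTS` limit are hypothetical names and values, and the flush merely stands in for resetting the connections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_REXMIT_PKTS 8       /* hypothetical cap, not a ktcp value */

struct rexmit_pkt {
    struct rexmit_pkt *next;
    size_t len;
    unsigned char data[];       /* copy of the segment awaiting an ACK */
};

struct rexmit_queue {
    struct rexmit_pkt *head;
    int count;                  /* packets currently queued */
    int resets;                 /* how many times the queue was flushed */
};

/* Free every queued packet; stands in for "reset all connections". */
static void rexmit_flush(struct rexmit_queue *q)
{
    while (q->head) {
        struct rexmit_pkt *p = q->head;
        q->head = p->next;
        free(p);
    }
    q->count = 0;
    q->resets++;
}

/*
 * Queue a copy of a segment for possible retransmission.
 * Returns 0 on success, or -1 if the cap was hit or malloc failed,
 * in which case the whole queue has been flushed.
 */
static int rexmit_add(struct rexmit_queue *q, const void *buf, size_t len)
{
    struct rexmit_pkt *p;

    assert(q != NULL && buf != NULL);
    if (q->count >= MAX_REXMIT_PKTS) {   /* bound the memory growth */
        rexmit_flush(q);
        return -1;
    }
    p = malloc(sizeof(*p) + len);
    if (p == NULL) {                     /* every allocation is checked */
        rexmit_flush(q);
        return -1;
    }
    memcpy(p->data, buf, len);
    p->len = len;
    p->next = q->head;
    q->head = p;
    q->count++;
    return 0;
}
```

The point of the cap is that a failure becomes a visible, recoverable event (a reset) rather than a silent malloc failure followed by a NULL dereference.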

telnetd now runs for minutes at a time, so it now makes sense to port a real telnetd and that will fix our telnetd problem, as the current implementation can't work with a real telnet client. ktcp doesn't recognize a shell exit or connection close for some reason, and subsequent telnet requests don't work.

httpd will run 3-4 browser refreshes before ktcp runs out of memory on new listen operations. Also, httpd zombies remain and ELKS runs out of process slots.

All of this is being tested with NE2K card emulation on QEMU. I badly need a method of testing SLIP/CSLIP to determine at which level the bugs are. Can anyone help to set up a test SLIP scenario using an emulator?

I plan on submitting a few cleanup PRs that should allow testing volunteers to help move this forward.

ghaerr commented 4 years ago

From a practical perspective, having two-way file transfer operational would ease debugging a lot, making the make--move-floppy-to-ext-drive--dd-to-floppy--move-floppy-to-PC--boot'n'watch sequence much shorter and faster.

That's a good idea - we need a tftp implementation. I'm looking at MINIX, our telnet is based on MINIX, our telnetd will be. I'm thinking of porting rlogin. But MINIX doesn't have any tftp. Any suggestions on a very basic implementation that would be easy to port?

It would be nice to then write a script that automatically network copied a full disk image over to a PC running ELKS, wrote it to floppy, and booted it.

I ran across an unused older PC at the office, and thought "finally - I can test serial!". However, it's got 2 CDs and a 3.5" floppy - I have no way to boot ELKS on it!

Helped further by the newly added serial console flexibility.

Thanks for using the new tools. They take quite a bit of time to implement, but I'm finding they really help saving time finding bugs and moving ELKS forward.

Mellvik commented 4 years ago

That's a good idea - we need a tftp implementation. I'm looking at MINIX, our telnet is based on MINIX, our telnetd will be. I'm thinking of porting rlogin. But MINIX doesn't have any tftp. Any suggestions on a very basic implementation that would be easy to port?

Maybe 4.2bsd or freebsd versions would do?

It would be nice to then write a script that automatically network copied a full disk image over to a PC running ELKS, wrote it to floppy, and booted it.

Spot on!

I ran across an unused older PC at the office, and thought "finally - I can test serial!". However, it's got 2 CDs and a 3.5" floppy - I have no way to boot ELKS on it!

Is there a network card? If yes, and there is a packet driver for the i/f, you're ok - via msdos and MTCP. If not, you need a usb floppy ...

-M

pawosm-arm commented 4 years ago

I ran across an unused older PC at the office, and thought "finally - I can test serial!". However, it's got 2 CDs and a 3.5" floppy - I have no way to boot ELKS on it!

I had a similar issue with an Amstrad PC 2086: it had only a 3.5'' double density (720k) floppy drive, so it was a challenge to boot it before ELKS hd images became fully bootable. Fortunately, I could buy a USB floppy drive and connect it to a Linux PC. Surprisingly, I found it harder to find 720kB floppies than the drive itself. Still, booting an old PC from a CF card connected to a CF-IDE adapter is a much more reliable solution.

ghaerr commented 4 years ago

you need a usb floppy ...

Would that be a BIOS option, or have you found 15-year old systems will boot from USB automatically if no CD or FD is inserted?

I'm guessing one could just use a FAT USB stick and dd an ELKS FAT boot image onto it?

Mellvik commented 4 years ago

I plan on submitting a few cleanup PRs that should allow testing volunteers to help move this forward.

Looking forward to it.

Btw - not urgent but the ability to set the ethercard irq in config would simplify testing on real hw. There is a generation of great ne2k cards out there that refuse both irq2 and irq9.

-M

Mellvik commented 4 years ago

you need a usb floppy ...

Would that be a BIOS option, or have you found 15-year old systems will boot from USB automatically if no CD or FD is inserted?

I'm guessing one could just use a FAT USB stick and dd an ELKS FAT boot image onto it?

To connect to your mac to generate bootables.

-M

ghaerr commented 4 years ago

the ability to set the ethercard irq in config would simplify testing on real hw. There is a generation of great ne2k cards out there that refuse both irq2 and irq9.

That's already done - just edit NE2K_IRQ in include/arch/ports.h and also set the CONFIG_NEED_IRQx under the CONFIG_ETH_NE2K define.

Try it out!

The only thing not done for configurable ports to be completed is someone needs to edit drivers/net/ne2k-mac.s to offset each of the ne2k registers from a NE2K_BASE value (default 0x300). I couldn't do it before since I couldn't test it.
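
What I have in mind for the register offsets is roughly the following, shown in C rather than assembler just to illustrate the idea. The offsets are the usual NE2000/DP8390 page-0 layout; `ne2k_port` and the enum names are made up for this sketch and don't exist in the tree.

```c
#include <assert.h>
#include <stdint.h>

#define NE2K_BASE 0x300u    /* default base; the value to make configurable */

/* A few DP8390 page-0 register offsets in the standard NE2000 layout. */
enum ne2k_reg {
    NE2K_CR    = 0x00,      /* command register */
    NE2K_ISR   = 0x07,      /* interrupt status register */
    NE2K_IMR   = 0x0F,      /* interrupt mask register (write) */
    NE2K_DATA  = 0x10,      /* remote DMA data port */
    NE2K_RESET = 0x1F       /* reset port */
};

/* Hypothetical helper: compute the I/O port for a register at a given base. */
static uint16_t ne2k_port(uint16_t base, enum ne2k_reg reg)
{
    assert(reg <= NE2K_RESET);          /* the card decodes 32 ports */
    return (uint16_t)(base + reg);
}
```

With every access written as base plus offset, moving the card to, say, 0x280 becomes a one-line change instead of an edit of each hardcoded port number.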

Mellvik commented 4 years ago

All of this is being tested with NE2K card emulation on QEMU. I badly need a method of testing SLIP/CSLIP to determine at which level the bugs are. Can anyone help to set up a test SLIP scenario using an emulator?

In this case VB could be your simplest choice. Serial is easy, netcat your friend, getting a dependable 'other end' to connect to may be the biggest challenge. CSLIP on macos may be hard, cslip in a 2nd VB instance running linux, maybe even DOS easier...

I can take a stab at setting up something like this later in the coming week. And export the images to you.

-M

Mellvik commented 4 years ago

The only thing not done for configurable ports to be completed is someone needs to edit drivers/net/ne2k-mac.s to offset each of the ne2k registers from a NE2K_BASE value (default 0x300). I couldn't do it before since I couldn't test it.

If you make the changes required, I'll test it.

-M

cjsthompson commented 4 years ago

Is there something still left to do for being able to change ne2k's IRQ to something other than 9? For some reason PCem doesn't allow setting its ne2k to IRQ9, only 3, 5, 7, 10, 11 and 12. I've tried editing ports.h to 10 but that didn't work. Then 7 and I get a panic at boot (I did not compile in parallel port support in the kernel).

ghaerr commented 4 years ago

Is there something still left to do for being able to change ne2k's IRQ to something else than 9?

No, it should be done, but is untested.

I've tried editing ports.h to 10 but that didn't work.

Did you change both CONFIG_NEED_IRQ10 and NE2K_IRQ?

Perhaps try using IRQ 3, in case PCem thinks you're running on XT hardware and the 8259 cascading isn't working properly. Make sure you don't have COM2 hardware, and/or set COM2_IRQ to something else, say 5.

I did not compile in parallel port support in the kernel).

The parallel port code doesn't use any interrupts, so this panic must have been for a different problem with IRQ 7.

I will try changing the IRQ to 10 with QEMU and see if 10 will work for me.

cjsthompson commented 4 years ago

Did you change both CONFIG_NEED_IRQ10 and NE2K_IRQ?

Yes, I did change both.

Perhaps try using IRQ 3, in case PCem thinks you're running on XT hardware and the 8259 cascading isn't working properly. Make sure you don't have COM2 hardware, and/or set COM2_IRQ to something else, say 5.

I did not compile in parallel port support in the kernel).

The parallel port code doesn't use any interrupts, so this panic must have been for a different problem with IRQ 7.

I will try changing the IRQ to 10 with QEMU and see if 10 will work for me.

Ok, I'll try IRQ 3, see how it goes.

cjsthompson commented 4 years ago

Same problem with IRQ3:

[Two screenshots attached: Screenshot_2020-05-12_17-01-28, Screenshot_2020-05-12_17-01-39]

ghaerr commented 4 years ago

The panic is strange, as the init_IRQ procedure that can call that panic() is called very early in the ELKS boot process. Indicative of complete loss of control. I think I remember seeing this same problem from something @Mellvik was getting though.

It is certainly possible that the IRQ remapping code in general isn't working; I'm struggling to get a QEMU test setup for it over here.

I'm trying to get QEMU to accept another irq= line, but, as usual, QEMU's options are insanely complicated and the current -nic command won't accept an IRQ, so I'm still trying to get that tested.

ghaerr commented 4 years ago

Hello @cjsthompson,

I finally untangled the QEMU command line to get it to work with a configurable IRQ line for networking. I tested it for IRQ 9, then changed it to IRQ 3, did a make clean; make, and everything works.

Here's the diff:

--- a/elks/include/arch/ports.h
+++ b/elks/include/arch/ports.h
@@ -40,7 +40,7 @@

 #ifdef CONFIG_CHAR_DEV_RS
 #define CONFIG_NEED_IRQ4               /* COM1*/
-#define CONFIG_NEED_IRQ3               /* COM2*/
+//#define CONFIG_NEED_IRQ3             /* COM2*/
 //#define CONFIG_NEED_IRQ5             /* COM3*/
 //#define CONFIG_NEED_IRQ2             /* COM4, XT only*/
 #endif
@@ -51,7 +51,8 @@
 //#define CONFIG_NEED_IRQ8

 #ifdef CONFIG_ETH_NE2K
-#define CONFIG_NEED_IRQ9
+//#define CONFIG_NEED_IRQ9
+#define CONFIG_NEED_IRQ3
 #endif

 /* unused*/
@@ -97,7 +98,7 @@

 /* ne2k, eth-main.c*/
 #define io_ne2k_command 0x0300         /* FIXME needs to be included in ne2k-mac.s*/
-#define NE2K_IRQ       9
+#define NE2K_IRQ       3
 #define NE2K_PORT      0x300

Since we're both testing on emulators, perhaps PCem still thinks the IRQ is used for other emulated hardware? After moving to IRQ 3 and everything working, I also tested QEMU with irq=9, and it fails, as expected.

Does PCem fail immediately on boot, or only after a bit? I can't tell from the dual screenshots. The network card interrupt processing is pretty straightforward, and contained in drivers/net/eth-main.c.

Here is the networking portion of the QEMU command line used to get this working, should you want to play with that. I can send you an updated qemu.sh if you wish.

NET="-netdev user,id=mynet0,hostfwd=tcp:127.0.0.1:2323-10.0.2.15:23 -device ne2k_isa,irq=3,netdev=mynet0"

cjsthompson commented 4 years ago

That may well be the problem, since IRQ 3 is for serial ports and you can't disable joystick support maybe IRQ3 is always used for the serial port of the joystick. I selected Amstrad mouse so this doesn't use a serial port, but may use another IRQ too. IRQ5 is used by the hard drive controller. IRQ 7 doesn't work even if I disable LPT support. I don't think an XT has more than IRQ8. But I could be wrong about that. But when I tried IRQ10 it didn't work anyway. So this may be a PCem limitation: either network card or hard drive controller but not both. I think I read that the XT-CF/XT-IDE controllers do not use an IRQ, so having both a hard drive and a network card should be possible at least on real hardware. The XT-IDE controller is emulated in PCem but for some reason it doesn't work with the Amstrad.

So I guess I'm going to have to go with Qemu. The shell script would be helpful.

Thanks,

ghaerr commented 4 years ago

So I guess I'm going to have to go with Qemu. The shell script would be helpful

Here's the script (qq.txt) I'm currently working on, it will be integrated into qemu.sh shortly. The line you can use for testing is the uncommented NET= line. This script allows host->ELKS (host telnet -> ELKS telnetd and host browser -> ELKS httpd). I haven't yet figured out how to get qemu to forward telnet connect packets from ELKS out to the host, although from some other comments in this script, it is supposed to.

I also use the (qs.txt) script to access the ELKS serial ttyS0 login; it is quite handy. It is the same except the CONSOLE= and SERIAL= lines are different, and it uses the older HOSTFWD line which doesn't allow network card irq remapping. This script is very useful to redirect the kernel console to serial, which can be done by editing the rootfs_template/bootopts file.

Remove the .txt extension and run by ./qq.

Let me know whether these work for you, as qemu has several versions out, and their command lines aren't compatible. Ugh.

qq.txt qs.txt

Mellvik commented 4 years ago

First: ktcp still introduces one or two (false) keyboard characters when started. FWIW: the appearance of these characters at the login prompt is preceded by seemingly legitimate keyboard interrupts. I discovered this when I had debugging active in xt_key.c. Different from before is that these two characters now also appear when ktcp is started from the command line. Possibly even more interesting: This phenomenon disappears when the IRQ is changed to something other than 9. I have on my list to test this on a different type of machine (which in my case means a Portable 386).

Scenario: boot w/o autostarting ktcp

Manual start: # ktcp 10.0.2.15 /dev/eth 10.0.2.2 255.255.255.0 &
Ping from the outside works - but only 16 times. Then ktcp dies and needs to be restarted. Kill works.
Restart ktcp, ping succeeds 9 times. Now the system locks up hard (powercycle). No console messages.
Restart, no pinging this time, start telnetd
- incoming telnet connects, then disconnects, console printk: tcp: RETRANS limit, timeruse 7 [this is consistently repeatable, timeruse may vary between 6 and 7], netstat says  2 CLOSED 3562ms 23 10.0.2.2 58040. There is now a new /bin/login process w/ttyp0 in the process table as expected.
- 2nd attempt also connects, but hangs (client side). Netstat reports SYN_RECEIVED and stays that way
- 3rd attempt fails (client side), reporting no route to host (i.e. no reply).
- Netstat on elks says 'Retransmit memory 1 bytes' and the listing is empty.
- At this point the system hangs (hard hang, requiring power cycle) most of the time.
- At one point, after telnetd failed like above, httpd was started, resulting in a ktcp: panic in read message, and ktcp exited.
- Restarting ktcp at this point did not restore any connectivity.
Outgoing telnet needs an external wakeup - as pointed out by @ghaerr - in order to work at all. Here's what I did:
- reboot, start ktcp manually
- ping from the outside, abort after just a couple of replies
- Telnet out and I get login prompt from the remote system
- enter and the process hangs. A few seconds later I get tcp: RETRANS limit, timeruse 6
- After this, the RETRANS message is repeated for every 5th char typed into telnet, no echo.

Finally: I've tested ethernet hardware with a mix of IRQs and IO addresses. It works - and it's (still somewhat) complicated. First it took me a while to figure out that ne2k-phy.s has io-addresses hardcoded. Then - having trouble with interrupts - I finally found that I'd forgotten the CONFIG_NEEDIRQ define, thinking that setting the NE2K defines would do. Since this is not something any of us do regularly, it is easy to miss something. In other words, still an issue for improvement AFAIK.
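
One way to make the forgotten define harder to miss would be a build-time check along these lines. Purely a suggestion: the two defines at the top stand in for whatever is set in ports.h, only a few IRQ values are covered, and `ne2k_configured_irq` is a made-up helper for illustration.

```c
#include <assert.h>

/* Stand-ins for the settings that live in ports.h. */
#define NE2K_IRQ 3
#define CONFIG_NEED_IRQ3

/* Stop the build if the handler for the chosen IRQ is not enabled. */
#if NE2K_IRQ == 3 && !defined(CONFIG_NEED_IRQ3)
#error "NE2K_IRQ is 3 but CONFIG_NEED_IRQ3 is not defined"
#elif NE2K_IRQ == 5 && !defined(CONFIG_NEED_IRQ5)
#error "NE2K_IRQ is 5 but CONFIG_NEED_IRQ5 is not defined"
#elif NE2K_IRQ == 9 && !defined(CONFIG_NEED_IRQ9)
#error "NE2K_IRQ is 9 but CONFIG_NEED_IRQ9 is not defined"
#elif NE2K_IRQ == 10 && !defined(CONFIG_NEED_IRQ10)
#error "NE2K_IRQ is 10 but CONFIG_NEED_IRQ10 is not defined"
#endif

/* Hypothetical accessor; by the time this compiles, the check has passed. */
static int ne2k_configured_irq(void)
{
    assert(NE2K_IRQ >= 0 && NE2K_IRQ <= 15);
    return NE2K_IRQ;
}
```

Removing the CONFIG_NEED_IRQ3 line in this sketch makes the compile fail with a message naming the missing define, rather than producing a kernel with a dead network interrupt.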

--Mellvik

ghaerr commented 4 years ago

Hello @Mellvik!

Wow, you've covered a lot of ground here :) I have a number of thoughts, I think I'll respond in several comments, rather than mixing the many topics to consider.

ktcp still introduces one or two (false) keyboard characters when started. FWIW: the appearance of these characters at the login prompt is preceded by seemingly legitimate keyboard interrupts. I discovered this when I had debugging active in xt_key.c. Different from before is that these two characters now also appear when ktcp is started from the command line. Possibly even more interesting: This phenomenon disappears when the IRQ is changed to something other than 9.

Thanks to your also having debugging on in xt_key.c, I think we're on to something, finally.

With the kbd IRQ at 1, and NE2K at 9, 8 away, it seems something fishy may be going on with your system or 8259 controller. I definitely suggest testing networking on your Portable 386, to see whether this IRQ 1/9 issue presents itself.
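
To spell out the "8 away" remark: on AT-class hardware IRQs 0-7 sit on the master 8259 and IRQs 8-15 on the slave, which is cascaded through master input 2, so IRQ 1 and IRQ 9 occupy the same bit position on their respective controllers. A small sketch of that mapping follows; the port numbers are the standard PC ones, while `pic_lookup` and `struct pic_line` are just illustrative names. Whether this coincidence has anything to do with the phantom keystrokes is only a guess at this point.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_DATA 0x21    /* master 8259 interrupt mask register */
#define PIC_SLAVE_DATA  0xA1    /* slave 8259 interrupt mask register (AT) */
#define PIC_CASCADE_IRQ 2       /* master input the slave is wired to */

struct pic_line {
    uint16_t mask_port;         /* mask register that controls the line */
    uint8_t  mask_bit;          /* bit for the line within that register */
    int      on_slave;          /* nonzero if routed through the cascade */
};

/* Hypothetical helper: describe where an IRQ line lives on an AT. */
static struct pic_line pic_lookup(int irq)
{
    struct pic_line l;

    assert(irq >= 0 && irq <= 15);
    l.on_slave  = irq >= 8;
    l.mask_port = l.on_slave ? PIC_SLAVE_DATA : PIC_MASTER_DATA;
    l.mask_bit  = (uint8_t)(1u << (irq & 7));
    return l;
}
```

For IRQ 1 this gives bit 0x02 on port 0x21, and for IRQ 9 bit 0x02 on port 0xA1. An XT has no slave controller at all, which is one reason an emulator or board that behaves like an XT can mishandle anything above IRQ 7.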

Meanwhile, I have discovered more problems and solutions and have another commit ready to go, but was holding until your first round of testing, as I didn't want to break things until I heard what you found. Given the above, I suggest that you run on another interrupt (lower than 8 if possible?) on this machine, as this could possibly be causing other issues. That said, there are lots of other things going on... which I'll explain.

Finally: I've tested ethernet hardware with a mix of IRQs and IO addresses. It works - and it's (still somewhat) complicated.

Thanks! Yes, agreed it's still complicated. After we get it fully working, we can think of more ways to make it easier to use.

First it took me a while to figure out that ne2k-phy.s has io-addresses hardcoded.

ne2k-phy.s is an old BCC asm file, it isn't actually being assembled or used at all. It should be deleted, but hasn't yet because it might be needed for reference.

Are you talking about ne2k-mac.s? I recently added the base= option, did you use that to arrange for another base address? Anything else need to be modified, like tx_first_word?

Then - having trouble with interrupts - I finally found that I'd forgotten the CONFIG_NEEDIRQ define, thinking that setting the NE2K defines would do.

The NE2K_PORT value still isn't passed to ne2k-mac.s, because it has to be converted to a .S file so that #defines work... more processing is needed, and I will do that in the next step after confirmation that the existing file can be changed and made to work with base=, etc.

Outgoing telnet needs an external wakeup - as pointed out by @ghaerr - in order to work at all.

The reason for this is that ktcp was modified a few years back so that, when sending an ARP request, it busy-loops until a reply is received - and that's exactly what it does: hang the system when no packet arrives. I've solved this "synchronous" ARP problem, which is the subject of the 3-year-old issue #67, in a forthcoming commit. We need to have a well-working network in order to test it. With this fix in, there is no longer a need to "externally wake" ktcp, as what is happening is an ARP packet does arrive, just not immediately after the request.
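
The general shape of a non-blocking approach is sketched below, to show the difference from the busy-loop. This is not the code in the forthcoming commit: the cache layout and the names `arp_resolve`, `arp_reply` and `arp_requests_sent` are invented for the example, and actually transmitting the request is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARP_CACHE_SIZE 4

enum arp_state { ARP_FREE, ARP_PENDING, ARP_RESOLVED };

struct arp_entry {
    enum arp_state state;
    uint32_t ip;
    uint8_t mac[6];
};

static struct arp_entry arp_cache[ARP_CACHE_SIZE];
static int arp_requests_sent;   /* stands in for transmitting a request */

static struct arp_entry *arp_find(uint32_t ip)
{
    for (int i = 0; i < ARP_CACHE_SIZE; i++)
        if (arp_cache[i].state != ARP_FREE && arp_cache[i].ip == ip)
            return &arp_cache[i];
    return NULL;
}

/*
 * Returns 1 and fills mac if the address is already known. Otherwise
 * sends one request (if a cache slot is free) and returns 0 at once;
 * the caller keeps the packet and retries later instead of spinning.
 */
static int arp_resolve(uint32_t ip, uint8_t mac[6])
{
    struct arp_entry *e = arp_find(ip);

    assert(mac != NULL);
    if (e && e->state == ARP_RESOLVED) {
        memcpy(mac, e->mac, 6);
        return 1;
    }
    if (!e) {
        for (int i = 0; i < ARP_CACHE_SIZE; i++) {
            if (arp_cache[i].state == ARP_FREE) {
                arp_cache[i].state = ARP_PENDING;
                arp_cache[i].ip = ip;
                arp_requests_sent++;
                break;
            }
        }
    }
    return 0;
}

/* Called from the main loop whenever an ARP reply is received. */
static void arp_reply(uint32_t ip, const uint8_t mac[6])
{
    struct arp_entry *e = arp_find(ip);

    if (e) {                    /* replies nobody asked for are ignored */
        memcpy(e->mac, mac, 6);
        e->state = ARP_RESOLVED;
    }
}
```

Because `arp_resolve` never waits, the main loop keeps servicing the interface and the reply can be processed whenever it shows up, whether immediately or after other traffic.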

This all sounds like a long road, but I've made big progress on my end, (emulator only), now having telnet and telnetd working both in and out, which previously, due to lots of reasons, only appeared to work, but was actually quite buggy. I think I almost fully understand ktcp, and look forward to getting it all (re)working nicely.

ghaerr commented 4 years ago

Ping from the outside works - but only 16 times. Then ktcp dies and needs to be restarted. Kill works.

Let's start here at the basics - there is something wrong with either your particular test system with the IRQ issues, or in ktcp. Ktcp has no UDP support, very little IP support, and literally only a few lines of ICMP support - to answer an ECHO request. No memory is allocated or released by any of these routines in ktcp (except in the convoluted ktcp<->sockets code), so this should be easily debugged. I will post a PR later today that will provide complete ARP and IP packet incoming and outgoing printing support, which should allow us to see a complete log on serial console, which can then be analyzed or forwarded.
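
For reference, answering an echo request really is only a few lines. The sketch below is not the ktcp routine; `icmp_echo_reply` and `inet_checksum` are illustrative names, and swapping the source and destination addresses in the IP header is left out. It rewrites the ICMP header in place and recomputes the standard Internet checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICMP_ECHO_REQUEST 8
#define ICMP_ECHO_REPLY   0

/* Internet checksum (RFC 1071) over len bytes, computed bytewise so the
 * result does not depend on host byte order. */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (i < len)                        /* odd trailing byte */
        sum += (uint32_t)(buf[i] << 8);
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Turn an ICMP echo request into an echo reply in place. icmp points at
 * the ICMP header (type, code, checksum, id, sequence, payload).
 * Returns 0 on success, -1 if this is not a valid echo request.
 */
static int icmp_echo_reply(uint8_t *icmp, size_t len)
{
    uint16_t csum;

    assert(icmp != NULL);
    if (len < 8 || icmp[0] != ICMP_ECHO_REQUEST || icmp[1] != 0)
        return -1;
    if (inet_checksum(icmp, len) != 0)  /* a valid packet sums to zero */
        return -1;
    icmp[0] = ICMP_ECHO_REPLY;
    icmp[2] = 0;
    icmp[3] = 0;
    csum = inet_checksum(icmp, len);
    icmp[2] = (uint8_t)(csum >> 8);     /* checksum stored high byte first */
    icmp[3] = (uint8_t)(csum & 0xFF);
    return 0;
}
```

Since nothing here allocates memory, a ping that dies after a fixed number of replies points away from the ICMP code itself and toward the layers around it.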

Restart ktcp, ping succeeds 9 times.

I just found another bad pointer dereference, but only for TCP RST processing. So - whenever ktcp dies, we have to restart the system, there are too many variables introduced otherwise. The ptr fix will be in my next commit with the enhanced packet printing and ARP fix.

Looking at your test log, it seems to me that the basic networking is not working, which I now have working on my system (it did not work initially at all). I'm thinking that there is a problem at the ethernet/link level. The RETRANS code (which retransmits TCP packets when timers expire) only comes into play when packets are lost, or TCP connections don't acknowledge sequence numbers properly, that kind of thing. Because of other bugginess, this was happening a lot in basic operation, and ktcp used to run out of memory because the RETRANS code wasn't checking. That should all be fixed, except for connection close issues and occasional long data transmits. Thus my conclusion that you should test on another system or on another IRQ.

ghaerr commented 4 years ago

@Mellvik, on yet another note, I have found a micro TFTP client and server which look nice and simple, and it could be fairly straightforward to port (https://github.com/labcoder/simple-tftp). It uses fork to read/write the sockets, and that would probably have to be converted to a single-process select approach like is currently used in telnet/telnetd, but not sure.

However, I didn't realize that the TFTP protocol requires UDP - and ktcp does not implement that. The good news is this micro TFTP client/server uses TCP, so it could work great. The drawback is that we would have to use only these versions, as it's not technically TFTP compatible.

The client would also have to be ported to Linux and OSX, which is more work. But this could be a way to get network file transfer working between host and ELKS. Perhaps getting the client and server working first on Linux or OSX would be a good option, for testing? What do you think?

Mellvik commented 4 years ago

Thanks to your also having debugging on in xt_key.c, I think we're on to something, finally.

With the kbd IRQ at 1, and NE2K at 9, 8 away, it seems something fishy may be going on with your system or 8259 controller. I definitely suggest testing networking on your Portable 386, to see whether this IRQ 1/9 issue presents itself.

Will do. I'll also check if this happens - on the current system - using BIOS console.

Finally: I've tested ethernet hardware with a mix of IRQs and IO addresses. It works - and it's (still somewhat) complicated.

Thanks! Yes, agreed it's still complicated. After we get it fully working, we can think of more ways to make it easier to use.

First it took me a while to figure out that ne2k-phy.s has io-addresses hardcoded.

ne2k-phy.s is an old BCC asm file, it isn't actually being assembled or used at all. It should be deleted, but hasn't yet because it might be needed for reference.

OK - another case of changing several parameters concurrently and not really knowing which one(s) did the trick...

This all sounds like a long road, but I've made big progress on my end, (emulator only), now having telnet and telnetd working both in and out, which previously, due to lots of reasons, only appeared to work, but was actually quite buggy. I think I almost fully understand ktcp, and look forward to getting it all (re)working nicely.

I guess that depends on the angle. I figure we're almost there since it (mostly) works on virtual. Very encouraging actually.

Looking forward to more fixes ...

--Mellvik