Serial port access unreliable in ELKS

pawosm-arm commented 4 years ago

I've seen this problem on two very different XT machines: Turbo-XT with 8bit ISA card (two bare 8250's on it) and Amstrad PC 2086 (on-board Amstrad 40049 UART, the same as in Amstrad Portable PC models). Communication with other PC running full blown Linux is haunted by lost or malformed characters and using this port for a Microsoft Serial Mouse causes nano-X to hang entire machine. Contrary to this, Telix running under FreeDOS can go as fast as 115200bps on both XT machines.

ghaerr commented 4 years ago

The system is completely dead and unresponsive after the crash, pressing Esc has no effect, it must be kernel that hung.

I am wondering whether the mouse was sending ^C and ^Z characters and the system thought that the serial port had a controlling TTY, and sent signals... I recently fixed signal handling (yesterday's commit), it would be interesting to know whether this still happens, or whether this is a bug in the input processing of the serial driver.

ghaerr commented 4 years ago

A good test to determine whether signal handling is the problem for nano-X or other applications would be for you to comment out the following lines in 'elks/arch/i86/driver/char/ntty.c':

    if (key == ttyp->termios.c_cc[VINTR])
        sig = SIGINT;
    if (key == ttyp->termios.c_cc[VSUSP])
        sig = SIGTSTP;

A possible problem is that otherwise, all programs would have to either set raw tty mode, or ignore signals with an explicit call to signal(..., SIG_IGN), which most do not.

I am thinking of removing ^Z/SIGTSTP signal generation in ntty.c in ELKS for this reason.

pawosm-arm commented 4 years ago

I am wondering whether the mouse was sending ^C and ^Z characters and the system thought that the serial port had a controlling TTY, and sent signals... I recently fixed signal handling (yesterday's commit), it would be interesting to know whether this still happens, or whether this is a bug in the input processing of the serial driver.

I've tried current master, nano-X still hangs, yet this time I pressed Ctrl+C, the on-screen blob got bigger (as if there was a message outputed in it, though it can't be read as the blob is filled with stripes). Pressing Ctrl+C again and again, I've filled entire screen with the blob. I've tried to type reboot command, yet with no success. The kernel isn't entirely hung then, yet nano-X (landmine game) can't return to shell.

ghaerr commented 4 years ago

I see. Sounds like it could be the shell returning after ^C, and displaying (badly) on screen.

Have you tried exiting nano-X with ESC key instead of ^C? that should bring back normal text screen.

ghaerr commented 4 years ago

And it could be that typing 'reboot' doesn't work since the tty is in raw mode. You'd need to try 'reboot' follow by control-J.

pawosm-arm commented 4 years ago

Have you tried exiting nano-X with ESC key instead of ^C? that should bring back normal text screen.

I've tried. Before nano-X is hung, Esc returns to shell (normal 80x25 text mode) properly. When nano-X is hung, Esc has no effect.

And it could be that typing 'reboot' doesn't work since the tty is in raw mode. You'd need to try 'reboot' follow by control-J.

Nah. I've tried to press Esc, then type reboot, pressed Ctrl-J, no effect (tried Ctrl+M too as on some system Enter sends Ctrl+M to the serial console, no effect either). Also note that Ctrl+Alt+Del does not work either (it should force the reset), I had to perform hardware reset.

A good test to determine whether signal handling is the problem for nano-X or other applications would be for you to comment out the following lines in 'elks/arch/i86/driver/char/ntty.c'

Following your suggestion, I've remarked entire if branch:

int tty_intcheck(register struct tty *ttyp, unsigned char key)
{
    sig_t sig = 0;

/*
    if ((ttyp->termios.c_lflag & ISIG) && ttyp->pgrp) {
        if (key == ttyp->termios.c_cc[VINTR])
            sig = SIGINT;
        if (key == ttyp->termios.c_cc[VSUSP])
            sig = SIGTSTP;
        if (sig) {
            debug_sig("TTY signal %d to pgrp %d pid %d\n", sig, ttyp->pgrp, current->pid);
            kill_pg(ttyp->pgrp, sig, 1);
        }
    }
*/
    return sig;
}

...and the blob is gone. Now when nano-X is hung, it's frozen, no sign of life.

I pressed Ctrl+C, the on-screen blob got bigger (as if there was a message outputed in it, though it can't be read as the blob is filled with stripes).

...so I guess the message was "TTY signal %d to pgrp %d pid %d\n".

ghaerr commented 4 years ago

Thanks for your report. My conclusion is that there is a bug in the serial driver, so we'll leave this issue open.

...so I guess the message was TTY signal %d to pgrp %d pid %d\n.

No, this message is only displayed when DEBUG_SIG is set on in linuxmt/debug.h. So I think the message was the shell prompt redisplaying, which happens because a SIGINT was sent, and that killed nano-X. (I will fix that issue separately in nano-X, which needs to set the tty driver to RAW mode, which I thought it did, otherwise signals could be sent).

I'm not sure the reason ^J, reboot or otherwise doesn't do anything, I suspect either the tty driver is hung, or ELKS is crashed. In either case this issue remains a bug.

ghaerr commented 4 years ago

@pawosm-arm : I think a better test of the serial driver might be to eliminate the variable of having nano-X running graphics mode entirely, and just test using cat /dev/ttyS0 and running this both with and without the commenting out of signals in ntty.c. In this way, we can determine 1) if the serial driver is sending signals when it shouldn't be (a ^C would terminate cat with the comments out), and 2) if the serial port hangs with no possibility of signals (hang in cat with comments in).

Thanks for your testing. I can't get it to repeat in QEMU.

pawosm-arm commented 4 years ago

Strangely, cat /dev/ttyS0 breaks to the shell (after a few mouse moves, which prints lots of terminal control characters) no matter if I comment-out that if or not.

ghaerr commented 4 years ago

That is very strange! Turn on signal debugging by setting

#define DEBUG_SIG   1       /* signals*/

in elks/include/linuxmt/debug.h and you'll be able to see what signals (if any) are being generated. ps can be used to look at the pid and whether there's a controlling TTY (run cat in background then ps to see).

You'll need to make clean; make to recompile everything because debug.h isn't a dependency on kernel includes (I thought kernel dependencies were automatically generated, but I guess not).

pawosm-arm commented 4 years ago

You'll need to make clean; make to recompile everything because debug.h isn't a dependency on kernel includes (I thought kernel dependencies were automatically generated, but I guess not).

I always do make clean (just in case).

Turn on signal debugging by setting

Now cat /dev/ttyS0 prints lots of SIG related messages ending as such:

TTY open pid 17

When cat breaks to the shell it prints:

SIGNAL kill sig 12 pid 14

Pid 14 is the shell.

pawosm-arm commented 4 years ago

^^ it is with comments removed, the if in the ntty.c file is enabled.

ghaerr commented 4 years ago

The TTY open is normal, more importantly would be whether any "SIGNAL process" are displayed for cat, which is 17.

The SIGNAL kill sig 12 pid 14 is the shell getting a SIGCHLD termination signal from cat exiting, which is normal.

I'm sorry, but I now realize that running cat without setting into raw mode will definitely cause cat to exit as soon as a ^D character is received, (and signals are likely processed, but that's moot with ^D). Thus, we need to look into stty to see how to turn off input handling. Thus, a shell script must be prepared, that runs stty raw < /dev/ttyS0; cat /dev/ttyS0.

Since stty -raw isn't supported, I'll have to look at how to get the terminal back into non-raw mode, and I just tried exiting the shell and found yet another bug in /bin/[getty,login], which don't reset the terminal state, so you can't log back in!

Looks like more things to fix! I will have to look in to nano-X to see whether it is properly setting raw mode to determine whether this issue is actually a tty driver raw settings bug, a signal bug, or a tty driver input character speed issue. If you can figure out getting cat to test well, that will still help, but its more complicated than I thought, I'm sorry.

pawosm-arm commented 4 years ago

Setting raw tty was a good idea: it does not break or hang (I've also tried various serial speeds). Of course it needs reboot to restore usable tty (for the reasons you mentioned).

I'm observing two problems here: serial port stability (causing serial terminal connections and SLIP connections inherently unstable, and resulting in mouse arrow sudden jumps all over the nano-X screen) and nano-X crashing due to sudden mouse moves (caused by unstable serial port readings).

ghaerr commented 4 years ago

serial port stability (causing serial terminal connections and SLIP connections inherently unstable, and resulting in mouse arrow sudden jumps all over the nano-X screen)

Ok - so is this the primary repeatable problem for this open issue then?

Can we get more detailed information about the instability other than through nano-X (NX)?

I have a program that translates mouse data to printf() events, if you'd like to play with that.

and nano-X crashing due to sudden mouse moves (caused by unstable serial port readings).

I still don't understand this one - NX should never crash with bad mouse input, unless for some reason it gets stuck in an internal loop waiting for correct data, but I don't think so. Are you seeing the result that NX exits (and displays garbage on screen)? Bad mouse data still shouldn't crash NX.

Setting raw tty was a good idea: it does not break or hang (I've also tried various serial speeds).

It seems that the NX mouse problem is a possibly separate issue then, since you can't get ELKS to hang, crash or exit using cat.

pawosm-arm commented 4 years ago

Ok - so is this the primary repeatable problem for this open issue then? Can we get more detailed information about the instability other than through nano-X (NX)?

As I wrote before. When connected to the Linux on big PC, I'm observing some characters being lost (in normal serial terminal connection, e.g. the inittab on ELKS has getty redirected to /dev/ttyS0 and minicom is running on Linux). When I tried to create SLIP connection between ELKS and Linux and started the telnet daemon on ELKS, a telnet session connected from Linux on PC freezes after short while of using it. I've been trying various speeds.

I have a program that translates mouse data to printf() events, if you'd like to play with that.

It's worth to give it a try. Yet I never tried to cross-compile anything for ELKS outside of its repo directory tree.

Mellvik commented 4 years ago

Hi @pawosm-arm, I'm curious about this issue for a number of reasons. You mentioned in an early post that you can get 115200 bps on some 8086 hardware. Can you tell more about that hardware? My math doesn't add up on this - on a 8086.

In my experience 19200 is very fast on these klunkers (including my 286 machines), and character loss frequent even at that speed, unless we're using programmed i/o and no disturbance from interrupts. When debugging serial connections it is my recommendation to use 4800 bps in order to eliminate overrun errors disturbing the test. Unless you have reliable hw handshake of course... I have no idea if the ELKS driver supports that.

--Mellvik

mar. 2020 kl. 19:55 skrev pawosm-arm notifications@github.com:

Ok - so is this the primary repeatable problem for this open issue then? Can we get more detailed information about the instability other than through nano-X (NX)?

As I wrote before. When connected to the Linux on big PC, I'm observing some characters being lost (in normal serial terminal connection, e.g. the inittab on ELKS has getty redirected to /dev/ttyS0 and minicom is running on Linux). When I tried to create SLIP connection between ELKS and Linux and started the telnet daemon on ELKS, a telnet session connected from Linux on PC freezes after short while of using it. I've been trying various speeds.

I have a program that translates mouse data to printf() events, if you'd like to play with that.

It's worth to give it a try. Yet I never tried to cross-compile anything for ELKS outside of its repo directory tree.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ghaerr commented 4 years ago

Hello @pawosm-arm,

PR #459 adds a test mouse program, which decodes Microsoft and PC mice data, and displays on the console. It is setup just like Nano-X, so we can determine whether the problem is in Nano-X driver (identical to this one), or the serial hardware or ELKS serial driver. Note, that by default this program is setting 1200 baud and /dev/ttyS0. If your mouse does not respond, you may need to edit 'elkscmd/sys_util/mouse.c' and recompile. We'll get this debugged yet!

This would also be a good way to look into @Mellvik's statement about whether or not high baud rates are supported, although looking at 'include/linuxmt/termios.h' shows that only rates up to 38400 are supported by the kernel.

[EDIT: Looks like baud rates as high as 230400 are settable.]

The mouse program works well under QEMU, I can't get it nor Nano-X to fail, nor lose data.

pawosm-arm commented 4 years ago

Hi @Mellvik,

Those bps speeds you quoted seem slightly disappointing, my ZX Spectrum +3 with bit-banging RS232 interface can do better than that on CP/M Plus.

The mode command in FreeDOS started on 8088 and 8086 machines allows to set speed up to 19200bps as it talks to BIOS hence it is limited by that. Similarly device command in CP/M Plus started on Z80 machines allows to set up to 19200bps, yet since RS232 interface on ZX Spectrum +3 is not interrupt driven, working at such speed is limited to write only. When configuring Telix (on FreeDOS), one needs to enter hardware configuration (I/O ports, IRQ) so it can be assumed it bypasses the XT-BIOS and talks to the hardware directly. Also, Telix allows setting hardware flow control. Speed 115200bps needs to be set on both sides (Telix on 8088/8086 side, minicom on Linux side), and it works just fine! One thing I did not mention is that I prepared null-modem cable myself and I've corssed-over not only the data lines but also status lines, so RTS/CTS flow control is done in hardware. The 'miniterm' command on ELKS allows setting speed up to 57600 (it clamps larger values, problem with parsing larger integers on 16-bit machine?). Sending to Linux PC (running minicom) works just fine, even at that speed. Receiving data result in hanging miniterm (and in effect the entire machine) after a while. I did some experiment and found out that varying connection speed or giving -r option (RTS/CTS hardware flow control) does not matter, the difference is that with slower speed characters got missing less frequently, yet the crash is still there. And it's not deterministic how long it takes for this to crash.

I suspect serial port IRQ handling is somehow broken and causes overruns and crashes.

pawosm-arm commented 4 years ago

@ghaerr your mouse program prints values for a while (both negative and positive for x and y, 0..3 for buttons a while after I press them), yet everything hangs soon after. Also I noted that those readings are terribly delayed: moving a mouse makes flood of x and y changes for longer than I move it, and I need to wait some time to see effects of pressing the buttons.

ghaerr commented 4 years ago

Hello @pawosm-arm, thank you for testing with the mouse program.

IMO, your testing confirms definite problems in the serial port driver. I don't yet know a good way to debug that, other than studying the code. It appears to be timing related, perhaps too much interrupt overhead, not sure.

yet everything hangs soon after.

If the mouse program can be interrupted with ^C after the hang, then likely the bug is just speed or interrupt overhead related in the serial driver. If the system hangs entirely, then the problem is buffer overrun or something more sinister, most likely in the serial driver and/or its interrupt processing code.

Also I noted that those readings are terribly delayed:

This could be an indication of too much interrupt overhead for the 8086 to keep up, but that's not a bug, and the serial chip should just indicate a data overrun in that case and drop input.

Mellvik commented 4 years ago

@pawosm-arm,

thanks for the run-through, which made me refresh some old specs and more. The 8250 serial chip (and clones) which is being used on most PX/XT/AT systems is normally driven by a 1.8432MHz oscillator, which limits the baud rate to 58000. At this speed the error rate is quite high, so 32400 is usually the max useable speed.

Now, if the cable and software allows for real HW flow control, any speed will work fine since the HW throttles it down to whatever is possible. W/O HW flow control, which is the normal, it is my experience what 9600 is the highest reliable speed on an interruptdriven system, 19200 seems OK on DOS. I don't think direct or via BIOS makes much of a difference. An if you use a null modem cable, disabling hardware flow control, I would expect you to get high character loss above 19200.

My point is: With ELKS, we should not expect more than 9600 unless we know that the HW flow control is indeed working. It may be that the serial interrupt handler is broken. Just as likely (my guess actually) that heavy interrupt load has not been thoroughly tested in ELKS, regardless of source, so this is a chance to do so. Your 8086 (8088??) is a great chance to check what's possible at the utter low end.

I'll connect a terminal to the serial port of my machine soon, activate a getty and check if I can get it working.

@ghaerr, would it be possible to change the TTY data in ps() (thanks for the addition) to something that points back to the /dev/ttyX in use?

—Mellvik

mar. 2020 kl. 13:07 skrev pawosm-arm <notifications@github.com mailto:notifications@github.com>:

Hi @Mellvik https://github.com/Mellvik,

Those bps speeds you quoted seem slightly disappointing, my ZX Spectrum +3 with bit-banging RS232 interface can do better than that on CP/M Plus.

The mode command in FreeDOS started on 8088 and 8086 machines allows to set speed up to 19200bps as it talks to BIOS hence it is limited by that. Similarly device command in CP/M Plus started on Z80 machines allows to set up to 19200bps, yet since RS232 interface on ZX Spectrum +3 is not interrupt driven, working at such speed is limited to write only. When configuring Telix (on FreeDOS), one needs to enter hardware configuration (I/O ports, IRQ) so it can be assumed it bypasses the XT-BIOS and talks to the hardware directly. Also, Telix allows setting hardware flow control. Speed 115200bps needs to be set on both sides (Telix on 8088/8086 side, minicom on Linux side), and it works just fine! One thing I did not mention is that I prepared null-modem cable myself and I've corssed-over not only the data lines but also status lines, so RTS/CTS flow control is done in hardware. The 'miniterm' command on ELKS allows setting speed up to 57600 (it clamps larger values, problem with parsing larger integers on 16-bit machine?). Sending to Linux PC (running minicom) works just fine, even at that speed. Receiving data result in hanging miniterm (and in effect the entire machine) after a while. I did some experiment and found out that varying connection speed or giving -r option (RTS/CTS hardware flow control) does not matter, the difference is that with slower speed characters got missing less frequently, yet the crash is still there. And it's not deterministic how long it takes for this to crash.

I suspect serial port IRQ handling is somehow broken and causes overruns and crashes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/454#issuecomment-598150993, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOGOMZVMGV7RR53EMTLRHDF7VANCNFSM4LFYTSVA.

ghaerr commented 4 years ago

would it be possible to change the TTY data in ps() (thanks for the addition) to something that points back to the /dev/ttyX in use?

Yes. However it seems to be a lot of work at the moment. ps uses /dev/kmem and follows a trail of pointers to get its information. That needs to be done and then deal with all the possible config defines for the particular TTY driver linked in (there are many, bioscon, directcon, serialcon, etc).

ghaerr commented 4 years ago

Hello @Mellvik, interesting discussion on miniterm and older hardware.

I suspect serial port IRQ handling is somehow broken and causes overruns and crashes.

The usual reason for this is interrupts enabled within an interrupt routine when not able to handle another interrupt, or nested interrupts with the same thing, or running out of stack. You might look at clock and interrupt routine serial handlers and try a 'cli' in either or both just for testing to see whether that (temporarily) fixes or worsens the problem.

The 'miniterm' command on ELKS allows setting speed up to 57600 (it clamps larger values, problem with parsing larger integers on 16-bit machine?).

Maybe fixing that should be added to your increasing TODO list! :)

Mellvik commented 4 years ago

@ghaerr, Yeah, this is close to heart (need), so yes, consider it on my list.

BTW . the apparent 57600 limit may be a hardware limit, see my post above.

—-Mellvik

mar. 2020 kl. 14:48 skrev Gregory Haerr notifications@github.com:

Hello @Mellvik https://github.com/Mellvik, interesting discussion on miniterm and older hardware.

I suspect serial port IRQ handling is somehow broken and causes overruns and crashes.

The usual reason for this is interrupts enabled within an interrupt routine when not able to handle another interrupt, or nested interrupts with the same thing, or running out of stack. You might look at clock and interrupt routine serial handlers and try a 'cli' in either or both just for testing to see whether that (temporarily) fixes or worsens the problem.

The 'miniterm' command on ELKS allows setting speed up to 57600 (it clamps larger values, problem with parsing larger integers on 16-bit machine?).

Maybe fixing that should be added to your increasing TODO list! :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/454#issuecomment-598194926, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOAS6M5QUDMOY5RR6UDRHDRZNANCNFSM4LFYTSVA.

ghaerr commented 4 years ago

Hello @pawosm-arm,

Just wanted to say that since getting real hardware to test on, I have found two major problems that I'm sure fixes the problems brought up in this issue. Both were related to the serial ring buffer, the result of which was dropped characters or possible kernel corruption.

The first issue was that with user programs (e.g., miniterm or mouse) reading from the serial port at the same time that serial input was being received. The kernel was not protecting the ring buffer input/output pointer and length, and could be in the middle of using the pointer to get character data while the serial interrupt routine was also manipulating the same pointer/length. Any time this occurred, which was very regularly at high serial speeds, the system would lose data. The fix entailed turning off interrupts during the time the ring buffer pointer and length are manipulated.

The second issue I created myself around mid-April when the serial input ring buffer size was increased from 64 to 1200 (slip MTU). On slower systems, the kernel and user programs may not be able to keep up with the received data rate, and the ring buffer would (by design) drop characters. When I increased it to 1200, it broke the system as the ring buffer size only works when the buffer size is a power of two. So that also caused lost data. It's now set to 1024. I'm considering options to allow user-specified ring buffer sizes, so that longer packets or terminal emulation data can be received at very high baud rates on slow systems.

At this point, PR #664 fixes both the above problems, and introduces an optional high-speed serial interrupt routine that allows running beyond 19200 on slower systems without a FIFO. I have (short-duration, under 1024 character) receive speeds working as high as 57600 on a Compaq 386 and 115200 baud continuously on a faster desktop 386 using the new driver. For now, COM1 is setup to use the new driver, while COM2 defaults to the original driver (with both ring buffer fixes). The option to set the fast driver is CONFIG_FAST_IRQ4 in include/arch/ports.h.

If you have time to test, it would be nice, so that we can finally close this issue. Thanks!

pawosm-arm commented 4 years ago

This sounds great! I'll test it tomorrow mid day.

Mellvik commented 4 years ago

@ghaerr, This is indeed a milestone - literally lifting ELKS from experiment to tool. I may have said that before, and the improvements over the past months have been remarkable. This tops the list, thank you for another incredible piece of work. You're getting more out of good ol' RS232 than I though possible. I'm happy to have talked you into getting real (ancient) hardware :-).

I look forward to give the changes a real beating(!).

I still think HW flow control must be added, but now there's no rush.

--Mellvik

jun. 2020 kl. 21:02 skrev Gregory Haerr notifications@github.com:

Hello @pawosm-arm,

Just wanted to say that since getting real hardware to test on, I have found two major problems that I'm sure fixes the problems brought up in this issue. Both were related to the serial ring buffer, the result of which was dropped characters or possible kernel corruption.

The first issue was that with user programs (e.g., miniterm or mouse) reading from the serial port at the same time that serial input was being received. The kernel was not protecting the ring buffer input/output pointer and length, and could be in the middle of using the pointer to get character data while the serial interrupt routine was also manipulating the same pointer/length. Any time this occurred, which was very regularly at high serial speeds, the system would lose data. The fix entailed turning off interrupts during the time the ring buffer pointer and length are manipulated.

The second issue I created myself around mid-April when the serial input ring buffer size was increased from 64 to 1200 (slip MTU). On slower systems, the kernel and user programs may not be able to keep up with the received data rate, and the ring buffer would (by design) drop characters. When I increased it to 1200, it broke the system as the ring buffer size only works when the buffer size is a power of two. So that also caused lost data. It's now set to 1024. I'm considering options to allow user-specified ring buffer sizes, so that longer packets or terminal emulation data can be received at very high baud rates on slow systems.

At this point, PR #664 fixes both the above problems, and introduces an optional high-speed serial interrupt routine that allows running beyond 19200 on slower systems without a FIFO. I have (short-duration, under 1024 character) receive speeds working as high as 57600 on a Compaq 386 and 115200 baud continuously on a faster desktop 386 using the new driver. For now, COM1 is setup to use the new driver, while COM2 defaults to the original driver (with both ring buffer fixes). The option to set the fast driver is CONFIG_FAST_IRQ4 in include/arch/ports.h.

If you have time to test, it would be nice, so that we can finally close this issue. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

pawosm-arm commented 4 years ago

I've been testing it at various speeds and found this is now very reliable at baud rates up to 9600:

$ telnet 192.168.0.2
Trying 192.168.0.2...
Connected to 192.168.0.2.
Escape character is '^]'.
login: root
# ps
  PID   GRP  TTY USER STAT CSEG DSEG  HEAP   FREE   SIZE COMMAND
    1     0      root    S 3e45 3fef  3072   2016  15584 /bin/init 
    7     7    1 root    S 4a75 55bf  1166   6746  64912 -/bin/sh 
   13    13   p0 root    S 4a75 6199  1166   6818  64912 -/bin/sh 
    9     7    1 root    R 5a4e 68e6  9376  17051  59776 ktcp -b -d -m 512 -p slip -s 9600 -l /dev/ttyS0 192.168.0.2 192.168.0.1 255.255.255.0 
   11     7    1 root    R 46a2 4213     0   1969  11936 telnetd 
   14    13   p0 root    R 4810 5e87  1024   5952  18624 ps 
#

On Linux side I was setting SLIP connection as such:

slattach -p slip -L -s 9600 /dev/ttyUSB1 &
ifconfig sl0 up 192.168.0.1 pointopoint 192.168.0.2 netmask 255.255.255.0 mtu 512

Unfortunately, at 19200 only ICMP pings were achievable, and at higher speeds even that was not possible, too many TCP checksum errors (I started ktcp -d to see them). I guess to go at 19200 up to 115200 (available on this machine e.g. in Telix), hardware flow control must be supported in the ELKS kernel. Yet since the scope of this ticket is fulfilled, I'm feeling obliged to close it.

BTW, during my tests ktcp was locally patched to accept -m option.

pawosm-arm commented 4 years ago

In case of nxlandmine, now it takes significantly much more time to crash. I even managed to quit naturally, yet just once. And the mouse arrow motion is still very slow and lags behind mouse movement terribly.

ghaerr commented 4 years ago

I've been testing it at various speeds and found this is now very reliable at baud rates up to 9600

Well, that's a definite improvement!

Unfortunately, at 19200 only ICMP pings were achievable, and at higher speeds even that was not possible,

The "fast" driver which you are using (default on COM1 for now) is, I think, the fastest possible under ELKS. After lots of rewrites trying different methods, this version runs with interrupts completely disabled with a very short routine to just write the ring buffer.

Now that I completely understand what is going on with ELKS interrupt handling and its serial I/O, I find it a bit underwhelming that your system can't handle > 9600 baud with the new driver, especially since Telix works with much higher speeds. The fast driver does not empty the FIFO each interrupt in the fast driver, only the CONFIG_NEED_IRQ4 driver, which has higher per-interrupt overhead does that.

What type of UART is ELKS identifying your system as having?

too many TCP checksum errors (I started ktcp -d to see them).

TCP checksum errors seems to be the indication of lost data, although it would be interesting to test using miniterm connected to Linux and increasing the baud rate until seeing dropped characters. I use "ls -lR /" as it produces easily-recognizable columns to determine lost characters.

BTW, ktcp -d isn't actually implemented yet, the IP and TCP checksum errors are always displayed in the current version. It is a bit strange that on both your and my systems, high baud rates produces TCP checksum errors but not IP checksum errors. I need to look further into this.

I guess to go at 19200 up to 115200 (available on this machine e.g. in Telix), hardware flow control must be supported in the ELKS kernel.

If you can, try disabling SLIP and just logging into Linux via miniterm to see what results you get.

Yet since the scope of this ticket is fulfilled, I'm feeling obliged to close it.

Thanks. I think we've solved the "reliability" issue, but we may be able to get more speed yet.

BTW, during my tests ktcp was locally patched to accept -m option.

Thanks, I just noticed this, ignore my request for this on PR.

ghaerr commented 4 years ago

In case of nxlandmine, now it takes significantly much more time to crash. I even managed to quit naturally, yet just once. And the mouse arrow motion is still very slow and lags behind mouse movement terribly.

Thanks for your testing. I will look further into this. @georgp24 mentions some additional issues with nano-X in #552 that suggest bugs in nxlandmine after a while.

Have you tested mouse motion with "mouse"? The mouse only runs at 1200 baud and ELKS serial driver should easily keep up with the serial data. The "mouse" program should keep up and never die.

pawosm-arm commented 4 years ago

Have you tested mouse motion with "mouse"? The mouse only runs at 1200 baud and ELKS serial driver should easily keep up with the serial data. The "mouse" program should keep up and never die.

It does not die indeed, HOWEVER, there's some difference in how it quits (with ^C) after longer use with lots of motion. When I press ^C not long after starting mouse program, it quits as expected. But when I run it for longer (with lots of motion), pressing ^C causes immediate logout and I need to log in as root again... Does this uncover some hidden issue? Also I wonder how big the buffer was allocated, with fast moves and intermittent clicks it can react and print changed button state a good minute after I pressed it, printing all the mouse positions I was imposing in between. This seem to require enormous buffer! Where is it? In kernel or in userspace? IMO it it should drop the content from time to time, no one requires so detailed information about mouse motion.

pawosm-arm commented 4 years ago

What type of UART is ELKS identifying your system as having?

It prints:

ttyS0 at 0x3f8, irq 4 is an 8250

pawosm-arm commented 4 years ago

Now that I completely understand what is going on with ELKS interrupt handling and its serial I/O, I find it a bit underwhelming that your system can't handle > 9600 baud with the new driver, especially since Telix works with much higher speeds.

Note that Telix uses hardware flow control (e.g. with hardware flow control enabled it detects and complains when there's no 'other side' connected) and Telix runs under DOS, hence it has all of the CPU for its own: no time sharing, no preemptive scheduling and the IRQ line that one needs to configure in Telix is entirely handled in userspace with no bottom halves. It was just easier for its developers to achieve higher speed.

Yet still, achieving at least 19200 (to match at least with ZX Spectrum +3 running CP/M) would be desirable.

pawosm-arm commented 4 years ago

If you can, try disabling SLIP and just logging into Linux via miniterm to see what results you get.

I did some test. I have putty on Linux and it also have serial port terminal. It allows copy+paste of larger text blocks (not waiting for transfer to complete, just to make lots of communication).

So I pasted GNU licence several times. At 9600bps all of the copies went through. At 19200 something weird happend. First copy seemed OK, but the subsequent copies looked as if some of the lines appeared in wrong order, as if wrong part of the serial buffer was copied to the userspace (namely, to miniterm).

Note that miniterm must be started with '-l' option for this test.

ghaerr commented 4 years ago

It does not die indeed, HOWEVER, there's some difference in how it quits (with ^C) after longer use with lots of motion. When I press ^C not long after starting mouse program, it quits as expected. But when I run it for longer (with lots of motion), pressing ^C causes immediate logout and I need to log in as root again...

Sounds like there may still be corruption occurring... with nothing else running and only moving the mouse, there should never be a case where ELKS becomes unstable. Usually, when a program exits and the shell exits immediately thereafter, as described above, this indicates stack overflow or corruption, although in some cases it can happen as a result of the console being in raw mode and the shell read returning 0, which causes the shell to exit.

It is still quite complicated debugging this, as it may not be the serial driver itself that is the problem. For instance, with ktcp I have identified (not solved) a serious kernel bug using the /dev/tcpdev transfer mechanism for sockets that sometimes results in the kernel returning a longer buffer than requested by read(). When this happens, of course, the receiving program is toast and no recovery possible in most circumstances. It could still be interesting to try using miniterm using ls -lR / with streaming serial data to actually get a visual on the character loss that is happening, if any.

Does this uncover some hidden issue?

Possibly, but hard to say without getting a clearer visual. Please try miniterm when you have time, thanks!

Also I wonder how big the buffer was allocated, with fast moves and intermittent clicks it can react and print changed button state a good minute after I pressed it, printing all the mouse positions I was imposing in between.

There is no buffer allocated in mouse. Instead, it busy loops with a read() of a small number of characters, which causes the kernel to read the 1024 byte ring buffer for received data.

This seem to require enormous buffer! Where is it? In kernel or in userspace?

No, the buffer is small, 1024 bytes. It is allocated when the serial port is opened (configurable in ntty.h to a power-of-two). It is located in kernel data space and has always been used by the system to read tty or serial data. The new driver, located in arch/186/drivers/char/serial.c, has its interrupt routine fast_com1_irq which, with interrupts disabled, just reads a single character from the UART and puts it in the ring buffer.

IMO it it should drop the content from time to time, no one requires so detailed information about mouse motion.

Yes, the design of using a ring buffer is such that the interrupt routine will drop data when there is no more space. Here is the code:

/*
 * Fast serial driver for slower machines. Should work up to 58400 baud.
 * No ISIG (tty signal interrupt) handling for shells, used for fast SLIP transfer.
 *
 * Specially-coded fast C interrupt handler, called from asm irq_com[12] after saving
 * scratch registers AX,BX,CX,DX & DS and setting DS to kernel data segment.
 * NOTE: no parameters can be passed, nor any code written which
 * emits code using SP or BP addressing, as SS is not set and not guaranteed to equal DS.
 * Use 'ia16-elfk-objdump -D -r -Mi8086 serial.o' to look at code generated.
 */
void fast_com1_irq(void)
{
    struct serial_info *sp = &ports[0];
    char *io = sp->io;
    struct ch_queue *q = &sp->tty->inq;
    unsigned char c;

    c = INB(io + UART_RX);              /* Read received data */
    if (q->len < q->size) {
        q->base[(unsigned int)((q->start + q->len) & (q->size - 1))] = c;
        q->len++;
    }
}

The line q->len < q->size is the part that checks whether there is any space in the buffer.

ghaerr commented 4 years ago

I did some test. I have putty on Linux and it also have serial port terminal. It allows copy+paste of larger text blocks (not waiting for transfer to complete, just to make lots of communication).

You can enable a serial port for login access by using systemctl start getty@ttyS0 on most systems. That will allow you to skip putty and just login from ELKS using miniterm. You can then run ls -lR /, or sometimes "cat file" with a known number of bytes.

So I pasted GNU licence several times. At 9600bps all of the copies went through. At 19200 something weird happend. First copy seemed OK, but the subsequent copies looked as if some of the lines appeared in wrong order, as if wrong part of the serial buffer was copied to the userspace (namely, to miniterm).

Ok, that is pretty strange. When the ring buffer is full, it can skip large blocks of input. You have to make sure that, in fact, the data is duplicated and not just similar data from the GNU license. Perhaps use a specially-constructed file so that you can see exactly what portions and where it is dropped or duplicated.

However, I have seen this strangeness you describe when I debugged my own error of using a non-power-of-two ring buffer... the data was duplicated. This issue is completely debugged in the PR you're testing with though.

Another thing to try, depending on your system, is to change the INB and OUTB macros in serial.c to use what Linus used for Linux originally - inb_p and out_p asm calls. These were discussed on the Linux list many years ago and the macros (in arch/io.h) add a 1 us delay after any in or out instruction. I could not find a reason that the early PCs required this delay, so I took out the delays in this latest version, and things run fine on my Compaq and other desktop systems. Note, that if your system does not need these delays, adding them back will slow things down a bit further.

ghaerr commented 4 years ago

Note that miniterm must be started with '-l' option for this test.

I missed that. It would be preferable to not have any of the special add_lf handling that you added when doing these tests, as that portion of miniterm uses extra write calls which end up turning off interrupts during the system call entry, etc. The idea is to run miniterm with no special options, which allows it to both read and write buffers as quick as possible while testing the serial in routine. Nothing is perfect, though, and there are still lots of areas where there could be lurking bugs. I originally thought miniterm was buggy when fixing the serial routines, but now believe that it was interrupt handling, which I think is completely fixed. That said, we may still be able to get you reading 19.2k or above once we fully understand what/where the data loss is.

The regular driver, used by setting CONFIG_NEED_IRQ4 in arch/ports.h, can be setup now to report serial overruns and framing errors. This requires doing the above and changing an #if 0 in the serial.c rs_irq routine, if you are interested. We can't track data overruns nor call printk in the fast serial routine due to complicated reasons which I can explain if you're interested.

pawosm-arm commented 4 years ago

Is inb_p/outb_p defined anywhere for ELKS?

ghaerr commented 4 years ago

Is inb_p/outb_p defined anywhere for ELKS?

Yes, as I mentioned above, they are in include/arch/io.h. To use them, edit line 42 of i86/arch/drivers/char/serial.c, where they are defined (and discusses inb_b/outb_p).

pawosm-arm commented 4 years ago

Ah, yes, it's there. It was a long day, somehow I didn't notice it.

pawosm-arm commented 4 years ago

You can enable a serial port for login access by using systemctl start getty@ttyS0 on most systems. That will allow you to skip putty and just login from ELKS using miniterm. You can then run ls -lR /, or sometimes "cat file" with a known number of bytes.

It's systemctl start serial-getty@ttyS0 on Gentoo Linux. It took some time before I found how to setup baud rate properly for this (never use --keep-baud option!!!). What I found is that 9600bps and 19200bps works almost the same (except that at 19200bps the speedup is noticeable). Yet when they stuck, they stuck the same way. E.g. it can print almost all output of dmesg command and suddenly stop in the middle of a line. And then, when I press Enter, new propmt appears as if nothing happened. So it's not that bad with 19200bps, yet still there's something that can stuck from time to time.

ghaerr commented 4 years ago

What I found is that 9600bps and 19200bps works almost the same (except that at 19200bps the speedup is noticeable). Yet when they stuck, they stuck the same way. E.g. it can print almost all output of dmesg command and suddenly stop in the middle of a line. And then, when I press Enter, new propmt appears as if nothing happened. So it's not that bad with 19200bps, yet still there's something that can stuck from time to time.

Interesting. So what you're saying is: 9600 works perfectly every time, but 19200 works without any loss at all, until it just stops completely. Then, you can hit Enter and a prompt shows up?

That sounds to me like 19200 is using up all the CPU cycles just handling the UART read and inserting into the ring buffer... with nothing else able to run (including miniterm). Then, when the ring buffer is full (whose code is shown above), the characters are dropped completely, while still receiving data and using up CPU cycles. When the blast of data finally stops, the system recovers and sending an Enter results in a prompt.

What you might do here would be to change the value of RSINQ_SIZE from 1024 to 4096 in include/linuxmt/ntty.h. That will give a much larger buffer. There likely isn't room (yet) to make this 8192 as its in the kernel data segment, and must be a power of two. That might help. Otherwise, we may need to implement software flow control. You can test this, as I have on slower systems by rapidly using ^S/^Q to throttle the data within the 1024 buffer window. If that works, then we could get away with software flow control. This is what I have to do on the slower compaq 386 with speeds 38400 and 57600, as the system doesn't lose data but can't keep up, and the ring buffer overflows.

pawosm-arm commented 4 years ago

That sounds to me like 19200 is using up all the CPU cycles just handling the UART read and inserting into the ring buffer... with nothing else able to run (including miniterm). Then, when the ring buffer is full (whose code is shown above), the characters are dropped completely, while still receiving data and using up CPU cycles. When the blast of data finally stops, the system recovers and sending an Enter results in a prompt.

It's possible. PC XT systems of the era were dramatically underpowered.

pawosm-arm commented 4 years ago

What you might do here would be to change the value of RSINQ_SIZE from 1024 to 4096 in include/linuxmt/ntty.h.

I've changed RSINQ_SIZE to 4096 in all of the .h files where I've found it. It takes longer stuff to block it now, yet it still blocks...

pawosm-arm commented 4 years ago

...but inb_p/outb_p helped a lot! Now I'll try to revert RSINQ_SIZE change and test *_p change only.

pawosm-arm commented 4 years ago

Nah, it needs both inb_p/outb_p .and. RSINQ_SIZE change to 4096 to prevent 19200bps from stalling.

pawosm-arm commented 4 years ago

Cancel above. It makes things slightly better, but not perfect. E.g. I managed to establish telnet connection (using ktcp) at 19200, yet it stuck as I was doing 'ps':

$ telnet 192.168.0.2
Trying 192.168.0.2...
Connected to 192.168.0.2.
Escape character is '^]'.
login: root
# ps
  PID   GRP  TTY USER STAT CSEG DSEG  HEAP   FREE   SIZE COMMAND
    1     0      root    S 3e49 3ff3  3072   2016  15584 /bin/init 
    7     7    1 root    S 4a79 55c3  1166   6746  64912 -/bin/sh 
   13    13   p0 root    S 4a79 619d  1166   6818  64912 -/bin/sh

Also, repeating dmesg command in miniterm stuck eventually...

pawosm-arm commented 4 years ago

Ha, I didn't notice it at firtst! Above telnet test failed with IP: BAD CHECKSUM and panic "Unable to get timer"!

ghaerr / elks

Serial port access unreliable in ELKS #454