ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086

INTR not working #676

Closed Mellvik closed 4 years ago

Mellvik commented 4 years ago

^C (on serial) does not seem to have any effect whatsoever in ash or sash. Only as 'line kill' in linenoise.

--Mellvik

Mellvik commented 4 years ago

Addition: works fine on console, so this is a serial issue.

-M

ghaerr commented 4 years ago

^C (on serial) does not seem to have any effect whatsoever in ash or sash.

Yes, this is a known issue with the new "CONFIG_FAST_IRQ4" serial driver. Because it skips all ELKS overhead and tries to operate at maximum speed, there is no TTY line processing (including ^C). Thus, the driver isn't great for getty connections into ELKS. The FAST driver works well for outgoing serial connections using miniterm though.

Currently, the only workaround is setting CONFIG_NEED_IRQ4 in ports.h and recompiling. This will use the older (original) driver. Since that's not very user-friendly, I'm investigating other mechanisms of having both interrupt routines compiled in and having the serial interrupt switchable based on some kind of TTY line mode.
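A minimal sketch of the difference described above, written as a Python model rather than ELKS kernel code. The function names and the `"SIGINT"` marker are illustrative assumptions; the only detail taken from convention is that ^C is byte 0x03.

```python
INTR = 0x03  # ^C, the conventional default interrupt character

def receive_normal(data, isig=True):
    """Model a receive path that runs TTY line processing on each byte."""
    queue, signals = bytearray(), []
    for byte in data:
        if isig and byte == INTR:
            signals.append("SIGINT")  # the byte is consumed, not queued
            continue
        queue.append(byte)
    return bytes(queue), signals

def receive_fast(data):
    """Model a receive path that only copies bytes into the queue."""
    return bytes(data), []
```

In this model the fast path hands ^C to the reader as ordinary data, which matches the symptom reported at the top of the issue.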

During the extensive testing of serial driver performance, and after identifying the nasty ring buffer input overrun problems, I noticed that the CONFIG_NEED_IRQ4 (original) driver can keep up at 19200 on both the testing 386 desktop and Compaq Portable systems. It would be interesting to test using the standard driver on your system versus the FAST driver when you find time to do serial network testing. Note that the FAST driver reads only a single character per interrupt rather than emptying the hardware FIFO. The original driver reads all FIFO characters received in a single interrupt. So there are currently tradeoffs on which serial driver works best for individual applications. I have not been able to test the FIFO on real hardware, since I don't have a modern card for either of my systems.
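To make the one-character-versus-drain tradeoff concrete, here is a small Python model (not the ELKS driver). It assumes no new bytes arrive while the handler runs, and both function names are hypothetical.

```python
from collections import deque

def service_one_per_interrupt(fifo):
    """Read a single byte per interrupt; return (data, interrupt_count)."""
    out, interrupts = bytearray(), 0
    while fifo:
        interrupts += 1
        out.append(fifo.popleft())
    return bytes(out), interrupts

def service_drain(fifo):
    """Read until the FIFO is empty inside each interrupt."""
    out, interrupts = bytearray(), 0
    while fifo:
        interrupts += 1
        while fifo:
            out.append(fifo.popleft())
    return bytes(out), interrupts
```

With 14 bytes waiting, the first style takes 14 interrupts and the second takes one. The data delivered is the same; only the per-interrupt overhead differs.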

Mellvik commented 4 years ago

OK; I had missed that one.

Your idea about using tty mode switching to select 'fast' or 'normal' driver is good, maybe even optimal. I suggest switching driver when in raw mode, otherwise use the driver with normal tty processing.

My testing indicates that there may still be issues with the fast driver. More on that in a separate message.

--Mellvik

Mellvik commented 4 years ago

The new driver is what I'd call a massive improvement. It is now possible to dump text/data into an ELKS window and get echo consistently. My first test last night @ 19200 bps (most recent commit, 386/20, no fifo) showed a few lost characters towards the end of the file (approx 17.5kbytes). Curious, I reduced speed to 9600 and got increased losses. Then continuing this morning, 4800 bps increased losses even more. Not as expected. Some specifics:

The loss is repeatable. Dumping the same file (really the same size file), the losses appear at exactly the same byte count and every occurrence shows exactly the same # of characters lost.
@ 4800 and 19200, 13 consecutive bytes are lost at predictable intervals
@ 19200 (filesize 17,408 bytes = 0x100 lines @ 68 chars each) the dropouts happen at byte counts 8184, 12260 and 16333
@ 4800 bps the first dropout is @ byte count 5118; further down the losses increase, but the positions where losses happen are about the same as for 19200.
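The reported offsets can be checked with a little arithmetic. This is only a sketch over the numbers quoted in the list above; the variable names are made up and it says nothing about the cause.

```python
LINES, LINE_LEN = 0x100, 68
DROPS_19200 = [8184, 12260, 16333]  # byte counts reported at 19200 bps

file_size = LINES * LINE_LEN
gaps = [b - a for a, b in zip(DROPS_19200, DROPS_19200[1:])]
# Distance from each dropout up to the next multiple of 4096.
below_4k_boundary = [(-d) % 4096 for d in DROPS_19200]

print(file_size)          # 17408
print(gaps)               # [4076, 4073]
print(below_4k_boundary)  # [8, 28, 51]
```

So the dropouts are a little under 4 KiB apart, and each falls shortly before a multiple of 4096. Whether that spacing points at a buffer boundary somewhere is not established by these numbers alone.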

What makes me think there may still be a bug here is the fact that lower speeds increase losses. There are no overrun reports/messages. Further, switching to a much slower machine (286/12) does not change anything. The behaviour is exactly the same, down to the byte.

--Mellvik

ghaerr commented 4 years ago

Pretty strange. With more errors at decreasing baud rates, it seems the problem is time-related rather than in the serial code, especially with a ring buffer of only 1k bytes.
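For readers unfamiliar with the term, a ring buffer with an overrun counter can be sketched as follows. This is a generic Python illustration, not the ELKS implementation; the class and attribute names are assumptions, and only the 1k size comes from the comment above.

```python
class RingBuffer:
    """Fixed-size FIFO that counts bytes rejected because it was full."""

    def __init__(self, size=1024):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0      # next write position
        self.tail = 0      # next read position
        self.count = 0     # bytes currently stored
        self.overruns = 0  # bytes dropped on a full buffer

    def put(self, byte):
        if self.count == self.size:
            self.overruns += 1
            return False
        self.buf[self.head] = byte
        self.head = (self.head + 1) % self.size
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        self.count -= 1
        return byte

rb = RingBuffer(size=1024)
for i in range(1030):          # a burst longer than the buffer, with no reader
    rb.put(i % 256)
print(rb.count, rb.overruns)   # 1024 6
```

A buffer like this only loses data when the writer outruns the reader, which would normally get worse at higher speeds, not lower ones.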

It is now possible to dump text/data into an ELKS window

What do you mean "ELKS window"? Are you using three wire directly from Linux to ELKS /dev/ttyS0, or through a terminal emulator, etc.?

What exactly are you using for testing? That is, what program is sending the data, and do we know it is not buggy? Are you receiving using 'sercat', or "cat > file"? Those two programs have different buffer sizes before they ask ELKS to write the file, although in almost all cases a serial read() will return with fewer characters than asked for, depending on TTY mode. Sercat was written to use raw mode so as to not use excessive CPU with echoing, etc.

Finally, if the bug is completely repeatable at byte 8184, try sending an exactly 8192 byte file, rather than longer to see whether this still drops the last 8 bytes.
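One way to produce a file of an exact size whose contents make a gap easy to spot is to number every line. The helper below is a hypothetical sketch; the 68-byte line length is taken from the earlier report, everything else is an assumption.

```python
def make_test_data(total_bytes, line_len=68):
    """Build total_bytes of text made of fixed-length, hex-numbered lines."""
    out = bytearray()
    line_no = 0
    while len(out) < total_bytes:
        prefix = f"{line_no:04x} ".encode("ascii")
        filler = b"." * (line_len - len(prefix) - 1)  # leave room for "\n"
        out += prefix + filler + b"\n"
        line_no += 1
    return bytes(out[:total_bytes])  # trim the last line to hit the exact size

exact_8k = make_test_data(8192)
full_file = make_test_data(0x100 * 68)
```

Because each line starts with its own number, a missing stretch in the received copy shows up as a jump in the sequence.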

ghaerr commented 4 years ago

I suggest switching driver when in raw mode, otherwise use the driver with normal tty processing.

While seemingly a good idea, be aware that sh uses the linenoise library, which switches from raw to cooked mode on every line of input (command). That is, logging in to ELKS from serial puts one in raw mode at the shell prompt. Just after reading each command, the mode is switched to cooked and fork/exec, etc. I don't think it is necessarily a good idea to switch TTY drivers at the shell prompt without the user knowing. In any case, all of this can wait: the CONFIG_NEED_IRQ4 driver works very well for normal serial needs, and we're trying to determine whether the CONFIG_FAST_IRQ4 driver works well for higher speed (38400-115200) SLIP networking. Depending on the speed of the CPU, the driver should support one of those higher speeds and never drop characters, if the received burst length is < 1024 bytes. These were my design criteria for calling the fast driver a "success".
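To put rough numbers on those speeds, the sketch below computes the time available per character and how long a 1024-byte buffer takes to fill if nothing is read. It assumes 8N1 framing (10 bits on the wire per character), which is not stated in the thread, and the function names are illustrative.

```python
def chars_per_second(baud, bits_per_char=10):
    """Assumes 8N1: 1 start + 8 data + 1 stop bit per character."""
    return baud / bits_per_char

def usec_per_char(baud):
    """Microseconds between back-to-back characters at the given baud rate."""
    return 1_000_000 / chars_per_second(baud)

def buffer_fill_ms(baud, buffer_size=1024):
    """Milliseconds for a continuous stream to fill an unread buffer."""
    return 1000 * buffer_size / chars_per_second(baud)

for baud in (19200, 38400, 57600, 115200):
    print(baud, round(usec_per_char(baud), 1), round(buffer_fill_ms(baud), 1))
```

Under that assumption, 115200 baud leaves roughly 87 microseconds per character, and an unread 1024-byte buffer fills in about 89 ms.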

Mellvik commented 4 years ago

It is now possible to dump text/data into an ELKS window

What do you mean "ELKS window"? Are you using three wire directly from Linux to ELKS /dev/ttyS0, or through a terminal emulator, etc.

Sorry, I keep calling it that, that's what it is to me. Still, I don't understand your question. What's the difference? To me, any program that connects to a command line on another system is a terminal emulator of sorts. FWIW - I'm still using screen. From my perspective it doesn't matter. I'm looking at the raw data.

What exactly are you using for testing? That is, what program is sending the data, and do we know it is not buggy? Are you receiving using 'sercat', or "cat > file"? Those two programs have different buffer sizes before they ask ELKS to write the file, although in almost all cases a serial read() will return with fewer characters than asked for, depending on TTY mode. Sercat was written to use raw mode so as to not use excessive CPU with echoing, etc.

Yes, I'm aware of sercat, and I guess we've been down this road before. Using cat > /dev/null - with or without stty in front and back - has been my friend for serial testing for aeons, and works for me. And as before, to me losing characters when there is overload is a given, as long as we know where the leak is and why it's leaking. In fact, this may not be related to the serial driver at all. Actually, the symptoms being so similar to what we've seen before indicate that it isn't. And if that's the case, not even hardware flow control is going to make the connections reliable.

Finally, if the bug is completely repeatable at byte 8184, try sending an exactly 8192 byte file, rather than longer to see whether this still drops the last 8 bytes.

It does … @19200.

—Mellvik

ghaerr commented 4 years ago

Just trying to understand what you are actually doing...

So - you're using screen on Linux to send a file over the serial line from Linux directly to ELKS, where you've logged in and are running a shell, where you've typed "cat > /dev/null" to accept those characters, which are not being written to a file, but discarded to /dev/null. And then looking at the screen echoed characters (raw data) to determine what characters are lost, and where.

Very strange that the last 8 bytes are lost...

It might be interesting to use "cat > file" to write the data to ELKS disk instead of just visually inspecting it. I'm trying to get a handle on how we can know which exact bytes are lost on a large file by visually looking at it, when the screen itself only shows 1920 characters. It is possible, though improbable, that the direct console could be losing characters at high speed.
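Once the received data is saved to a file, the missing spans can be located mechanically instead of by eye. The following is a rough sketch that assumes the only errors are deletions (no corrupted or inserted bytes); `find_drops` and its `window` parameter are hypothetical names.

```python
def find_drops(sent, received, window=8):
    """Return (offset_in_sent, length) for each span missing from received.

    Assumes bytes were only dropped, never altered or inserted.
    """
    drops = []
    i = j = 0
    while i < len(sent) and j < len(received):
        if sent[i] == received[j]:
            i += 1
            j += 1
            continue
        # Mismatch: look ahead in `sent` for where `received` resumes.
        probe = received[j:j + window]
        k = sent.find(probe, i)
        if k < 0:
            raise ValueError(f"cannot resynchronise at sent offset {i}")
        drops.append((i, k - i))
        i = k
    if i < len(sent):              # data missing from the very end
        drops.append((i, len(sent) - i))
    return drops

sent = bytes(range(256)) * 4
print(find_drops(sent, sent[:100] + sent[113:]))  # [(100, 13)]
print(find_drops(sent, sent[:-8]))                # [(1016, 8)]
```

With repetitive test data the reported offset is where the first mismatch is seen, which can be slightly later than where the drop really began, so numbered or otherwise varied lines give cleaner results.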

Mellvik commented 4 years ago

I suggest switching driver when in raw mode, otherwise use the driver with normal tty processing.

While seemingly a good idea, be aware that sh uses the linenoise library, which switches from raw to cooked mode on every line of input (command). That is, logging in to ELKS from serial puts one in raw mode at the shell prompt. Just after reading each command, the mode is switched to cooked and fork/exec, etc. I don't think it necessarily a good idea to switch TTY drivers at the shell prompt without the user knowing.

Aha, makes sense. I agree. BTW I'm noticing some odd linenoise behaviour now that I'm occasionally looking at the raw data passing back and forth, such as hundreds of invisible ESC sequences after a cat > /dev/null (after transfer finished). Just a couple of occurrences and not repeatable, but odd indeed. I'll find more details next time I see it.

In any case, all of this can wait, the CONFIG_NEED_IRQ4 driver works very well for normal serial needs, and we're trying to determine whether the CONFIG_FAST_IRQ4 driver works well for higher speed (38400-115200) speed SLIP networking. Depending on the speed of the CPU, the driver should support one of the those higher speeds and not drop characters ever, if the received burst-length is < 1024 bytes. These were my design criterion for calling the fast driver a "success".

FWIW - I'm not disputing that characteristic. Nor the criterion. I'm reporting what I'm observing and I believe there is a problem still unsolved in there. And like I said in my previous message, it may well be somewhere else than in the serial driver.

I keep forgetting that the point of this issue was something else - the missing INTR, which to me is a serious problem. You've made it clear that using raw mode for a driver switch is not an option, and I support using some other stty-mode to switch (how about stty slip /-slip)??

BTW - I'm going to repeat the same tests with the 'regular' driver just for the heck of it. I'm still using the 286.

—Mellvik

Mellvik commented 4 years ago

Just trying to understand what you are actually doing...

So - you're using screen on Linux to send a file over the serial line from Linux directly to ELKS, where you've logged in and running a shell, where you've typed "cat > /dev/null" to accept those characters, which are not being written to a file, but discarded to /dev/null. And then looking at the screen echoed characters (raw data) to determine what characters are lost, and where.

That's it. And of course the loss may be in the output, the key is to find 'the leak'.

Very strange that the last 8 bytes are lost…

The predictability is a curse and a blessing - a confirmation of a problem and a pointer...

It might be interesting to use "cat > file" to write the data to ELKS disk instead of just visually inspecting it. I'm trying to get a handle on how we can know which exact bytes are lost on a large file by visually looking at it, when the screen itself only shows 1920 characters. It is possible, though improbable, that the direct console could be losing characters at high speed.

I have avoided that because the floppy is so slow, giving presumably unpredictable results. But now that I'm using the 286, I can mount a FAT filesystem and send the output there. I'll get back on that.

—Mellvik

ghaerr commented 4 years ago

BTW I'm noticing some odd linenoise behaviour now that I'm occasionally looking at the raw data passing back and forth, such as hundreds of invisible ESC sequences after a cat > /dev/null (after transfer finished). Just a couple of occurrences and not repeatable, but odd indeed. I'll find more details next time I see it.

sh, via linenoise, sends an "invisible" DEC sequence to read the cursor position when first started. This used to be sent at every command line prompt (I recently removed that). The sequence is sent to screen, which may or may not interpret it properly, and the result of the terminal emulator's cursor request is sent back to the shell, which interprets it and determines the line width. This sequence is always sent unless TERM=dumb. There is a lot going on under the hood with sh to effect the line editing. Log in as toor and use sash to eliminate all these extra variables in serial testing.

I keep forgetting that the point of this issue was something else - the missing INTR, which to me is a serious problem.

Yes. The CONFIG_FAST_IRQ4 driver was built for fast networking, and won't work well for shell access, since it doesn't support calling other kernel routines during serial input, which process any ISIG line characters.

I support using some other stty-mode to switch (how about stty slip /-slip)??

Something like that, yes. Linux uses "TTY line disciplines" which wholly switch the serial port to separate "disciplines" which don't use the older cooked/raw/isig/etc stty functionality for SLIP, for instance.
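The idea of switching the whole receive path per port, rather than keying off raw/cooked flags, can be sketched with a small dispatch table. This is a Python illustration only: the class, the function names and the `"tty"`/`"slip"` keys are assumptions (Linux calls its corresponding disciplines N_TTY and N_SLIP).

```python
INTR = 0x03  # ^C

def tty_receive(port, byte):
    """Interactive discipline: act on the interrupt character."""
    if byte == INTR:
        port.signals.append("SIGINT")
    else:
        port.queue.append(byte)

def slip_receive(port, byte):
    """Network discipline: every byte is payload, nothing is interpreted."""
    port.queue.append(byte)

DISCIPLINES = {"tty": tty_receive, "slip": slip_receive}

class SerialPort:
    def __init__(self):
        self.queue = bytearray()
        self.signals = []
        self.receive = tty_receive  # default discipline

def set_discipline(port, name):
    """Swap the per-port receive handler; stands in for 'stty slip'."""
    port.receive = DISCIPLINES[name]

def feed(port, data):
    for byte in data:
        port.receive(port, byte)
```

The switch is explicit and per port, so a shell prompt toggling between raw and cooked mode would not change which handler is in use.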

BTW - I'm going to repeat the same tests with the 'regular' driver just for the heck of it. I'm still using the 286.

Great, I believe that the regular driver could work for both networking and logins for speeds <= 19200. I'd like to hear about the results.

That's it. And of course the loss may be in the output, the key is to find 'the leak'.

BTW, I was using Linux 'screen' in the opposite direction, that is, to receive characters sent from ELKS to Linux. I was using miniterm on ELKS connected to screen on Linux, and just typing data back and forth. Believe it or not - screen loses characters when typed from miniterm, even very slowly. Characters typed from Linux to ELKS (screen to miniterm) were never lost. So for all I know there's a bug in screen. I haven't been able to determine whether the bug is in miniterm or screen. The very strange thing is that if, in the same session, I disconnect screen and instead turn on Linux getty, I can login to Linux from ELKS miniterm and never lose a character in either direction.

Because of things like this, I don't yet trust screen - and for all I know it could be failing to send the last 8 serial characters in your tests.

I have avoided that because the floppy is so slow, giving presumably unpredictable results.

Given that the test data is thankfully completely repeatable, it would be interesting to see the disk contents and whether that differs from the displayed contents.

Mellvik commented 4 years ago

BTW - I'm going to repeat the same tests with the 'regular' driver just for the heck of it. I'm still using the 286.

Great, I believe that the regular driver could work for both networking and logins for speeds <= 19200. I'd like to hear about the results.

The results are in - and completely exonerate the new driver. With the old driver, the behaviour is exactly the same at 4800 and 9600, and for a while @ 19200, until it completely breaks down.

That's it. And of course the loss may be in the output, the key is to find 'the leak'.

BTW, I was using Linux 'screen' in the opposite direction, that is, to receive characters sent from ELKS to Linux. I was using miniterm on ELKS connected to screen on Linux, and just typing data back and forth. Believe it or not - screen loses characters when typed from miniterm, even very slowly. Characters typed from Linux to ELKS (screen to miniterm) were never lost. So for all I know there's a bug in screen. I haven't been able to determine whether the bug is in miniterm or screen. The very strange thing is that if, in the same session, I disconnect screen and instead turn on Linux getty, I can login to Linux from ELKS miniterm and never lose a character in either direction.

Because of things like this, I don't yet trust screen - and it could be dropping sending the last 8 serial characters for all I know with your tests.

OK; sounds familiar, I have a theory about this. First, I've used screen for many purposes for quite some time (years) w/o a hitch, so I trust it. Which doesn't mean it's bug free. I've seen what you're describing many times, and the problem has always been me. Either I've had several screen instances running on the same port, unpredictably eating bytes from the stream, or there was a getty running even though I thought I had killed and disabled it, and it was eating half of my input or more.

I have avoided that because the floppy is so slow, giving presumably unpredictable results.

Given that the test data is thankfully completely repeatable, it would be interesting to see the disk contents and whether that differs from the displayed contents.

Agreed, I'll check that.

BTW (off topic), elks refuses to mount my FAT file system on /dev/bda4 (no such device). I haven't tried this for months, has there been any changes recently?

—Mellvik

Mellvik commented 4 years ago

Saving output to floppy:

the content on the disk is exactly as echoed during transfer
except for the first drop (this is @ 4800bps), the drops are now slightly larger, but they occur at exactly the same places in the file.
the first drop is 13 bytes, the next two (3 altogether) are 68 + 13, 68 being the line length.
the 13 byte drop is the same as recorded with the new driver - there is some magic to '13' in here somewhere.

Some predictability, but I have no clue at this time. I'll let it rest for now, to be revisited when the networking is fixed.

—Mellvik

ghaerr commented 4 years ago

the content on the disk is exactly as echoed during transfer

the 13 byte drop is the same as recorded with the new driver - there is some magic to '13' in here somewhere.

If the drop is the same and in exactly the same place using two different drivers - I would say that this issue is not in the serial driver.

It could be in the TTY driver, or somewhere up the "receive stack".

On the other hand, it could be in screen. Please test using screen connecting to a non-ELKS box, keeping everything else the same. Sorry, I am now very interested in this saga!

I would like to ask that you test using miniterm from ELKS logging on to Linux via serial with the old driver. It seems that it should work well (with the possible exception of 13 lost characters every 8048 bytes) all the way up to 19200 baud. That would still be an improvement over previous ELKS old driver, right?

I can't get any system I have here to drop any received data, ever, up through 19200 baud. At 38400 and 57600, the Compaq can't keep up with received data; this can be tested using ^S/^Q on ls -lR /bin listings and watching for a delay before stopping. My desktop 386 will keep up and never lose a character all the way through 115200 baud.

the 13 byte drop is the same as recorded with the new driver - there is some magic to '13' in here somewhere.

As I wrote the above, it occurred to me - FIFO. It is possible that screen isn't buggy, but instead the problem is in the FIFO. I don't have any FIFO UARTs on any of my systems. Both the new and old drivers turn on the FIFO if present, even though the new driver only reads one character per interrupt (and gets another interrupt immediately thereafter if the FIFO is not empty). The FIFO is programmed to interrupt at character 14.

Another thought is that the last 13 characters in the FIFO do not produce an interrupt and do not timeout, for some reason, until the 14th character is received.
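The hypothesis can be modelled in a few lines. This is a Python sketch of the theory only, not of the ELKS driver or of real UART timing: `deliver` and `timeout_works` are made-up names, and the trigger level of 14 is the value mentioned above. A 16550 normally raises a character-timeout interrupt when bytes sit in the FIFO below the trigger level; the flag models that interrupt failing to arrive.

```python
def deliver(n_bytes, trigger=14, timeout_works=True):
    """Return (delivered, stuck) after n_bytes arrive and the line goes idle."""
    level = delivered = 0
    for _ in range(n_bytes):
        level += 1
        if level >= trigger:      # trigger-level interrupt: handler drains FIFO
            delivered += level
            level = 0
    if level and timeout_works:   # idle-line timeout flushes the remainder
        delivered += level
        level = 0
    return delivered, level

print(deliver(27))                       # (27, 0)
print(deliver(27, timeout_works=False))  # (14, 13)
```

In this model at most 13 bytes can be left behind, which is what makes the number suspicious, but it only applies to a UART that actually has a FIFO.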

To test this theory, turn off HW FIFO in config and recompile the kernel. Then retest. I'm still interested in 'screen' results to a non-ELKS system, just to eliminate that variable.

I've seen what you're describing many times, and the problem has always been me. Either I've had several screen instances running on the same port, unpredictably eating bytes from the stream, or there was a getty running even though I thought I had killed and disabled it, and it was eating half of my input or more.

Thanks. I tried turning off getty, which didn't fix it. I didn't consider screen could be secretly running another instance, but didn't see it in ps. I'll check that again.

ghaerr commented 4 years ago

elks refuses to mount my FAT file system on /dev/bda4 (no such device). I haven't tried this for months, has there been any changes recently?

I don't think so... take a screenshot of the boot screen showing the partition(s) found/rejected on the drive and open another issue if you can't get any FAT file system mounted. Try a few alternative images first, thanks!

Mellvik commented 4 years ago

elks refuses to mount my FAT file system on /dev/bda4 (no such device). I haven't tried this for months, has there been any changes recently?

I don't think so... take a screenshot of the boot screen showing the partition(s) found/rejected on the drive and open another issue if you can't get any FAT file system mounted. Try a few alternative images first, thanks!

Solved:

mount -t MSDOS /dev/bda4 /mnt

mount failed: No such device

mount -t msdos /dev/bda4 /mnt

FAT: me=f8,csz=4,#f=2,floc=1,fsz=34,rloc=69,#d=512,dloc=101,#s=34000,ts=34000
FAT: 17M, fat16 format

Memory lapse (and confusing error message, added to the todo list).

—Mellvik

Mellvik commented 4 years ago

@ghaerr,

This issue is sending me down new paths, discovering new challenges.

the content on the disk is exactly as echoed during transfer

the 13 byte drop is the same as recorded with the new driver - there is some magic to '13' in here somewhere.

If the drop is the same and in exactly the same place using two different drivers - I would say that this issue is not in the serial driver.

Well, yes - I wanted to vindicate both the new and old drivers. Then I started running into miscellaneous problems.

I would like to ask that you test using miniterm from ELKS logging on to Linux via serial with the old driver. It seems that it should work well (with the possible exception of 13 lost characters every 8048 bytes) all the way up to 19200 baud. That would still be an improvement over previous ELKS old driver, right?

It's a big improvement indeed.

I can't get any system I have here to drop any received data, ever, up through 19200 baud. At 38400 and 57600, the Compaq can't keep up with received data; this can be tested using ^S/^Q on ls -lR /bin listings and watching for a delay before stopping. My desktop 386 will keep up and never lose a character all the way through 115200 baud.

This is interesting. It's also interesting that the regular factors affecting performance have no (or opposite) effect. I tried minicom, with mixed results: the 17k file transferred w/o losses the first time @ 9600, and it feels like it's slower, i.e. that there is some kind of delay even though the config doesn't add one. Adding 'time' in front of the cat command, then repeating the paste into minicom, gives lots of losses (40%) (this is the old driver). I need to experiment a little more with this in order to understand the symptoms.

the 13 byte drop is the same as recorded with the new driver - there is some magic to '13' in here somewhere.

As I wrote the above, it occurred to me - FIFO. It is possible that screen isn't buggy, but instead the problem is in the FIFO. I don't have any FIFO Uarts on any of my systems. Both the new and old drivers turn on the FIFO if present, even though the new driver only reads one character per interrupt (and gets another interrupt immediately thereafter if the FIFO is not empty). The FIFO is programmed to interrupt at character 14.

Another thought is that the last 13 characters in the FIFO do not produce an interrupt and do not timeout, for some reason, until the 14th character is received.

To test this theory, turn off HW FIFO in config and recompile the kernel. Then retest. I'm still interested in 'screen' results to a non-ELKS system, just to eliminate that variable.

Now this is where the fun starts. I'm using the compaq builtin serial which is 16450, no fifo, so that's ruled out. The idea sent me to test on my 16550A ports, which was utterly unsuccessful. Dumping the 17k file into the Screen window echoes back about 20 chars, then hangs until reboot. So I reconfigured the system to take out FIFO support and now the driver doesn't work at all - for the 16550 ports. The 16450 is unaffected.

IOW - I need to do some more testing to narrow this down - do you by any chance have a FIFO serial card yet? Maybe in the stash?

BTW I'm going to do some more minicom tests, and it would be interesting if you tested a screen-based connection and pasted an xxd or hd output to ELKS to see where it breaks - say @ 19200. Since I'm getting exactly the same results on different machines with very different performance characteristics, we have strong indications that the problem is on either side but not in the driver. Except for the FIFO-related issues, which also need more testing.

—Mellvik

Mellvik commented 4 years ago

Now this is where the fun starts. I'm using the compaq builtin serial which is 16450, no fifo, so that's ruled out. The idea sent me to test on my 16550A ports, which was utterly unsuccessful. Dumping the 17k file into the Screen window echoes back about 20 chars, then hangs until reboot. So I reconfigured the system to take out FIFO support and now the driver doesn't work at all - for the 16550 ports. The 16450 is unaffected.

FYI - I'm working to rule out any hardware related issues on this. Update later.

—Mellvik

ghaerr commented 4 years ago

IOW - I need to do some more testing to narrow this down - do you by any chance have a FIFO serial card yet? Maybe in the stash?

No - not yet. However, you successfully tested the FIFO code earlier, so it could well be hardware related issues. The FIFO code was not modified in the new or old drivers.

I also strongly suggest running getty on Linux and running miniterm from ELKS to Linux, just to test basic connectivity before getting into fast dumps, etc. As I mentioned earlier, doing an "ls -lR /" or something like that is a great way to quickly tell how your incoming serial stream to ELKS is working, and hitting ^S gives a solid indication of how far behind ELKS is from processing the serial characters received (^S is interpreted by Linux not ELKS and a small-to-large stream of characters will continue in miniterm depending on how behind ELKS is). What we're looking for is immediate action on ^S/^Q, rather than delayed.

Mellvik commented 4 years ago

Update on this issue - mostly good news. A lot of time wasted and a long journey that started with connecting 3 serial lines to the ELKS box (386/20) - and ended with me involuntarily becoming a USB serial line (and screen) expert.

The good news: The serial driver - regular/normal version - works well, even at high speed (38400), in the absence of 'disturbing' interrupts (other system activity). Occasional system hangs under heavy load have not proven repeatable.

Useful observations:

When overloaded on input, screen() does indeed discard characters, as suggested by @ghaerr. Overload = big speed difference between input and output, in our case paste speed vs. serial speed. This behaviour is documented in the screen man page. Changing the obuflimit setting in screen does not affect this behaviour. This explains why we observe increasing losses with decreasing speed. The 'overload' comes from pasting large amounts of data into a Mac window containing the screen program running on RaspberryPi Linux, which in turn connects to the ELKS machine via serial. The behaviour has been verified via a back-to-back USB serial connection on the Pi (see below), no ELKS involved: Pasting data works perfectly at 115200, loses characters at 4800.
Minicom and cu do not share this behaviour (and venerable cu has been quite valuable during testing). Still, using screen with a local paste buffer turned out to be the most effective tool: ^A:readbuf to load, then ^A] to paste.
There is no noticeable difference between the 16450 and the 16550(A) compiled w/FIFO. Both take a 17k file @ 19200 w/echo without loss, some loss @ 38400. With no echo (sercat), both take the entire file w/o loss.
When FIFO support is removed via menuconfig, the performance of the 16550s degrades significantly and frequently hangs in such a way that a reboot is required in order to get the port working again. It may be an idea to remove this option.
When several terminals (serial lines) are active, there are occasional hard hangs. To be investigated. It is unclear whether this is related to serial i/o at all.
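For a rough sense of the timing budget behind these speeds, here is a minimal sketch. It assumes 8N1 framing (10 bits on the wire per character), which is not stated in this thread, and the function names are made up for illustration. The `chars_per_irq` parameter models a 16450 (one character per interrupt) versus a 16550A whose receive FIFO trigger level is set higher (the 16550A offers trigger levels of 1, 4, 8 and 14 bytes).

```c
#include <stdio.h>

/* Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop bit = 10 bits per character. */
#define BITS_PER_CHAR 10UL

/* Characters per second the line can carry at a given baud rate. */
unsigned long chars_per_second(unsigned long baud)
{
    return baud / BITS_PER_CHAR;
}

/* Microseconds between receive interrupts at full line rate when the UART
 * raises one interrupt per `chars_per_irq` received characters
 * (1 for a 16450, up to the FIFO trigger level for a 16550A). */
unsigned long usec_between_irqs(unsigned long baud, unsigned long chars_per_irq)
{
    return (1000000UL * BITS_PER_CHAR * chars_per_irq) / baud;
}

/* Print a small table for the speeds mentioned in this thread. */
void print_serial_budget(void)
{
    static const unsigned long bauds[] = { 4800UL, 19200UL, 38400UL, 115200UL };
    size_t i;

    for (i = 0; i < sizeof bauds / sizeof bauds[0]; i++) {
        printf("%6lu baud: %5lu chars/s, %5lu us per IRQ (no FIFO), %6lu us per IRQ (trigger 14)\n",
               bauds[i],
               chars_per_second(bauds[i]),
               usec_between_irqs(bauds[i], 1UL),
               usec_between_irqs(bauds[i], 14UL));
    }
}
```

Calling `print_serial_budget()` from a `main()` prints the table. The point is only that the per-character service window shrinks quickly as the baud rate rises, so any other interrupt activity eats into it.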

Hardware-related

[This is not about ELKS, just about serial, USB and experience] Part of the test was to check out ttyS1 and ttyS2 - the 16550 based UARTS. I pulled out a couple of USB serial cables from a drawer, some 232MAX rs232 converters (they yank up the signal voltages from TTL to rs232, which is 10-15V. (pic)) A combination I haven't used much but it seemed to work fine. 3 active serial lines on the ELKS box - plus console, all with gettys running, not bad. Until I started to paste data into the USB lines. The first character OK, the rest garbage. Pasting 10 characters only? No deal. Two? No deal. keyboard repeat? OK, 10 or 20 cps, no problem. A long search commenced - It couldn't be faulty HW since two units had exactly the same symptoms, and the serial ports themselves tested OK using the PI's builtin serial line. Adding to the mystery, a slightly different converter pulled from the same drawer worked fine: And all reported the same chip in dmesg() on Linux - Prolific 2303. Only after consulting the schematics for the RS232 level converter, the problem revealed itself: The power coming out of the USB connection is 5V, the 232MAX chip requires 3.3V. It works - sort of - with 5V, but just barely. Enough to send me on a two day journey which - in retrospect - should have been max 2 hrs. You have been warned :-) !

--Mellvik

ghaerr commented 4 years ago

Thanks @Mellvik for your testing on the serial ports! Interesting observations.

The serial driver - regular/normal version - works well, even at high speed (38400).

The regular serial driver has also worked up through 38400 on both my test systems as well. Even though I am now operating and testing SLIP networking using 115200 on the faster 386 desktop with the "fast" driver, I am thinking of reverting to the "regular" serial driver and 38400 baud SLIP as the default configuration for ELKS. This will allow for SLIP to work out of the box on most systems, as well as for ^C to work when the serial port(s) are used for ELKS logins without custom .config modifications. The "fast" serial driver can be selected when higher baud rates are desired.

When overloaded on input, screen() does indeed discard characters

Good to know that screen is in fact buggy. I have debugged the serial driver using "ls -lR /" and minicom, which allows easy inspection of visual columns and ^S/^Q lag handling to determine usability, along with SLIP checksum errors for serial networking limits.

With no echo (sercat), both take the entire file w/o loss.

Thanks for testing with sercat. In addition to no echo, it also sets the VMIN/VTIME termios values the same way that the ktcp SLIP driver does, which waits until at least a single character is received, but also empties the ring buffer if more characters have been received.
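As an illustration of that kind of setup, here is a minimal sketch written against standard POSIX termios. It is not the actual sercat or ktcp code; the function names are hypothetical, and VMIN = 1 with VTIME = 0 is an assumption about the values meant above. With those values a read() blocks until at least one character is available and then returns whatever else is already buffered, up to the requested count.

```c
#include <termios.h>

/* Adjust an already-fetched termios struct for raw, no-echo input.
 * Assumption: VMIN = 1 / VTIME = 0; the real sercat/ktcp values are
 * not shown in this thread. */
void set_raw_noecho(struct termios *t)
{
    t->c_lflag &= ~(ECHO | ICANON | ISIG);  /* no echo, no line editing, no ^C signal */
    t->c_iflag &= ~(IXON | IXOFF | ICRNL);  /* pass ^S/^Q and CR through untouched */
    t->c_oflag &= ~OPOST;                   /* no output post-processing */
    t->c_cc[VMIN] = 1;                      /* block until at least one character */
    t->c_cc[VTIME] = 0;                     /* no inter-character timer */
}

/* Apply the settings to an open tty descriptor. Returns 0 on success, -1 on error. */
int apply_raw_noecho(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;
    set_raw_noecho(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

Note that clearing ISIG is what makes ^C arrive as an ordinary byte instead of generating SIGINT, which is the desired behaviour for a binary stream such as SLIP but not for a login shell.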

When FIFO support is removed via menuconfig, the performance of the 16550s degrades significantly and frequently hangs in such a way that a reboot is required in order to get the port working again. It may be an idea to remove this option.

I don't know why the 16550's don't operate well when their FIFO is not enabled, other than it could be the "regular" serial driver may have too much overhead. I will test further on my systems, neither of which have hardware FIFO, before reverting to the "regular" driver as described above.

We can't yet remove the FIFO config option, as sadly, I have been seeing QEMU operate badly when FIFO is enabled. There is another QEMU bug with the "fast" driver where the serial ports start delaying a character of input after a minute or so of QEMU running... I'm working on finding out the reason; I haven't yet determined if this is only with the new driver or not.

When several terminals (serial lines) are active, there are occasional hard hangs

Does this happen only with COM1 and COM2, or does it require 3 serial lines or COM3 in the mix? I am most interested to understand whether this happens just with two com ports on IRQ 4 and 3 or not.

...the problem revealed itself: The power coming out of the USB connection is 5V, the 232MAX chip requires 3.3V. It works - sort of - with 5V, but just barely.

I guess this means that the 232MAX board can't be powered from a standard USB cable, and needs its own power supply or a hacked-in resistor to get the voltage down to 3.3V, right? I guess that's the fault of the 232MAX board design, unless its instructions clearly state it needs 3.3V Vcc.

Thank you!

Mellvik commented 4 years ago

When overloaded on input, screen() does indeed discard characters

Good to know that screen is in fact buggy. I have debugged the serial driver using "ls -lR /" and minicom, which allows easy inspection of visual columns and ^S/^Q lag handling to determine usability, along with SLIP checksum errors for serial networking limits.

No, this is a documented 'feature', like the issue we had with tcpdump a while back. Screen is still my preferred tool for this kind of testing, for a number of reasons. We all have our preferences.

When FIFO support is removed via menuconfig, the performance of the 16550s degrades significantly and frequently hangs in such a way that a reboot is required in order to get the port working again. It may be an idea to remove this option.

I don't know why the 16550's don't operate well when their FIFO is not enabled, other than it could be the "regular" serial driver may have too much overhead. I will test further on my systems, neither of which have hardware FIFO, before reverting to the "regular" driver as described above.

Ok I'll keep this one on my list.

We can't yet remove the FIFO config option, as sadly, I have been seeing QEMU operate badly when FIFO is enabled. There is another QEMU bug with the "fast" driver where the serial ports start delaying a character of input after a minute or so of QEMU running... I'm working on finding out the reason; I haven't yet determined if this is only with the new driver or not.

Got it.

When several terminals (serial lines) are active, there are occasional hard hangs

Does this happen only with COM1 and COM2, or does it require 3 serial lines or COM3 in the mix? I am most interested to understand whether this happens just with two com ports on IRQ 4 and 3 or not.

This isn't really (AFAIK) serial related, but interactive activity makes it easy to observe. Console and one serial is fine for testing. Say, set ttyS0 to 4800 bps and start hd /bin/vi (which takes a while), then do some work on the console (I didn't do that, I used a second serial). I guess we need to do some testing of the role disk/floppy i/o plays in this too.

...the problem revealed itself: The power coming out of the USB connection is 5V, the 232MAX chip requires 3.3V. It works - sort of - with 5V, but just barely.

I guess this means that the 232MAX board can't be powered from a standard USB cable,

That's right.

and needs its own power supply or a hacked-in resistor to get the voltage down to 3.3V, right?

Might work, but since the current draw varies, a resistor is not really good enough. You need a voltage regulator, really a transistor and one or two resistors, which is what the unit in the 2nd pic has. In my case it's just an extra cable from the Pi.

I guess that's the fault of the 232MAX board design, unless its instructions clearly state it needs 3.3V Vcc.

Not really, this is well documented, the problem was my stupidity combined with the apparently great fit between the two. :-) One takeaway is that TTL these days is not 5V but 3.3!

Anyway, back to the ne2k...

--M

ghaerr commented 4 years ago

This isn't really (AFAIK) serial related, but interactive activity makes it easy to observe. Console and one serial is fine for testing. Say, set ttyS0 to 4800 bps and start hd /bin/vi (which takes a while), then do some work on the console (I didn't do that, I used a second serial). I guess we need to do some testing of the role disk/floppy i/o plays in this too.

We need to remove the variable of using a second serial port to see whether hard hangs happen without two serial ports running.

There should be no cases of hard hangs, but it could be possible that the "regular" serial driver isn't fully reentrant yet. (The "fast" driver runs with interrupts off, so no problems there, but it may be worth trying both serial ports in "fast" mode.)

With regards to problems with serial I/O or system hang during disk I/O - unfortunately, ELKS always uses the BIOS for floppy I/O, and results could vary widely depending on the BIOS implementation. I suppose it is possible that a consistently occurring interrupt (in addition to clock) could cause problems during BIOS disk I/O. Note though that all ELKS interrupt driver C code, (except the "fast" serial interrupt), run with interrupts ENABLED, they are disabled only for short durations of critical register saving operations. Also, the serial driver basically busy-loops when transmitting all characters of a write to /dev/ttyS0, and currently doesn't allow the task to be switched until each write block is transmitted. We would need to switch to interrupt-driven transmit in order to change this.
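To make the difference concrete, here is a minimal sketch of what interrupt-driven transmit could look like. This is not ELKS code: the queue layout, the names `txq_write` and `txq_isr`, the queue size, and the `uart_put` stand-in for the UART transmit register are all hypothetical. The writer only queues bytes and returns, and a transmit-empty interrupt drains the queue one byte at a time, so the task is not stuck in a busy-loop for the duration of the write.

```c
#include <stddef.h>

#define TXQ_SIZE 64U   /* hypothetical size; must be a power of two */

struct txq {
    unsigned char buf[TXQ_SIZE];
    unsigned int head;   /* next slot to fill (writer side) */
    unsigned int tail;   /* next slot to send (interrupt side) */
};

/* Hypothetical stand-in for writing the UART transmit holding register;
 * it just records the bytes so the sketch can run without hardware. */
static unsigned char sent[256];
static size_t nsent;

static void uart_put(unsigned char c)
{
    if (nsent < sizeof sent)
        sent[nsent++] = c;
}

/* Writer side: queue as many bytes as fit and return immediately.
 * Returns the number of bytes queued; a real driver would sleep or
 * retry for the remainder instead of spinning. */
size_t txq_write(struct txq *q, const unsigned char *p, size_t n)
{
    size_t done = 0;

    while (done < n && (q->head - q->tail) < TXQ_SIZE)
        q->buf[q->head++ & (TXQ_SIZE - 1U)] = p[done++];
    return done;
}

/* Interrupt side: called when the transmitter is ready for another byte.
 * Returns 1 if a byte was sent, 0 if the queue was empty (a real driver
 * would then mask the transmit interrupt until more data is queued). */
int txq_isr(struct txq *q)
{
    if (q->head == q->tail)
        return 0;
    uart_put(q->buf[q->tail++ & (TXQ_SIZE - 1U)]);
    return 1;
}
```

With one writer and one interrupt handler, each index is modified from only one side, but a real kernel implementation would still need to consider interrupt masking around the index updates and waking the sleeping writer when space frees up.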

Currently, it appears that all the complex asynchronous buffer-management and async I/O routines are implemented in ELKS, but the BIOS floppy driver always waits synchronously for all I/O since it passes control away from ELKS to the BIOS. A received timer or other hw interrupt is coded to not switch stacks (to another process) when the interrupted code was already in the kernel, BIOS, or another interrupt.

I plan to add disk I/O during heavy inbound and outbound serial I/O to my testing to get a better feel for this.

Mellvik commented 4 years ago

This isn't really (AFAIK) serial related, but interactive activity makes it easy to observe. Console and one serial is fine for testing. Say, set ttyS0 to 4800 bps and start hd /bin/vi (which takes a while), then do some work on the console (I didn't do that, I used a second serial). I guess we need to do some testing of the role disk/floppy i/o plays in this too.

We need to remove the variable of using a second serial port to see whether hard hangs happen without two serial ports running.

Apologies, @ghaerr - I misread your comment so my answer was really off the point. The 'demo' I suggested was to show how other system activities affect (in this case) the flow of serial output. This particular case (again, serial output) would be vastly improved by having interrupt-driven serial output.

There should be no cases of hard hangs, but it could be possible that the "regular" serial driver isn't fully reentrant yet. (The "fast" driver runs with interrupts off, so no problems there, but it may be worth trying both serial ports in "fast" mode.)

Agreed, and wherever the real source of this hang is, it makes sense to take advantage of serial interrupts consistently triggering it.

With regards to problems with serial I/O or system hang during disk I/O - unfortunately, ELKS always uses the BIOS for floppy I/O, and results could vary widely depending on the BIOS implementation. I suppose it is possible that a consistently occurring interrupt (in addition to clock) could cause problems during BIOS disk I/O. Note though that all ELKS interrupt driver C code, (except the "fast" serial interrupt), run with interrupts ENABLED, they are disabled only for short durations of critical register saving operations. Also, the serial driver basically busy-loops when transmitting all characters of a write to /dev/ttyS0, and currently doesn't allow the task to be switched until each write block is transmitted. We would need to switch to interrupt-driven transmit in order to change this.

I think we should do that - eventually.

Currently, it appears that all the complex asynchronous buffer-management and async I/O routines are implemented in ELKS, but the BIOS floppy driver always waits synchronously for all I/O since it passes control away from ELKS to the BIOS. A received timer or other hw interrupt is coded to not switch stacks (to another process) when the interrupted code was already in the kernel, BIOS, or another interrupt.

I plan to add disk I/O during heavy inbound and outbound serial I/O to my testing to get a better feel for this.

This may be the incentive we need to revisit the 'real' floppy/HD drivers … again eventually. I've been noticing that while floppy I/O is active, NIC interrupts happen but don't get serviced until after floppy i/o completion. This contributes to quite frequent buffer overflow situations on the NIC side (of course I'm pushing it in the name of robustness …).

Anyway, I'll revisit the serial issues and see if I can get a better take on the pretext for the (hard) hang situations. [It's a time consuming test. The clunker is in the garage, I'm in my office enjoying serial line access - except when I have to flip the power switch. A remote (iPhone) controlled power switch is on order …]

—Mellvik