ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086
Other
1k stars 108 forks source link

getty/serial test report #515

Closed Mellvik closed 4 years ago

Mellvik commented 4 years ago

Summary: Close, but no cigar Testing on physical HW & virtualbox, the latter turns out to lend itself well to serial testing: Serial line set up to listen to a TCP port, which can be connected to using netcat.

  1. There is a problem with the serial driver (or maybe somewhere in the call chain from open() and down to physical) as we've seen before: sash < /dev/ttyS0 > /dev/ttyS0 & should work. When it eventually does, we're almost there. The error message is Cannot create /dev/ttyS0: error 16. Replacing sash with getty or login root doesn't make any difference. Neither does changing the shell we're running from. Replacing ttyS0 with tty2 or tty3 (virtual consoles) works fine. Further - sash < /dev/ttyS0 & works, takes input from serial, output to the console. And vice versa if using ttyS0 for output instead. Seemingly, when the device can accept several opens, this is going to work. (?)
  2. There seem to be a stdio-problem and possibly a file descriptor inheritance problem in getty. The device in arg[1] is never opened by getty. Thus starting getty writes to and reads from the console regardless. More about that below.
  3. Starting getty from inittab 'hangs' (loops) the system. All testing is done from the command line.
  4. Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works. There are some minor problems with the terms settings - to be sorted out. I cannot figure out why dup2 is accepted but not the shell redirects.

--Mellvik

Skjermbilde 2020-04-03 kl  11 04 32
Mellvik commented 4 years ago

UPDATE: Some adjustments to getty, and serial login is working via inittab.

Still some minor issues (commands echoed twice), but seemingly - when started from inittab, the device gets opened only once, and we're OK.

Great work, @ghaerr !!

ghaerr commented 4 years ago

Some adjustments to getty, and serial login is working via inittab.

Good news! What kind of changes were made to getty?

A few explanations on your statements above:

The error message is Cannot create /dev/ttyS0: error 16.

Errno 16 is device busy, /dev/ttyS0 is protected against being opened twice. Trying to run a shell with both 0 and 1 redirected isn't really the way to do it.

There seem to be a stdio-problem and possibly a file descriptor inheritance problem in getty. The device in arg[1] is never opened by getty.

Long story short on this is that /bin/init opens the argument to getty. There is some special processing of /etc/inittab by 'init' and things aren't as simple as they seem. See 'sys_utils/init.c' for details. We probably shouldn't change init processing at this point.

After the device is opened by init, it execs the inittab-specified program, usually getty, and now getty sets the termio struct as well as the baud rate from the command line (but the device is already opened).

Starting getty from inittab 'hangs' (loops) the system.

Is this fixed?

Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works.

/bin/init opens the device initially as stated above. I would like to see specifically what was required to get this working. 'init' does the same thing replacing 0,1,2 using dup2.

I cannot figure out why dup2 is accepted but not the shell redirects.

That's the protection in the serial driver that prohibits multiple opens. 'dup2' just duplicates a file descriptor in the kernel without re-opening the device.

To summarize - the changes in the device open and dup2 calls you've made to get this to work may need to be changed in 'init', not 'getty'.

Mellvik commented 4 years ago

@ghaerr, it’s really encouraging being so close to having serial login - and thus full serial support - working. Final physical machine testing is pending - surely just adjustments.

The testing / debugging has left some - from my perspective - interesting experiences, summarized below.

  1. apr. 2020 kl. 17:19 skrev Gregory Haerr notifications@github.com:

Some adjustments to getty, and serial login is working via inittab.

Good news! What kind of changes were made to getty?

When everything works, your fixed getty works perfectly. My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug: getty cannot be run from the command line (because of the single open policy, more on that below), and when run from init, it loops - unless everything on the hardware side is perfect. More on that too below. A few explanations on your statements above:

The error message is Cannot create /dev/ttyS0: error 16.

Errno 16 is device busy, /dev/ttyS0 is protected against being opened twice. Trying to run a shell with both 0 and 1 redirected isn't really the way to do it.

I don’t agree. The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible). Making serial tty different from this breaks reasonable expectations, breaks with history and complicates things (like this example). Further - for serial lines, multiple opens has been a requirement: Using the same line for inbound and outbound connections for example. Having SLIP ‘block’ getty while active would be such a situation. Also, being able to redirect from the shell like showed, has saved my day too many times to remember, in particular when working on serial lines. In other words, my suggestion is to change this. There seem to be a stdio-problem and possibly a file descriptor inheritance problem in getty. The device in arg[1] is never opened by getty.

Long story short on this is that /bin/init opens the argument to getty. There is some special processing of /etc/inittab by 'init' and things aren't as simple as they seem. See 'sys_utils/init.c' for details. We probably shouldn't change init processing at this point.

After the device is opened by init, it execs the inittab-specified program, usually getty, and now getty sets the termio struct as well as the baud rate from the command line (but the device is already opened).

I got this eventually, and indeed this is the traditional unix way, while Linux seem to have deviated with mingetty. Anyway, my proposal for getty is to have it check its origin so to speak: Either use parent process id or compare the name of stdin to argv[1]. That way getty (and the environment around it) becomes debuggable. And it is a minor change. Starting getty from inittab 'hangs' (loops) the system.

Is this fixed?

Yes and no. For obvious reasons, getty is very sensitive to errors. Any open or read/write error (almost) will cause it to exit, making init loop and the system appears dead. There are probably more ways to fix this than I can think of, but I guess the ’normal’ way is to have init detect the looping and throttle. Or somehow make getty more robust... Opening the device explicitly in getty and then replacing stdio 0,1,2 before the execv using dup2() works.

/bin/init opens the device initially as stated above. I would like to see specifically what was required to get this working. 'init' does the same thing replacing 0,1,2 using dup2.

I cannot figure out why dup2 is accepted but not the shell redirects.

That's the protection in the serial driver that prohibits multiple opens. 'dup2' just duplicates a file descriptor in the kernel without re-opening the device.

To summarize - the changes in the device open and dup2 calls you've made to get this to work may need to be changed in 'init', not 'getty'.

Maybe - I don’t know ELKS init, so I don’t have an opinion on that. Since I now know getty, that way seems simpler ...

Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.

—Mellvik

ghaerr commented 4 years ago

When everything works, your fixed getty works perfectly. unless everything on the hardware side is perfect.

What exactly is not working with my last PR? Is it just that it can't be debugged easily when the hardware side isn't perfect?

My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug

Agreed. I added a DEBUG option to both init and getty for this reason. Try turning them on (start with one at a time), it will help a lot. That was the only way I was able to see what was actually going on, finally.

The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible).

Ok. The tty code could be changed on default to allow duplicate opens (at least for root), and the ktcp daemon could use O_EXCL on open to protect itself. I can add both those things.

while Linux seem to have deviated with mingetty.

I'm not that familiar with mingetty, and ELKS has one too. I'd say we should probably stick with getty for now and just get the few things needed to get this working.

Since I now know getty, that way seems simpler ...

Is there something that needs to be added to getty? What exactly?

Thanks for your testing, we're almost there!

ghaerr commented 4 years ago

elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available.

What do you mean 'terminal window'? You mean from a /dev/ttyS0 line? Or from ELKS while connected to another system using 'miniterm'?

I suspect the reason may have to do with an ioctl to get the window size not being implemented on a non-/dev/ttyX line. Let me know more details and we'll figure it out.

ghaerr commented 4 years ago

Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.

I spent some time tracking this down... /bin/init calls getty which calls login. Login sets the USER=, SHELL=, HOME= and TERM=ansi environment variables, then execs the shell specified in /etc/passwd. There is no other way that elvis gets the LINES/COLS. None of SIGWINCH nor the TIOCGWINSZ ioctl are implemented in ELKS. If login doesn't run, elvis won't work unless you set TERM=ansi explicitly beforehand.

Thus, I think your issue is that elvis is being run when the entire init->getty->login->shell sequence didn't happen. When it happens again, run printenv and take a look.

Mellvik commented 4 years ago

@ghaerr, This is a normal (full) login. The environ is set normally. Let me know if there is anything specific to test to narrow it down. Screenshot to be uploaded separately. I don't think they're related, but FWIW: a stty problem report is coming separately.

--Mellvik

  1. apr. 2020 kl. 04:25 skrev Gregory Haerr notifications@github.com:

 Finally - sort of unrelated, but related anyway: elvis won’t run in a terminal window, complains about Screen too small. TERM is OK and /etc/termcap is available. I have that on my list, but if you immediately have a hunch about the problem, let me know.

I spent some time tracking this down... /bin/init calls getty which calls login. Login sets the USER=, SHELL=, HOME= and TERM=ansi environment variables, then execs the shell specified in /etc/passwd. There is no other way that elvis gets the LINES/COLS. None of SIGWINCH nor the TIOCGWINSZ ioctl are implemented in ELKS. If login doesn't run, elvis won't work unless you set TERM=ansi explicitly beforehand.

Thus, I think your issue is that elvis is being run when the entire init->getty->login->shell sequence didn't happen. When it happens again, run printenv and take a look.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

Mellvik commented 4 years ago

BTW - this is from a Virtualbox connection via netcat. In my tribe language, this is a 'terminal window'. I haven't checked why we have the extra 'init's - coming back to that.

Skjermbilde 2020-04-05 kl  10 51 05
Mellvik commented 4 years ago

@ghaerr, just some clarifications:

When everything works, your fixed getty works perfectly. unless everything on the hardware side is perfect.

What exactly is not working with my last PR? Is it just that it can't be debugged easily when the hardware side isn't perfect?

Getty works fine, there is nothing wrong with the PR. It’s just not helpful. My adjustments are proposals, because as is, the serial/getty/inittab combination is impossible to debug

Agreed. I added a DEBUG option to both init and getty for this reason. Try turning them on (start with one at a time), it will help a lot. That was the only way I was able to see what was actually going on, finally.

The unix way is allowing multiple opens and then requesting exclusivity when required (and if possible).

Ok. The tty code could be changed on default to allow duplicate opens (at least for root), and the ktcp daemon could use O_EXCL on open to protect itself. I can add both those things.

Thanks - appreciated. while Linux seem to have deviated with mingetty.

I'm not that familiar with mingetty, and ELKS has one too. I'd say we should probably stick with getty for now and just get the few things needed to get this working.

I agree. Since I now know getty, that way seems simpler ...

Is there something that needs to be added to getty? What exactly?

As suggested in the previous message, adding the ability to run getty from the command line turns it into a useful tool for debugging serial connections. A simple test of ppid != 1, or maybe just filename(stdin) != argv[1] and then open argv[1] does it. My version of getty does that - wrapped by the DEBUG #if (using the ppid!=0 method). Thanks for your testing, we're almost there!

Yes, this is very encouraging. —Mellvik

ghaerr commented 4 years ago

As suggested in the previous message, adding the ability to run getty from the command line turns it into a useful tool for debugging serial connections.

Agreed. Please post your modified getty.c here, along with exactly how you want to run it for debugging serial connections and I'll look at it. We need to be careful as there are two other serial line issues not debugged: the init recycling an elvis not running.

ghaerr commented 4 years ago

For the elvis issue and your screenshot: Thanks, very strange.

Lets debug the multiple init processes first - this is definitely because an /etc/inittab process died or failed to start and init is re-running it. Please send over a screenshot of your /etc/inittab or post it. You might try running init in DEBUG mode first, to get a better idea of what is failing.

Mellvik commented 4 years ago

[moved from #523] @Mellvik:

I'm not sure this is a problem, but it makes me suspicious. Serial login testing means among other things running ps(1) frequently. I'm noticing process numbers increasing fast while I'm not running many processes. Also, most of the time I see extra init processes in the ps list (see screen dump).

@ghaerr:

This is most definitely a problem. The reason you don't see anything is that I fixed the process IDs from going negative so the system continues without you noticing. The issue is exactly the same as before - init is waiting on an inittab child to exit, and restarting it... very quickly. I suspect this is your getty line, but don't know what your /etc/inittab looks like.

I've added debugging to init so this can be traced - set DEBUG 1 in init.c and recompile, that will show the offending line and we can debug from there. There are lots of reasons that a getty line could be failing, I'm wondering whether /dev/ttyS0 is already in use, as that would cause the open to fail.

Mellvik commented 4 years ago

That's the strange thing: I'm diagnosing this from the serial line login window. If you look closely at the screen clip you can see that there are logins (shells) @ two different terminals. Also, I'm using the distribution inittab unmodified with the exception of the comment removed for ttyS0.

I'll test more with init debug on. -Mellvik

ghaerr commented 4 years ago

Beware, 'init' may need to be debugged from the console (/dev/tty1). There is some code in main() that opens DEVTTY (tty1). I don't recommend changing that to run off /dev/ttyS0, as that will cause the open to fail due to multiple opens.

If you think that perhaps /dev/ttyS0 multiple opens being prohibited might be the cause of all this mess, comment out lines 258-259 in elks/arch/i86/drivers/char/serial.c - that will end the problem of serial line multiple opens.

Mellvik commented 4 years ago

Thanks for the hint @ghaerr. Actually a useful combination - work on the terminal, messages on the console :-)

I'll make the serial.c change to eliminate that possibility. Right now it looks like the runaway init is on physical HW only, not on VirtualBox (& serial via netcat).

—Mellvik

  1. apr. 2020 kl. 18:44 skrev Gregory Haerr notifications@github.com:

Beware, 'init' may need to be debugged from the console (/dev/tty1). There is some code in main() that opens DEVTTY (tty1). I don't recommend changing that to run off /dev/ttyS0, as that will cause the open to fail due to multiple opens.

If you think that perhaps /dev/ttyS0 multiple opens being prohibited might be the cause of all this mess, comment out lines 258-259 in elks/arch/i86/drivers/char/serial.c - that will end the problem of serial line multiple opens.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-609445816, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOBATDZ3XF6XP5ABUYLRLCYP5ANCNFSM4L3U6I7A.

Mellvik commented 4 years ago

@ghaerr:

Please post your modified getty.c here, along with exactly how you want to run it for debugging serial connections and I'll look at it.

Here we go:

diff --git a/elkscmd/sys_utils/getty.c b/elkscmd/sys_utils/getty.c
index 4403e7e5..9060b7d5 100644
--- a/elkscmd/sys_utils/getty.c
+++ b/elkscmd/sys_utils/getty.c
@@ -1,5 +1,5 @@
 /*
- * elkscmd/sysutils/getty.c
+ * elkscmd/sys_utils/getty.c
  *
  * Copyright (C) 1998 Alistair Riddoch <ajr@ecs.soton.ac.uk>
  *
@@ -67,7 +67,7 @@

 char * progname;
 char   Buffer[64];
-int    ch, col = 0, fd;
+int    ch, col = 0, fd, tty;

 void consolemsg(const char *str, ...)
 {
@@ -214,21 +214,48 @@ int main(int argc, char **argv)
     int n;
     speed_t baud = 0;
     struct termios termios;
+    pid_t ppid;

     progname = argv[0];
     signal(SIGTSTP, SIG_IGN);      /* ignore ^Z stop signal*/

-    if (argc < 2 || argc > 3) {
+    ppid = getppid();          /* if parent is init, usage() makes no sense */
+    if (ppid > 1 && (argc < 2 || argc > 3)) {
    consolemsg("Usage: %s device [baudrate]\n", argv[0]);
    exit(3);
     }

-    if (argc == 2) debug("'%s'\n", argv[1]);
-    else if (argc == 3) {
+    if (argc == 2) {
+   debug("info -- arg 2 is '%s'\n", argv[1]);
+    } else if (argc == 3) {
    baud = atol(argv[2]);
    debug("'%s' %ld\n", argv[1], baud);
     }

+    if (ppid > 1) {
+#if DEBUG
+       tty = open(argv[1], O_RDWR);
+       if (tty < 0) {
+       consolemsg("cannot open terminal %s\n", argv[1]);
+       exit(3);
+       }
+
+       debug("reopen %s\n", argv[1]);
+       close(0); close(1); close(2); /* close inherited stdio */
+       if (dup2(tty, STDIN_FILENO) != STDIN_FILENO ||  \
+       dup2(tty, STDOUT_FILENO) != STDOUT_FILENO ||  \
+       dup2(tty, STDERR_FILENO) != STDERR_FILENO) {
+       consolemsg("Cannot reopen stdio - error %d\n", errno);
+       exit(3);
+   }
+#else
+   fputs("Enable DEBUG to start getty from the command line\n", stderr);
+       /*close(tty);       crashes getty */
+   exit(1);    
+
+#endif /* DEBUG */
+    }
+
     fd = open(ISSUE, O_RDONLY);
     if (fd >= 0) {
    put('\n');
@@ -303,7 +330,10 @@ int main(int argc, char **argv)
                state(Host);
                break;
            case 'L':           /* Line used */
-               if (argc > 1) {
+               if (argc > 1) { /* this test is bull since 
+                        * we won't get here unless
+                        * argc > 1 
+                        */
                ptr = rindex(argv[1],'/');
                if (ptr == NULL)
                    ptr = argv[1];
@@ -347,6 +377,7 @@ int main(int argc, char **argv)

     /* setup tty termios state*/
     baud = convert_baudrate(baud);
+    debug("setting baudrate %d\n", baud);
     if (tcgetattr(STDIN_FILENO, &termios) >= 0) {
         termios.c_lflag |= ISIG | ICANON | ECHO | ECHOE | ECHONL;
         termios.c_lflag &= ~(IEXTEN | ECHOK | NOFLSH);
@@ -369,7 +400,7 @@ int main(int argc, char **argv)
    state("login: ");
    n=read(STDIN_FILENO,Buffer,sizeof(Buffer)-1);
    if (n < 1) {
-       debug("read fail errno %d\n", errno);
+       debug("read fail on stdin, errno %d\n", errno);
        if (errno != -EINTR)
        exit(1);
        continue;
@@ -378,6 +409,7 @@ int main(int argc, char **argv)
    while (n > 0)
        if (Buffer[--n] < ' ')
        Buffer[n] = '\0';
+   debug("Calling login: %s\n", Buffer);
    if (*Buffer) {
        char *nargv[3];

--Mellvik

ghaerr commented 4 years ago

Thanks @Mellvik.

Other than the debug statements to see what is going on, these mods are only to get getty to run from the shell, correct?

No modifications are needed to run on ttyS0 from init?

Mellvik commented 4 years ago

Yes, that's right @ghaerr.

There is also a message if started from the command line with DEBUG off, informing that DEBUG needs too be set in order to run like that.

—Mellvik

  1. apr. 2020 kl. 16:32 skrev Gregory Haerr notifications@github.com:

Thanks @Mellvik https://github.com/Mellvik.

Other than the debug statements to see what is going on, these mods are only to get getty to run from the shell, correct?

No modifications are needed to run on ttyS0 from init?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-610421903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOHQP42XRUPGTP3AIYLRLM2QLANCNFSM4L3U6I7A.

ghaerr commented 4 years ago

You want the additional capability to run from the shell permanent, right?

I'll test it a bit, and create a PR for it. There's a couple small issues, and DEBUG shouldn't be required to run it from the shell, if it issues appropriate error messages, agree?

Mellvik commented 4 years ago

Yes, I agree. I misread your previous mail. I can make that change if you'd like.

—M

  1. apr. 2020 kl. 16:54 skrev Gregory Haerr notifications@github.com:

You want the additional capability to run from the shell permanent, right?

I'll test it a bit, and create a PR for it. There's a couple small issues, and DEBUG shouldn't be required to run it from the shell, if it issues appropriate error messages, agree?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-610434815, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOC5HRC6PSI32SEVV33RLM5EHANCNFSM4L3U6I7A.

ghaerr commented 4 years ago

I'll put up a PR. There's still some bugs in your version, for instance, even though 'usage' is pointless when called from init, 'exit' still has be called. I want to review all the debug messages also, so that when we have to debug this again, it can be used!

I'm not completely happy with the debug messages from 'init' yet, but we're making progress. A system or console log would be nice, possibly combined with a kernel printk log using 'mesg', but we'll leave that for later.

Mellvik commented 4 years ago

OK. And yes, a printk would be nice at some point.

—M

  1. apr. 2020 kl. 17:17 skrev Gregory Haerr notifications@github.com:

I'll put up a PR. There's still some bugs in your version, for instance, even though 'usage' is pointless when called from init, 'exit' still has be called. I want to review all the debug messages also, so that when we have to debug this again, it can be used!

I'm not completely happy with the debug messages from 'init' yet, but we're making progress. A system or console log would be nice, possibly combined with a kernel printk log using 'mesg', but we'll leave that for later.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-610447662, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOENQWPQT5VQEJN6PQ3RLM7XRANCNFSM4L3U6I7A.

Mellvik commented 4 years ago

New serial related issue, @ghaerr: ls -l /bin|more works fine on the console, not via the serial line. After the first page (first --More--), more(1) fails: more: : Bad file number

Repeatable on physical and virtual (Virtuabox).

Buffering issue?

--Mellvik

ghaerr commented 4 years ago

After the first page (first --More--), more(1) fails:

IIRC, you added an explicit open of /dev/tty to more, and my implementation of that could be buggy. Try reading from stdin or 0 instead and see whether that helps. I'm not sure the open of /dev/tty should occur in more anyways, that's reserved normally for input that absolutely should not be redirected, like typing in passwords.

ghaerr commented 4 years ago

you added an explicit open of /dev/tty to more

Opening /dev/tty is definitely the problem. It is failing and there is no check for -1. Try removing that extra open and using fd '1' instead and test it on serial and console. We can't use 0 as I suggested in my last comment, as more may be taking input from a pipe. But if the output is redirected, it shouldn't paginate anyways. You may need to remove the error printf in that case. After testing that case and the normal use case on console and serial, submit a PR, thanks!

There is also my bug that /dev/tty doesn't work on serial, but I can't fix that until I can figure out how to get serial working on QEMU. I also can't figure out how to attach an ELKS image to VirtualBox nor serial on it either.

Mellvik commented 4 years ago

@Ghaerr,

Yes, you're right, open /dev/tty does fail - but works in other cases. I'll take a look at the logic as well.

—M

  1. apr. 2020 kl. 21:36 skrev Gregory Haerr notifications@github.com:

you added an explicit open of /dev/tty to more

Opening /dev/tty is definitely the problem. It is failing and there is no check for -1. Try removing that extra open and using fd '1' instead and test it on serial and console. We can't use 0 as I suggested in my last comment, as more may be taking input from a pipe. But if the output is redirected, it shouldn't paginate anyways. You may need to remove the error printf in that case. After testing that case and the normal use case on console and serial, submit a PR, thanks!

There is also my bug that /dev/tty doesn't work on serial, but I can't fix that until I can figure out how to get serial working on QEMU. I also can't figure out how to attach an ELKS image to VirtualBox nor serial on it either.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-610580401, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOHM6JV2YRYF3DUJDYLRLN6EFANCNFSM4L3U6I7A.

ghaerr commented 4 years ago

I added #526 that should allows getty to run from the shell.

Are there any other serial login problems other than elvis not running at this point? I'd like to close this and start new issues so things don't get too complicated. I have elvis on my list for when I can duplicate it.

Mellvik commented 4 years ago

Only stty - which I'm working on right now.

We don't need to keep this issue open for that.

—Mellvik

  1. apr. 2020 kl. 18:13 skrev Gregory Haerr notifications@github.com:

I added #526 https://github.com/jbruchon/elks/pull/526 that should allows getty to run from the shell.

Are there any other serial login problems other than elvis not running at this point? I'd like to close this and start new issues so things don't get too complicated. I have elvis on my list for when I can duplicate it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-611050926, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOFIHOLGMA6GZSN67JLRLSPDPANCNFSM4L3U6I7A.

ghaerr commented 4 years ago

We don't need to keep this issue open for that.

Ok - what is the difference between what you're working on for that and your separate stty issue #522?

ghaerr commented 4 years ago

Oops - misread your comment. Go ahead and close this and we can keep #522 open, thanks.

ghaerr commented 4 years ago

@Mellvik,

Now that I'm able to run ELKS from serial console and use it a bit, I've finally seen how the serial port is buggy (as referenced in #454 by @pawosm-arm). When regularly typing commands into the shell, I notice that on occasion it appears my keyboard misses a keystroke. What's really happening is that the serial driver is dropping an input character. Have you experienced this?

I don't know where the problem is yet, but this happens quite often when typing.

Mellvik commented 4 years ago

That's great news, @ghaerr. The answer is no, I have not experienced character loss at all. Not on virtualbox, not on physical. I'm running physical at 19200. That said, I have not pushed any limits. By experience, 19200 is the highest the klunkers can go w/o hw handshake, but I'm assuming you're on virtual....

Typing never looses any input... (is there another process killing the system?)

-M

  1. apr. 2020 kl. 16:48 skrev Gregory Haerr notifications@github.com:

 @Mellvik,

Now that I'm able to run ELKS from serial console and use it a bit, I've finally seen how the serial port is buggy (as referenced in #454 by @pawosm-arm). When regularly typing commands into the shell, I notice that on occasion it appears my keyboard misses a keystroke. What's really happening is that the serial driver is dropping an input character. Have you experienced this?

I don't know where the problem is yet, but this happens quite often when typing.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ghaerr commented 4 years ago

It sounds to me you have a qemu problem!!

I don't think this is a qemu problem. Yes, it could be another process is eating keystrokes, but this has never happened before, and it doesn't happen when typing the first character, only when typing quickly and only occasionally. You may not notice it if you don't type quickly, and at first, I thought I had not hit the keyboard hard enough, and just retyped the character.

I'm debugging all of it now, I think it's an interrupt servicing issue that also is the reason for #454 losing serial mouse data. We'll see.

Turns out /bin/init has a huge bug while parsing /etc/inittab - when a getty line happens to cross a 256-byte boundary in /etc/inittab, the bug shows up as firing off 3 serial getty's at the same time. It's verified and has been happening the whole time, which is why ps still showed processes running occasionally - these were failed getty's because the serial port couldn't be opened twice. I'm fixing all of that too.

I've got most of it all working, including the /dev/tty bug. But running vi prints screen size too small and crashes the system. Very strange!

ghaerr commented 4 years ago

@Mellvik,

PR #532 fixes all issues brought up here. Please test, hopefully we can close this finally!

ghaerr commented 4 years ago

Hello @Mellvik and @georgp24,

I just tested running the shell on the serial port (ttyS0) after the 'linenoise' enhancements were added with PR #536, and the shell prompt no longer displays on a new line when Enter is typed. So, you'll need to set LINENOISE to 0 in ash/shell.h until this is debugged. I haven't had time to look at the behavior on /dev/tty1, will fully review but may not have time this weekend.

ghaerr commented 4 years ago

Hi @Mellvik and @pawosm-arm,

Just after finishing the latest serial/tty enhancement in PR #553, I realized a big reason ELKS can't keep up with serial input - when in "raw" mode, read will never return the asked-for byte count, but instead always returns just a single byte (or nothing, for when there is no data). This means that the mouse program can't ever catch up to moving the mouse around a lot, nor can cat fill its 4096-byte buffer each read. Both cat and mouse (no pun intended, really) always get a single byte regardless of what they ask for. Thus, it is no surprise that the 512-byte kernel tty input queues drop data, as it is taking forever to get the received data from kernel to user programs.

I am still very interested in testing the new driver for "overrun" kernel messages, as that indicates the problem that the interrupt service routine can't empty the UART fast enough into the kernel input queue.

I hope to have another PR that correctly implements read shortly, which should increase throughput heavily.

Mellvik commented 4 years ago

@ghaerr, FWIW - vi() has been working fine on serial since the 'screen too small' problem was solved.

--M

ghaerr commented 4 years ago

vi() has been working fine on serial since the 'screen too small' problem was solved.

Sounds like you're running a 25x80 emulation. PR #553 allows querying of larger window sizes; e.g. 60 x 90 window on using macOS Terminal. If the terminal emulator doesn't support the screen size (DSR) request, it reverts back to guessing 25x80.

Mellvik commented 4 years ago

I do (25x80) - and thanks for the improvement.

—M

  1. apr. 2020 kl. 17:06 skrev Gregory Haerr notifications@github.com:

vi() has been working fine on serial since the 'screen too small' problem was solved.

Sounds like you're running a 25x80 emulation. PR #553 https://github.com/jbruchon/elks/pull/553 allows querying of larger window sizes; e.g. 60 x 90 window on using macOS Terminal. If the terminal emulator doesn't support the screen size (DSR) request, it reverts back to guessing 25x80.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-614095608, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOECKZW5PNNSJ4F35RDRMXEP5ANCNFSM4L3U6I7A.

Mellvik commented 4 years ago

init() related observations:

  1. If a serial line is enabled in inittab and the device does not exist, init loops, seemingly continuously trying to open that line. There are no messages on the console. (Physical machine and VirtualBox).
  2. If a serial line is enabled in inittab, and does exist, but is not available (example: VirtualBox, serial is opening a socket and waiting for a connect, from, say, netcat), init hangs.

--Mellvik

Mellvik commented 4 years ago

While continuing serial testing (which has involved some hardware debugging recently), I propose the following changes: In irqtab.S - it may make sense to allocate irq5 to COM3 if the XT HD is not present:

--- a/elks/arch/i86/kernel/irqtab.S
+++ b/elks/arch/i86/kernel/irqtab.S
@@ -27,8 +27,12 @@ ds_kernel:
 #else
 #define MCD 0
 #endif
-#ifdef CONFIG_CHAR_DEV_RS
-#define MRS 0x18
+#if defined(CONFIG_CHAR_DEV_RS)
+# if defined(CONFIG_BLK_DEV_HD)
+# define MRS 0x18
+# else
+# define MRS 0x38 /* add 3rd COM line interrupt */
+# endif
 #else
 #define MRS 0
 #endif
@@ -127,8 +131,8 @@ _irq4:                      // COM1
        call    _irqit
        .byte   4
 #endif
-#ifdef CONFIG_BLK_DEV_HD
-_irq5:                 // XT HD
+#if defined(CONFIG_BLK_DEV_HD) || defined(CONFIG_CHAR_DEV_RS)
+_irq5:                 // XT HD or COM 3
        call    _irqit
        .byte   5
 #endif

Also, in serial.c the chip detection logic from wikibooks has a bug that prevents it for recognizing the 16550. It's worth noting that the 16550 chip has a FIFO, but its functionally unstable and cannot be used. So, IIR bit 7 indicates the presence of a FIFO, bit 6 indicates that the FIFO is useable (16550A and up), bit 5 means 16570. I propose the following diff to take care of this:

--- a/elks/arch/i86/drivers/char/serial.c
+++ b/elks/arch/i86/drivers/char/serial.c
@@ -62,7 +62,6 @@ static struct serial_info ports[NR_SERIAL] = {
 static char irq_port[NR_SERIAL] = { 3, 1, 0, 2 };

 static unsigned int divisors[] = {
-    0,                         /*  0 = B0      */
     2304,                      /*  1 = B50     */
     1536,                      /*  2 = B75     */
     1047,                      /*  3 = B110    */
@@ -115,12 +114,14 @@ static int rs_probe(register struct serial_info *sp)

     /* then read FIFO status*/
     status = inb_p(sp->io + UART_IIR);
-    if (status & 0x40) {
-       if (status & 0x80) {            /* FIFO enabled*/
-           if (status & 0x20)          /* 64 byte FIFO enabled*/
-               type = ST_16750;
-           else type = ST_16550A;
-       } else type = ST_16550;
+    //printk("stat 0x%02x ", status);
+    if (status & 0x80) {               /* FIFO available*/
+       if (status & 0x20)              /* 64 byte FIFO enabled*/
+           type = ST_16750;
+       else if (status & 0x40)         /* 16 byte FIFO OK */
+           type = ST_16550A;
+        else 
+           type = ST_16550;            /* Non-functional FIFO */
     } else {
        /* no FIFO, try writing arbitrary value to scratch reg*/
        outb_p(0x2A, sp->io + UART_SCR);
@@ -128,6 +129,7 @@ static int rs_probe(register struct serial_info *sp)
            type = ST_16450;
        else type = ST_8250;
     }
+    //printk("\n");
     sp->flags = SERF_EXIST | type;

     /*
@@ -367,7 +369,7 @@ static int rs_open(struct tty *tty)
 #define UART_FCR_ENABLE_FIFO4  (UART_FCR_ENABLE_FIFO | 0x40)
     /* enable FIFO with 8 byte trigger */
 #if FIFO
-    if ((port->flags & SERF_TYPE) > ST_16450)
+    if ((port->flags & SERF_TYPE) > ST_16550)
        outb_p(UART_FCR_ENABLE_FIFO8, port->io + UART_FCR);
 #endif

Finally, the marking on the chips may be 16550 even if it's really a 16550A, which makes the chip detection logic more crucial.

--Mellvik

pawosm-arm commented 4 years ago

In irqtab.S - it may make sense to allocate irq5 to COM3 if the XT HD is not present:

Nah, IRQ lines is a precious resource on every XT. And even if XT-IDE I'm using inside of both of my XT's does not claim any IRQ line, my Turbo-XT has all of them occupied as such: IRQ 0 - timer IRQ 1 - kbd IRQ 2 - network card IRQ 3 - 8250 COM2: IRQ 4 - 8250 COM1: IRQ 5 - SoundBlaster card IRQ 6 - FDD IRQ 7 - LPT1: Fortunately still, Paradise VGA also does not claim any IRQ.

For me, having hardware flow control is now the main feature missing in serial port driver.

(note that I can't use this XT beast for testing nowadays due to the lockdown. for now, Amstrad PC2086 is the only machine I can run ELKS on and it has only one serial port).

Mellvik commented 4 years ago

@pawosm-arm,

the proposed change doesn't in any way interfere with your requirements. It just changes the default config of IRQ5 from (effectively) 'unused' to 'something'. That 'something' is - incidentally - in line with your priorities: serial flow control.

-M

  1. apr. 2020 kl. 15:23 skrev pawosm-arm notifications@github.com:

In irqtab.S - it may make sense to allocate irq5 to COM3 if the XT HD is not present:

Nah, IRQ lines is a precious resource on every XT. And even if XT-IDE I'm using inside of both of my XT's does not claim any IRQ line, my Turbo-XT has all of them occupied as such: IRQ 0 - timer IRQ 1 - kbd IRQ 2 - network card IRQ 3 - 8250 COM2: IRQ 4 - 8250 COM1: IRQ 5 - SoundBlaster card IRQ 6 - FDD IRQ 7 - LPT1: Fortunately still, Paradise VGA also does not claim any IRQ.

For me, having hardware flow control is now the main feature missing in serial port driver.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-614650078, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGODS7GZGW2A56LNWTPLRM4BGVANCNFSM4L3U6I7A.

pawosm-arm commented 4 years ago

That only shows how rigid current hardware specification options are. I guess a kind of source ports configuration table specifying per-serial-port I/O ports and IRQs in kernel config (similar to the one that can be filled for floppy disk drives when BIOS autodetect is disabled) would be more beneficial.

Mellvik commented 4 years ago

Which brings us right back to the discussion in #372.

ghaerr commented 4 years ago

In irqtab.S - it may make sense to allocate irq5 to COM3 if the XT HD is not present:

@Mellvik - is that the change you had to make to get COM3 to work? And did the solve the problem of getting that card to work with your original COM1 card?

BTW, "XT hard disk support", known now as "Experimental HW IDE hard disk support" is entirely broken. What I don't know is whether the BIOS uses IRQ 5 when asked to do hard disk I/O.

a kind of source ports configuration table specifying per-serial-port I/O ports and IRQs in kernel config would be more beneficial.

Agreed that since COM3, XT HD, and possibly future sound card may use IRQ 5, the irqtab.S modification shouldn't assume COM3 IRQ 5 when CONFIG_CHAR_DEV_RS is set. I'm thinking that a new config option specifying where to allocate IRQ 5 would be a better way to go.

Is there a need to reallocate the other IRQs in @pawosm-arm's list? Allowing a general re-allocation of any IRQ could be a lot of work, it seems only IRQ 5 is used for multiple devices generally.

Mellvik commented 4 years ago

@ghaerr,

  1. apr. 2020 kl. 17:55 skrev Gregory Haerr notifications@github.com:

In irqtab.S - it may make sense to allocate irq5 to COM3 if the XT HD is not present:

@Mellvik https://github.com/Mellvik - is that the change you had to make to get COM3 to work? And did the solve the problem of getting that card to work with your original COM1 card?

Yes, it did (I had a system board conflict too, long story). Now 3 serial ports run fine except for the known performance issues. BTW, "XT hard disk support", known now as "Experimental HW IDE hard disk support" is entirely broken. What I don't know is whether the BIOS uses IRQ 5 when asked to do hard disk I/O.

I'm experimenting in order to determine this right now. The system feels slower when IRQ 5 is reallocated, but the clunkers are so slow anyway, that it's hard to determine. More on that later. a kind of source ports configuration table specifying per-serial-port I/O ports and IRQs in kernel config would be more beneficial.

Agreed that since COM3, XT HD, and possibly future sound card may use IRQ 5, the irqtab.S modification shouldn't assume COM3 IRQ 5 when CONFIG_CHAR_DEV_RS is set. I'm thinking that a new config option specifying where to allocate IRQ 5 would be a better way to go.

Agree. There are many more or less esoteric uses for IRQ 5 and others that may be accommodated that way. Is there a need to reallocate the other IRQs in @pawosm-arm https://github.com/pawosm-arm's list? Allowing a general re-allocation of any IRQ could be a lot of work, it seems only IRQ 5 is used for multiple devices generally.

ghaerr commented 4 years ago

the chip detection logic from wikibooks has a bug that prevents it for recognizing the 16550. It's worth noting that the 16550 chip has a FIFO, but its functionally unstable and cannot be used.

Interesting. Can you point me to other information I can read about this?

So, IIR bit 7 indicates the presence of a FIFO, bit 6 indicates that the FIFO is useable (16550A and up), bit 5 means 16570.

Did you test this change, or are both just working from theory here? It seems the wikibooks article was written by a knowledgeable guy, and he used a tricky method, rather than the normal bit settings, for his detection algorithm.

Are we certain that the 16550 FIFO doesn't work? That's easily changed by testing for "> ST_16550", but we have yet to show that turning on the FIFO makes any difference. Correct?

What happens if you enable the FIFO on the 16550, anything bad? Presently, no serial driver code depends on the proper recognition after just enabling the FIFO, so I don't know how much this matters.

ghaerr commented 4 years ago

If a serial line is enabled in inittab and the device does not exist, init loops

Fixing this has been on my list for awhile. I'll take a pass at fixing it by having init not respawn again if it fails to open the device before running getty.

If a serial line is enabled in inittab, and does exist, but is not available (example: VirtualBox, serial is opening a socket and waiting for a connect, from, say, netcat), init hangs.

This is trickier, since there is no ELKS protection from serial ports used by multiple processes, although that may not be the problem. We could make the same change as above, to not respawn if the line open fails in init. I've never seen init hang, only loop repeatedly. You'll have to test with VB, I don't know where exactly init is hanging.

Finally, a good reason for init opening the device prior to calling getty! This would basically happen once at system startup, and allow an /etc/inittab for both tty1 and ttyS0 (and ttyS1)?

Mellvik commented 4 years ago

I agree, the wikibooks article is great, he's just somewhat unclear on this issue (in addition to the bug in the code which is an easy oversight). He says 'FIFO enabled but not functioning' - this is the 16550. Knowing this, and following the logic in the code, you see that it will never hit the 16550, because he's testing on bit 6, not bit 7.

Here's a quote from the 16550 manual that came with one of my serial cards: "The primary difference between these two parts (16550, 16550A) is in the operation of the FIFOs. The NS16550 will sometimes transfer extra characters when the CPU reads the RX FIFO. Due to the asynchronous nature of this failure, there is no workaround and the NS16550 should NOT be used in FIFO mode. The NS16550A has no problems operating in the FIFO mode and should be used on all new designs. The programmer should not the difference in the function of bit 6 in the Interrupt Identification Register (IIR). This bit is permanently at logical 0 in the NS16550. In the 16550A this bit will be set to 1 when the FIFOs are enabled."

The serial card I am using currently has two chips marked 16550, the identify (per the above) as 16550A. I have yet to test the operation of the FIFO, it's on the agenda.

—Mellvik

  1. apr. 2020 kl. 18:03 skrev Gregory Haerr notifications@github.com:

the chip detection logic from wikibooks has a bug that prevents it for recognizing the 16550. It's worth noting that the 16550 chip has a FIFO, but its functionally unstable and cannot be used.

Interesting. Can you point me to other information I can read about this?

So, IIR bit 7 indicates the presence of a FIFO, bit 6 indicates that the FIFO is useable (16550A and up), bit 5 means 16570.

Did you test this change, or are both just working from theory here? It seems the wikibooks article was written by a knowledgeable guy, and he used a tricky method, rather than the normal bit settings, for his detection algorithm.

Are we certain that the 16550 FIFO doesn't work? That's easily changed by testing for "> ST_16550", but we have yet to show that turning on the FIFO makes any difference. Correct?

What happens if you enable the FIFO on the 16550, anything bad? Presently, no serial driver code depends on the proper recognition after just enabling the FIFO, so I don't know how much this matters.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/515#issuecomment-614743805, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOG7SLIWQ7BFB242TGLRM4T6LANCNFSM4L3U6I7A.