mikaku / Fiwix

A UNIX-like kernel for the i386 architecture
https://www.fiwix.org
Other
401 stars 32 forks source link

Fix fbconsole scrolling bug with vi #81

Closed ghaerr closed 5 months ago

ghaerr commented 5 months ago

Fixes incorrect framebuffer console last line displayed incorrectly when scrolling screen upward using ^D with vi, described in https://github.com/mikaku/Fiwix/issues/78#issuecomment-1953456433.

vi uses scrolling regions and the fbcon code didn't clear the last scrolled up line properly when a scrolling region was set.

During testing, another probably worse kernel problem was seen: if ^D is typed in rapid succession, say while scrolling /etc/termcap, vi output becomes scrambled. Not sure what this may be, but first guess is that it appears that possibly the write system call output (from vi to the console) becomes unsynchronized. This will require lots more testing to determine, but it is easily duplicated (both before and after this PR).

mikaku commented 5 months ago

Thank you for fixing this annoying bug!

During testing, another probably worse kernel problem was seen: if ^D is typed in rapid succession, say while scrolling /etc/termcap, vi output becomes scrambled. Not sure what this may be, but first guess is that it appears that possibly the write system call output (from vi to the console) becomes unsynchronized. This will require lots more testing to determine, but it is easily duplicated (both before and after this PR).

Yes, you're right. I think this is because the slow nature of the framebuffer graphics. It seems that with quick ^D typing the keyboard executes a new scroll function instance while the previous one didn't finalize yet.

I think that we need to introduce a lock mechanism to discard entering to the scroll function if it's already running. In this case is better to discard a ^D than wait until the resource is ready.

I'll take a look to it.

ghaerr commented 5 months ago

I think this is because the slow nature of the framebuffer graphics. It seems that with quick ^D typing the keyboard executes a new scroll function instance while the previous one didn't finalize yet.

I don't understand - isn't the kernel non-reentrant, and all application write calls complete or at least are queued properly before returning? Does fbcon run as a bottom-half task, then? I obviously don't know how Fiwix works in this regard.

In this case is better to discard a ^D than wait until the resource is ready.

I would disagree, dropping output is pretty annoying to users. If the lock can cause the application to sleep until the resource is ready, IMO it would be better. There are already enough UNIX problems with write not returning the passed count, and having to wrap it if less than the passed byte count is written.

mikaku commented 5 months ago

I don't understand - isn't the kernel non-reentrant, and all application write calls complete or at least are queued properly before returning? Does fbcon run as a bottom-half task, then? I obviously don't know how Fiwix works in this regard.

Screen scrolling is asynchronous because it's executed in the bottom half of the keyboard interrupt:

https://github.com/mikaku/Fiwix/blob/7fec9b0e336e39cf3bac113296ddd7757d583148/drivers/char/keyboard.c#L747-L750

Bottom halves execute with interrupts on because they have low priority. So that would explain why multiple ^D or ^U might mess the screen if the framebuffer is slow.

Also, we must avoid sleeping on a bottom half, that's the reason why extra scrolling requests must be discarded.

ghaerr commented 5 months ago

Bottom halves execute with interrupts on because they have low priority. So that would explain why multiple ^D or ^U might mess the screen if the framebuffer is slow.

I see - well that definitely explains why the display is messed up.

Also, we must avoid sleeping on a bottom half,

Got that.

that's the reason why extra scrolling requests must be discarded.

I like your idea of a lock while scrolling is in progress, but frankly don't understand why scrolling isn't just performed right away... discarding console output cannot be allowed for proper operation, so why not just wait for the scrolling to occur without any later execution? What would be the reason for delaying a scroll, given that it has to be done anyways, and thus introduces a complex synchronization issue to keep console display correct?

mikaku commented 5 months ago

but frankly don't understand why scrolling isn't just performed right away... discarding console output cannot be allowed for proper operation, so why not just wait for the scrolling to occur without any later execution? What would be the reason for delaying a scroll, given that it has to be done anyways, and thus introduces a complex synchronization issue to keep console display correct?

Interrupt handlers must be quick, just do a small thing and exit quickly because during this time all interrupts are blocked. That's the reason why exists the bottom halves. This is well explained in the book Understanding the Linux kernel (First edition, based on the Linux kernel 2.2).

If you do something long like scrolling the screen on a slow framebuffer inside an interrup handler, then your system will lost interrupts, will respond slowly and you'll lost completely the multitasking benefits.

During the disk I/O happens the same. The disk interrupt handler just awakes the process sleeping in the queue for a transfer and returns immediately. Then at a later time (in processor time) the bottom half will copy the buffer into the user space of the process.

In Fiwix though the later not occurs like this, it happens in the context of the process. That's one of the things I planned to change. I'd like to start using I/O queues because this hopefully will speed up disk accesses.

ghaerr commented 5 months ago

Thank you for your explanation. I completely understand the issues around hardware interrupt handling and how they need to be quick and not sleep, etc.

What I don't get yet is why the framebuffer itself (scrolling or just outputting characters) is an interrupt-driven operation. That is, what is the interrupt that occurs that requires framebuffer to be interrupt-driven? For serial output, the UART THRE register requires an interrupt when empty, for example. I am probably thinking more in line with ELKS, where the console output occurs in the context of the process doing the write, and there are no interrupts (other than the keyboard interrupt, of course) that affect console display. The current process waits until the display has been updated before returning from the syscall.

Nonetheless, it should be straightforward to sleep any process that calls code setting do_buf_scroll until it is zero, then have the bottom half wake up any waiting processes after the scroll completes, right?

I'd like to start using I/O queues because this hopefully will speed up disk accesses.

I might be starting to understand. In ELKS, when an interrupt occurs, the kernel is reentered with interrupts ENABLED, and the interrupt handler can disable interrupts if required. After the interrupt handler complete, no task switch is allowed if the kernel was reentered, but task switches are allowed if not (that is, the interrupt occurred from user mode). It sounds like with Fiwix interrupt handlers execute with interrupts disabled, and a BH has to be scheduled to do anything not strictly required at interrupt time?

In the current case of disk I/O complete, if the interrupt handling routine is running with interrupts enabled, can't it start the next disk I/O request (using CLI/STI when removing queue elements, etc) and then copy the results of the previous I/O without having to use a BH routine or I/O queues? I suppose not, if the interrupt routine is CLI and a BH is normally scheduled.

I am guessing that the Fiwix/Linux kernel interrupt handling is a bit different than what I am used to with ELKS. Sorry for all my confusion.

mikaku commented 5 months ago

You were right from the beginning. I misunderstood and confused you with my explanations about the scroll back console history, which is different than pressing ^D and ^U under Vim. The scroll back console history works by pressing Shift+PgUp and Shift+PgDn and indeed these must be executed asynchronously because this is a kernel feature.

What you were talking about is a user screen scrolling generated by different means, depending of the program that is handling your tty session.

I can't remember now the whole path that a simple ^D does inside the kernel. But I will think about it and why multiples ^D can mess the screen on a slow framebuffer.

Thanks for you keen eyes, and sorry for including confusion from my part.

ghaerr commented 5 months ago

During testing, another probably worse kernel problem was seen: if ^D is typed in rapid succession, say while scrolling /etc/termcap, vi output becomes scrambled.

OK, I deep-dived into this in a big way looking to see what might be the problem with the Fiwix framebuffer console display, after including the scroll region fix in this PR.

I chased down the following areas in the kernel, changed lots of things, learned lots of things, and found all these possible problem areas are working correctly:

I finally came to thinking that perhaps the problem isn't in the framebuffer console at all. And then progress was made. It was learned that vi is very tricky when it comes to keyboard input - if, during screen output other character is typed, for instance another ^D, vi somehow realizes this and starts redrawing other areas of the screen in order to speed the scroll up. I also learned that one can slow down vi output for debug purposes by turning on :set writedelay=1 (or :set wd=1).

After turning on vi write delay, I noticed that during scrolling, vi catches ^C and will accurately abort the current output, displaying an interrupt message on the bottom line, then continue filling in the scrolled region afterwards. This display also was sometimes corrupted.

Most important of all, I found that by setting the write delay to a small value (=1 microsecond), the scroll display bug completely disappears, with a minimal effect on scroll speed!!!

Doing some research on vi, I found here, here, and here that VIM 7.4 (used on Fiwix also) has known scroll glitch problems with ^D/^U. (None of them had a solution though). It is possible that a later version of VIM, perhaps 8.2+ does not have this problem (that's the version I have on macOS and it seems to work).

As a result of way too much testing, I now firmly believe this last remaining scrolling problem is not in Fiwix but in VIM. To fix it, the following line should be added to the next Fiwix distribution.

In file /usr/share/vim/vim74/vimrc, add the following line:

set writedelay=1     " fix vim 7.4 ^D/^U/^C scroll bug

Here's a screenshot of the working vimrc file (line 13). Note that /etc/vimrc apparently isn't being used, but it should be set the same way IMO:

vim fix

Conclusion: with this vimrc fix and this PR, no further console display problems can be seen.

mikaku commented 5 months ago

Wow!, you did an incredible effort here.

Yesterday night I went to sleep tracing mentally several times all the path that a ^D keystroke would do inside the kernel. And on every time I was trying to imagine if indeed there is some bug in the kernel that makes ^D somehow reentrant, and hence generates all the mess in graphics.

Today, the first thing that came up to my mind right before I put my foot out of the bed was the same keystroke tracing. I was really very concerned on that it morning, until I read your message.

I finally came to thinking that perhaps the problem isn't in the framebuffer console at all. And then progress was made. It was learned that vi is very tricky when it comes to keyboard input - if, during screen output other character is typed, for instance another ^D, vi somehow realizes this and starts redrawing other areas of the screen in order to speed the scroll up. I also learned that one can slow down vi output for debug purposes by turning on :set writedelay=1 (or :set wd=1).

Certainly, I would never thought that vi would be the cause of all this mess.

Most important of all, I found that by setting the write delay to a small value (=1 microsecond), the scroll display bug completely disappears, with a minimal effect on scroll speed!!!

Can't believe it!

As a result of way too much testing, I now firmly believe this last remaining scrolling problem is not in Fiwix but in VIM. To fix it, the following line should be added to the next Fiwix distribution.

Sure it will be. I'm taking notes on my ToDo list.

Conclusion: with this vimrc fix and this PR, no further console display problems can be seen.

You are a wizard.

ghaerr commented 5 months ago

tracing mentally several times all the path that a ^D keystroke would do inside the kernel. And on every time I was trying to imagine if indeed there is some bug in the kernel that makes ^D somehow reentrant, and hence generates all the mess in graphics.

I was at first concerned about whether the kernel might allow another write from vi (of display data) getting ahead of a previous write, before the first one completes (never happens). Then I was concerned that some of the display output might be getting discarded (nope). Thus the kernel deep dive through the write output process.

But after taking a very close look at vi output with write delay set, I imagined something else. Here is a guess at what I think is happening:

When vi does scrolling output, it divides the output into two writes: the initial scroll (using either a scroll region and/or cursor position and line feeds or reverse line feeds) which opens up space, and then the followup output which fills in that blank space. Immediately after the scroll, vi checks for more keyboard input (by polling), and if another scroll up/down is seen, instead of following up with the normal second line fill writes, it starts again and issues another scroll region output, thus scrolling faster, as the slower line-filling output is further delayed. Much the same happens with ^C, where the output is temporarily stopped (possibly using a TCFLUSH drain function, not sure), a status line shown, then screen filling afterwards.

I think the bug is that vi's attempt at being very tricky with multiplexing its output only works if the ^D/^U scroll command input is slow, that is, the second ^D comes well after first scroll write output. You can see this happening if you slow things down enough; when the bug occurs it always occurs in exactly the same way, with the right hand bottom cursor ruler x,y position getting scrolled improperly. This consistency was what led me to believe it was not the kernel after all!

I'm sure this could all be decided for sure by diving into the VI source code, but I think I'll just leave that alone and do something more useful :)

mikaku commented 5 months ago

When vi does scrolling output, it divides the output into two writes: the initial scroll (using either a scroll region and/or cursor position and line feeds or reverse line feeds) which opens up space, and then the followup output which fills in that blank space. Immediately after the scroll, vi checks for more keyboard input (by polling), and if another scroll up/down is seen, instead of following up with the normal second line fill writes, it starts again and issues another scroll region output, thus scrolling faster, as the slower line-filling output is further delayed.

Ah!, that would explain why I saw some times only half of screen (the upper normally) while scrolling fast inside Vim.

I'm sure this could all be decided for sure by diving into the VI source code, but I think I'll just leave that alone and do something more useful :)

You put an enormous effort here that will stay for posterity and might save time for other hobby OS developers in the same situation.

Good job!