mfld-fr / emu86

Intel IA16 emulator for embedded development
35 stars 6 forks source link

8254 PIT emulation problem #41

Closed ghaerr closed 3 years ago

ghaerr commented 3 years ago

Hello @mfld-fr,

While adding code for cursor positioning for ELKS BIOS console, as first step for #39, some very strange timing-related behavior has been found. I am pretty sure this is relating to the changes made in converting from the dummy IRQ 8 ELKS hack to more fully emulating the 8259 PIC.

I don't completely understand the new PIC emulation code, and have spent hours debugging this, when finally a direct hint was found: the ELKS kernel is executing printk("Unexpected interrupt: %u\n", i); in elks/arch/i86/kernel/irq.c for IRQ 0 (timer interrupt), but only under certain startup and loop timing scenarios. This improper "unexpected interrupt" seems to be occurring continuously and the bug manifests itself as the SDL and Emscripten ports never displaying an initial console on startup (all black or grey). This seems to be because he SDL subsystem is thrashed with incoming data and never gets time to update the SDL buffer.

A PR is forthcoming which will enable duplicating the problem - the PR implements the remaining BIOS console functions, but the very strange thing is that the bug only shows up when BIOS console (not Headless console) is set on ELKS. further, the problem disappears when the TIMER_MAX value is upped from 1500 to 20000 in timer-elks.c.

Development can only continue by setting all TIMER_MAX values to 20000, and as a result, the SDL and Emscripten ports are very, very slow. Ultimately, using a instruction timer loop rather than real-time may be a continuing problem for EMU86 in this area. The reason(s) why are involved, but related to the fact that ELKS Headless and BIOS consoles require kernel polling, set by a kernel timer callback, in order to read the keyboard, and this entire process, along with the timer 0 interrupt frequency, is essentially used by the SDL/Emscripten ports to pump the graphics subsystem event loop. We can discuss more on this later.

Can you check to see whether the new PIC code checks to see whether it checks the 8259 specific mask bit before generating an interrupt? This could help to solve this issue.

Thank you!

ghaerr commented 3 years ago

Here is a screenshot showing Emscripten backend, and using the INT 10h function 0x0E (text out). The actual printk may not be displayed because of SDL thrashing.

Screen Shot 2021-05-03 at 1 33 26 PM

The problem only reproduces when ELKS .config is modified to set BIOS console, AND a printk occurring after console initialization. Very strange:

diff emu86-rom.config .config
115,116c115,116
< # CONFIG_CONSOLE_BIOS is not set
< CONFIG_CONSOLE_HEADLESS=y
---
> CONFIG_CONSOLE_BIOS=y
> # CONFIG_CONSOLE_HEADLESS is not set
diff --git a/elks/init/main.c b/elks/init/main.c
index 5d1382fd..03709299 100644
--- a/elks/init/main.c
+++ b/elks/init/main.c
@@ -101,6 +101,7 @@ void INITPROC kernel_init(void)

     /* init direct, bios or headless console*/
     console_init();
+printk("12345\n");

 #ifdef CONFIG_CHAR_DEV_RS
     serial_init();

Given that the Unexpected interrupt does not occur unless the extra printk is in, I find the whole thing very strange. Thus, it is possible the problem is not related to IRQ/PIC handling, but the startup bug remains nonetheless, depending on the settings of TIMER_MAX set in timer-elks.c:

#ifdef __EMSCRIPTEN__
#define TIMER_MAX 1500
//#define TIMER_MAX 20000       // required for BIOS console
#elif SDL
#define TIMER_MAX 3000          // OK for emu86-rom.config only
//#define TIMER_MAX 20000       // required for BIOS console
#else
#define TIMER_MAX 20000
#endif

If TIMER_MAX is set above to 20000 for Emscripten, the system starts normally (and operates slow).

mfld-fr commented 3 years ago

I think all this trouble is coming not from the PIC emulation, but from the current design of the SDL console backend. Pumping the external events and pushing them to the SDL event queue, peeping the SDL event queue, and rendering characters are triggered by the BIOS INT 10h and 16h. In other words, the amount of SDL processing is driven by the code that use that interrupts, and not by the main loop.

For example, if the timer rate is increased, the ELKS BIOS console driver will query INT 16h more often, resulting in more processing in SDL keyboard emulation. Or, before ELKS setups an handler for INT8, the flood of "unexpected interrupt" print will cause more processing in SDL video emulation.

I would suggest to rework that backend to move the pumping, the peeping and the rendering in the main loop, executing that operations every N main loop iterations. That way you could better balance the processing between the instruction execution and the polling / rendering, with a fixed ratio between the timer rate and the refresh rate.

It means for example not to render the character immediately with the raster font on INT10h, but writing the character code in the emulated video memory (in 80x25 text mode as in real PC), and render that emulated video memory periodically in the main loop.

ghaerr commented 3 years ago

Thanks for your comments.

I would suggest to rework that backend to move the pumping, the peeping and the rendering in the main loop, executing that operations every N main loop iterations.

Good idea. I have done so and pushed the commit. It wasn't much change because the actual updating was already being delayed.

That way you could better balance the processing between the instruction execution and the polling / rendering, with a fixed ratio between the timer rate and the refresh rate.

Yes - this has cleaned things up considerably. The main loop now updates the SDL backend through a new con_update() function call every 10,000 instructions, and the IRQ 0/INT 8 timer now executes every 3,000 instructions.

With the above-referenced ELKS kernel printk("12345\n") IN, everything works quite well, and this fixed ratio seems to work well for stdio, sdl and browser versions, with no lag. Additionally, refactoring the code to the main loop now allows one to ^C out of the SDL version, when it appears to hang (see below), which is a big benefit.

However - with the 10,000 / 3,000 fixed ratio, and printk OUT (meaning running a non-modified ELKS kernel compiled for BIOS Console (not headless), the system now hangs in the regular stdio version!! Now, the stdio version is behaving identically to the SDL version, both appear to hang.

Very strange, but also shows the problem is not the SDL/Emscripten backend. Eliminating SDL and running stdio only, the system hangs showing the following:

Gregs-MBP2:emu86 greg$ ./elks.sh
info: load address E0000h
info: load file ../elks-gh/elks/arch/i86/boot/Image
info: file size=10000h
success: file loaded
info: load address 80000h
info: load file ../elks-gh/image/romfs.bin
info: file size=2C1A9h
success: file loaded

ELKS Setup <-- hangs here and ^C typed to exit hang
>c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c
E062:2B5C  EB FE              jmp     2b5c

Segment E062 is the kernel text segment (found when running with printk IN). Looking up 2B5C in system.map, it turns out we're in the middle of the ELKS halt() procedure. Halt is only called when the root filesystem can't be mounted, or after a call to panic, both of which should have displayed data to stdio, but didn't. I suspect the reason is that the BIOS Console hasn't been initialized, so early panics and printk don't display.

Or, before ELKS setups an handler for INT8, the flood of "unexpected interrupt" print will cause more processing in SDL video emulation.

Either we are finding a new ELKS kernel bug, or there's a problem in EMU86 IRQ emulation. It is possible that the EMU86 timing is such that the "Unexpected interrupt 0" happens, but this has never happened on real hardware. In the current fixed timing ratio, the "Unexpected interrupt 0" is no longer being displayed. We can't currently get a stack trace on EMU86, I suspect a kernel panic has occurred ? This could be caused by my first guess, which is that EMU86 is improperly generating an interrupt early on in the kernel boot. Not sure!

It means for example not to render the character immediately with the raster font on INT10h, but writing the character code in the emulated video memory (in 80x25 text mode as in real PC), and render that emulated video memory periodically in the main loop.

This isn't actually necessary, as the SDL pixel buffer is essentially saving the video memory contents and its update is already delayed until 10,000 instructions have executed. That instruction count is now fixed though, where before it was variable, which is an improvement.

ghaerr commented 3 years ago

@mfld-fr:

Here's a .zip of the Image, romfs.bin and system.map ELKS kernel that shows the problem running the pushed commit EMU86 in stdio mode. (The ELKS kernel is standard and compiled with BIOS Console, no other changes).

Archive.zip

ghaerr commented 3 years ago

Update: I have now used EMU86 stdio mode to further debug this issue. Using the above .zip image, ELKS boots if the timer countdown is increased from 3,000 to 20,000. It fails in the normal mode of 10,000 main / 3,000 timer.

The failure is occurring in start_kernel() -> init_irq() -> request_irq(TIMER_IRQ, ...) when checking to ensure the current action is "default_handler", meaning the IRQ has not been yet requested:

int request_irq(int irq, void (*handler)(int,struct pt_regs *,void *), void *dev_id)
{
    register struct irqaction *action;
    flag_t flags;

    irq = remap_irq(irq);
    if (irq < 0)
        return -EINVAL;

    action = irq_action + irq;
    if (action->handler != default_handler)
        return -EBUSY;
...
void INITPROC irq_init(void)
{
    flag_t flags;

    stop_timer();

    /* Old IRQ 8 handler is nuked in this routine */
    irqtab_init();                      /* Store DS */

    /* Set off the initial timer interrupt handler */
    if (request_irq(TIMER_IRQ, timer_tick, NULL))
        panic("Unable to get timer");

    /* Re-start the timer only after irq is set */
    enable_timer_tick();
...

The routine is returning -EBUSY, and then panic is called which calls halt() then busy loops.

I have seen this rarely before when the kernel is re-entered, sometimes through a NULL pointer jump. I can't see any reentry.

My current theory is that when the printk("12345\n") is IN, or the timer loop value is high (20,000), then the system takes long enough to initialize some variable that is otherwise causing this problem when the printk is OUT or the timer interrupt loop value is low (3,000).

Also, irqtab_init disables interrupts while copying the new interrupt table into place, then re-enables them just before returning. Inspection of the kernel code above shows there is a time when interrupts are enabled between the return of irqtab_init and the call to enable_timer_tick(), which turns on the hardware timer. A timer interrupt occurring this period could result in the default handler being called, which will print "Unexpected interrupt", which we've seen before in EMU86 but not on real hardware. I've tried disabling interrupts in this critical section but still doesn't fix it, nor does that answer the problem of how request_irq got called twice.

The irqtab_init routine also stores the original IRQ 0 entry point, which is then called every 5th time by the _irqit interrupt entry handler. But I see EMU86 already sets a default vector and issues an EOI, which is correct.

The kernel then calls the following routine to enable the hardware timer:

void enable_timer_tick(void) 
{
    /* set the clock frequency */
    outb (TIMER_MODE2, (void*)TIMER_CMDS_PORT);
    outb (TIMER_LO_BYTE, (void*)TIMER_DATA_PORT);       /* LSB */
    outb (TIMER_HI_BYTE, (void*)TIMER_DATA_PORT);       /* MSB */
}

Since EMU86 does not wait until these commands are sent to the timer, it seems that now that we're emulating the 8259 controller the timer chip may have to be emulated also?

Do you think this might be the problem? I can't think of a way to turn off the IRQ 0 interrupt in EMU86 to test; disabling interrupts, as mentioned above, doesn't solve the boot problem.

ghaerr commented 3 years ago

@mfld-fr: I implemented a first pass at enabling and disabling the 8253/8254 timer chip in #42 and that fixes all the issues discussed above. The hardware timer was occurring at times when it was not enabled and that was causing the problems.

The stdio, sdl and emscripten versions now all work quite well, and use a fixed timing ratio requiring no modifications between platforms.

The current implementation in #42 is bare bones and works for ELKS. I'll let you handle any further enhancements to it, thanks!

mfld-fr commented 3 years ago

So as far as I understand, apart the improvement of the main loop and console backend processing, you found a missing enable / disable function in the PIT emulation for ELKS need. Nice spot !

But I think it also reveals a flaw in ELKS IRQ handling : if another IRQ is in use at ELKS initialization time, that the kernel is not aware of, it would break the feature using that IRQ by clearing the related vector, in addition to potentially crash whenever this IRQ frequency is high.

A better option would be to change the vector only when adding a new handler, and to disable the interrupt only during the time of the vector change.

ghaerr commented 3 years ago

But I think it also reveals a flaw in ELKS IRQ handling

Yes, but I don't quite see how exactly?

if another IRQ is in use at ELKS initialization time, that the kernel is not aware of, it would break the feature using that IRQ by clearing the related vector, in addition to potentially crash whenever this IRQ frequency is high.

At boot, EMU86 presets the IRQ 0 vector to F000:1000h, where an EOI, IRET is coded. Then ELKS starts, and it disables interrupts in irqtab_init and copies a new vector for IRQ 0 (and saves the old one). This new table points IRQ 0 to 'default_handler'. Interrupts are then enabled. 'request_irq' then rewrites the action handler to 'timer_tick' and disables interrupts during the vector change.

The 'default_handler' action ends up calling printk("Unexpected interrupt"). I thought I had enhanced ELKS kernel printk such that it will work very early, even by calling 'early_putchar' so that printk works throughout kernel startup. But it seems this is not happening, or may be buggy with BIOS console, so I don't really understand the full problem yet.

Can you clarify where you are seeing the bad ELKS code? I don't see it ever actually clearing the related vector.

ghaerr commented 3 years ago

@mfld-fr :

I see some possibilities for where the ELKS kernel problem may be. Although the kernel properly handles reseting an interrupt vector and disables interrupts while doing so, the hardware interrupt vector is not actually set to 'default_handler' but instead always set to the general kernel interrupt handler entry point '_irqit'. This fancy piece of code has been discussed and analyzed separately, in both ELKS and MON86 projects.

Looking at '_irqit', there are a couple items that could cause problems with an early interrupt:

mfld-fr commented 3 years ago

Let me try to describe the error case : suppose before ELKS boots, the BIOS, more likely a ROM extension on a sound board, installs an handler for IRQ 11, and starts the sound engine that uses that IRQ 11 intensively, for example to welcome musically the user at PC boot :smile:.

I would like to play my own music in ELKS too, so I write a driver for that sound board, activate the option CONFIG_IRQ11 and insert a call to request_irq() at driver initialization. When ELKS boots, while the sound engine is still playing the default welcome music and firing continuously the IRQ 11, irqtab_init redirects the vector to the default handler, as the driver is not initialized yet.

The welcome music would be interrupted, and until the ELKS sound driver intialization, I would have a flood of "unexpected interrupt". Some time later on, the ELKS sound driver installs its own handler and reinitializes the sound engine. If not lucky, the sound engine could not ever be recovered as it was left in a bad state, because there was, during a short period, no correct interrupt handling to feed it with music data or whatever control data.

A most robust sequence would be : ELKS does not touch any vector, until the drivers come. A driver first resets the device and disables the interrupt generation, then requests ELKS to redirect the vector to its interrupt handler, then enables again the interrupt generation.

ELKS generic interrupt code (or PIC specific code) cannot enable / disable the interrupt generation, as doing this depends on the generating device HW, and masking the interrupt at the PIC level does not stop the device to work in the background and to fire interrupts.

For example, the code that enables / disables the PIT is not in the right place in the irq_init function. The code related to the PIT should be moved to a 'PIT driver', with its own knowledge of the PIT HW (including the fact that the PIT interrupt is mapped in IRQ 0). This would also allow to use another PIT than the 8253 / 8254 one. Same for the PIC related code, that should be moved to a 'PIC driver', to allow to use another PIC than the 8259 one.

ghaerr commented 3 years ago

Yes, I can see what you are saying, that the kernel could be better organized. For the second case, concerning the fact that the PIC and PIT code are not grouped into their own "drivers" or source files, I particularly agree. One of the reasons I'm interested in the EMU86 project is that it is helping me see exactly what the ELKS kernel is dependent upon, for portability reasons. And we've just identified another issue, the hard way!

However, for the first case, where sound is playing and we end up with "unexpected interrupts" for a period - that's a small price to pay for a kernel which intends to take over the entire system, and handling theses cases, as well as the special IRQ 0 original handler stub, would take up more code for little gain. AND - we still don't know exactly where the bug is with the kernel, since we're no longer getting "unexpected interrupts" with early interrupts, just a panic/halt.

For example, the code that enables / disables the PIT is not in the right place in the irq_init function. The code related to the PIT should be moved to a 'PIT driver', with its own knowledge of the PIT HW (including the fact that the PIT interrupt is mapped in IRQ 0). This would also allow to use another PIT than the 8253 / 8254 one. Same for the PIC related code, that should be moved to a 'PIC driver', to allow to use another PIC than the 8259 one.

I agree. The question is how to keep this quite small, so as to not impact the kernel space too much. There are many other cases like this as well. In general, for portability reasons, getting all ELKS kernel hardware dependencies moved to at least a single file per hardware chip would be nice.

masking the interrupt at the PIC level does not stop the device to work in the background and to fire interrupts.

I don't understand this - shouldn't masking the interrupt at PIC level stop all interrupts from that hardware? For our particular problem, why isn't the PIC IRQ 0 mask reset until enable_irq(0) stopping EMU86 from issuing an interrupt? Isn't that an additional bug?

ghaerr commented 3 years ago

shouldn't masking the interrupt at PIC level stop all interrupts from that hardware? For our particular problem, why isn't the PIC IRQ 0 mask reset until enable_irq(0) stopping EMU86 from issuing an interrupt?

In int-elks.c:

// PIC I/O write

int int_io_write (word_t p, byte_t b)
    {
    if (p == 0) {
        if (b == 0x20) {  // EOI
            int_end_prio ();
            }
        }

    return 0;
    }

The mask byte being sent is not being looked at - in the case of '0xff' (all interrupts disabled), this is not being handled in int_end_prio() properly, correct?

My understanding is that the mask byte should be interpreted and only interrupts whose mask value is 0 should be allowed, in priority order.

With this fix, we would not need to emulate the PIT, since the initial mask value of 0xFF would have disabled any interrupts until the first call to enable_irq(0) in irq_init().

mfld-fr commented 3 years ago

I don't understand this - shouldn't masking the interrupt at PIC level stop all interrupts from that hardware?

Unfortunately, no. The PIC only 'masks' the input IRQ lines, but has no control over the devices that drive these IRQ lines. To stop all HW interrupt activity, one should address each device and disable the IRQ line driving in all that devices (i.e. 'device interrupt masking').

ghaerr commented 3 years ago

The PIC only 'masks' the input IRQ lines, but has no control over the devices that drive these IRQ lines.

Agreed - but the function of the PIC is to inhibit interrupt requests to the CPU, even if the device interrupt line is high, when the mask is set to 1, correct?

If not, what is the function of the PIC mask?

mfld-fr commented 3 years ago

The mask byte being sent is not being looked at - in the case of '0xff' (all interrupts disabled), this is not being handled in int_end_prio() properly, correct?

No need to look at the line mask in end_prio, because the masking is applied earlier:

// Ignore the signal if masked

        if (!_int_mask [line]) _int_req [line] = 1;
ghaerr commented 3 years ago

No need to look at the line mask in end_prio, because the masking is applied earlier:

int _int_mask [INT_LINE_MAX] =
    { 0, 1, 1, 1, 1, 1, 1, 1};  // timer unmasked by default

I see. ELKS uses the mask to turn off various interrupts, and _int_mask is never changed by the EOI code. I think it should be. Especially since ELKS sets the mask in enable/disable_irq:

unsigned char cache_21 = 0xff, cache_A1 = 0xff;

/*
 *      Low level interrupt handling for the X86 PC/XT and PC/AT
 *      platform
 */

void enable_irq(unsigned int irq)
{  
    unsigned char mask;

    mask = ~(1 << (irq & 7));
    if (irq < 8) {
        cache_21 &= mask;
        outb(cache_21,((void *) 0x21));
    } else { 
        cache_A1 &= mask;
        outb(cache_A1,((void *) 0xA1));
    }
}

In addition, it appears that ELKS expects the cache_21 value to be 0xFF, when EMU86 is initialized to 0x7F. But that isn't the reason for this bug, since any BIOS may have previously enabled IRQ 0 and ELKS boots.

In addition, there is an (unused) config option:

void INITPROC irq_init(void)
{
    flag_t flags;

#ifdef CONFIG_HW_259_USE_ORIGINAL_MASK  /* for example Debugger :-) */
    cache_21 = inb_p(0x21);
#endif
mfld-fr commented 3 years ago

Yes, 8259 PIC emulation in EMU86 does not handle yet the I/O access to port 0x21 to mask / unmask some specific IRQ. This should be implemented before emulating another device, otherwise the IRQ from that device would be ignored (= masked) even if ELKS unmasks it.

ghaerr commented 3 years ago

Ok, I didn't realize EMU86 doesn't emulate any other devices (yet) so it doesn't matter.

This issue can be closed, thank for your help.

I think I would like to try to use EMU86 to determine where/why exactly the ELKS kernel is crashing. I should be able to use "set breakpoint" in order to easily catch and/or trace the IRQ 0 interrupt, I will try that. EMU86 should be made useful to debug crashes like this, I'll report on that.

Other possibly interesting enhancements

Thank you!

mfld-fr commented 3 years ago

Yep, interesting enhancements & improvements ideas for EMU86, except the stack trace that is already implemented (s debugger command). For example, you can set the initial code breakpoint in the halt() function, then dump the stack.

mfld-fr commented 3 years ago

Ok, I didn't realize EMU86 doesn't emulate any other devices (yet) so it doesn't matter.

Only the PIC and the PIT for ELKS target for now, in addition to the console. For Advantech, it emulates the PIC, the PIT and the serial port for SYS86 experiment.