mikaku / Fiwix

A UNIX-like kernel for the i386 architecture
https://www.fiwix.org
Other
401 stars 32 forks source link

Remove unneeded low memory kernel virtual address mappings #88

Closed ghaerr closed 5 months ago

ghaerr commented 5 months ago

Previously, Fiwix mapped the memory used for the master page directory (kpage_dir) and associated page tables as an identity mapping because both kpage_dir was kept as a physical address and map_kaddr was written to require it.

In addition, the VGA video memory was identity mapped into kernel virtual address space.

This approach ended up creating entries in the lowest (low 4MB) entry in the master Page Directory, below PAGE_OFFSET, as the entire page directory is copied for each new process created, thus adding an undesirable memory mapping into every process. Things worked only because the ELF executable start address is usually at ~128M (0x08004800) which is above the 1M -> _last_data_addr (0x00100000+) KVA mapping used.

By only using map_kaddr after the final kernel Page Directory and page tables have been activated, it was rewritten to use the already-mapped PAGE_OFFSET virtual addresses instead. This saves a lot of needless extra mapping tables and also removes any entries created below PAGE_OFFSET in the master page directory, used by fork, etc. More physical memory is now available as free page frames as a result.

The VGA and FBCON drivers were also updated to use video.address + PAGE_OFFSET to eliminate the need for an identity mapping below 1M. The identity mapping for the kernel itself to 1M was also removed, as that was unneeded after setup_tmp_pgdir was called.

Tested using VGA and FB console on QEMU with and without initrd.

ghaerr commented 5 months ago

BTW, in order to see the kernel low memory map allocated in user space processes before this PR change, apply the following:

diff --git a/kernel/syscalls/fork.c b/kernel/syscalls/fork.c
index 5d99002..068ae24 100644
--- a/kernel/syscalls/fork.c
+++ b/kernel/syscalls/fork.c
@@ -84,6 +84,10 @@ int sys_fork(int arg1, int arg2, int arg3, int arg4, int arg5, struct sigcontext
    }
    child->rss++;
    memcpy_b(child_pgdir, kpage_dir, PAGE_SIZE);
+    int i;
+    for (i=0; i<512; i++) {
+        if (child_pgdir[i]) printk("[%d:%08x]", i, child_pgdir[i]);
+    }
    child->tss.cr3 = V2P((unsigned int)child_pgdir);

    child->ppid = current->pid;

Each forked process will display a single [0:<mapped kaddr>] value which was the PDE mapping to low memory in the virtual address space (for kernel use only).

ghaerr commented 5 months ago

BTW, the next step in these memory management changes will be the elimination of having to call map_kaddr in mem_init(). This is because with the next small enhancement, map_kaddr will be able to call kmalloc to allocate page tables, rather than having to reserve space using the current _last_data_addr = map_kaddr(..., _last_data_addr, ...) being used now to reserve page tables.

That in turn will allow various drivers (including BGA and FBCON initially) to include all their kernel address mapping code within their own source, rather than having to include portions in mm/memory.c.

That change would then eliminate the following code in mem_init():

    /*
     * FIXME:
     * Why do I need to reserve the page tables for video framebuffer
     * here, instead of using kmalloc() in fbcon_init() and bga_init()?
     */
    video.pgtbl_addr = _last_data_addr;
    if(video.flags & VPF_VESAFB) {
        /* reserve 4 page tables (16MB) */
        _last_data_addr += (PAGE_SIZE * 4);
    }
mikaku commented 5 months ago

Yes, before #88 QEMU showed this memory mappings:

(qemu) info mem
00000000000a0000-00000000010a0000 0000000001000000 -rw
00000000c0000000-00000000cffdf000 000000000ffdf000 -rw
00000000fd000000-00000000fe000000 0000000001000000 -rw

After #88 it shows:

(qemu) info mem
00000000c0000000-00000000cffdf000 000000000ffdf000 -rw
00000000fd000000-00000000fe000000 0000000001000000 -rw

The problem is that the Kexec feature no longer works. You can test it by enabling the KEXEC_OPTION, rebuilding the kernel, and finally including the following arguments in the kernel command line (in GRUB):

kexec_proto=multiboot1 kexec_size=500 kexec_cmdline="ro root=/dev/hda2"

As soon as you press the key b to boot, it just reboots.

I've not checked other features. That was the first one that I wanted to test.

ghaerr commented 5 months ago

The problem is that the Kexec feature no longer works.

Thank you for your testing. I thought it would work since when Kexec is enabled, the following line should actually map a low-memory kernel address (which will be copied into user space, etc):

last_data_addr = map_kaddr(KEXEC_BOOT_ADDR, KEXEC_BOOT_ADDR + (PAGE_SIZE * 2), _last_data_addr, PAGE_PRESENT | PAGE_RW);

My plan was to move that mapping into the kernel/kexec.c code, not to executed until actually needed. That, in turn, would require that the previous line:

bios_map_reserve(KEXEC_BOOT_ADDR, KEXEC_BOOT_ADDR + (PAGE_SIZE * 2));

reserve memory such that the KEXEC_BOOT_ADDR pages remain free. I hadn't tested that yet. I've been reading up on all the PRs, documentation and code in Kexec and will look into what is happening.

Thank you!

ghaerr commented 5 months ago

(qemu) info mem

How is it that you get into the QEMU monitor?

mikaku commented 5 months ago

How is it that you get into the QEMU monitor?

In the top menu of QEMU, go to View->compatmonitor0 or just press Ctrl+Alt+2.

ghaerr commented 5 months ago

including the following arguments in the kernel command line (in GRUB): kexec_proto=multiboot1 kexec_size=500 kexec_cmdline="ro root=/dev/hda2"

Can the GRUB command line be provided to QEMU -kernel using the -append option, so Kexec can be tested using QEMU? I am worried about the double quote requirement around kexec_cmdline=.

mikaku commented 5 months ago

I am worried about the double quote requirement around kexec_cmdline=.

Escaping the double quote should do the job, although I have never tested it before.

ghaerr commented 5 months ago

You can test it by enabling the KEXEC_OPTION, rebuilding the kernel, and finally including the following arguments in the kernel command line (in GRUB): kexec_proto=multiboot1 kexec_size=500 kexec_cmdline="ro root=/dev/hda2"

Sorry for the dumb questions here... Where is the kexec kernel specified in the above, or is it specified by running the kexec command from a shell prompt? I'm a bit confused as to exactly when the kexec kernel is being booted for your test code.

Also, a quick look at kernel/kexec.c shows that for the Linux kernel, it depends on a low-memory identity mapping when it clears a bunch of memory:

    /*
     * Clear memory. This is intended to avoid unexpected results if the
     * new kernel guesses its uninitialized variables are zeroed.
     */
    for (dst = (char *)0, len = KEXEC_BOOT_ADDR; len; len--) {
        *dst++ = 0;
    }

    for (dst = (char *)0x100000, len = (unsigned int)kernel_src_addr - 0x100000; len; len--) {
        *dst++ = 0;
    }

    /* install the kernel previously stored in RAMdisk by the user */
    for (dst = (char *)0x100000, src = kernel_src_addr, len = kernel_size; len; len--) {
        *dst++ = *src++;
    }

This will definitely fail, as the low memory mapping has been removed!

Looking at the kexec_multiboot1() code shows that it uses the memory below KEXEC_BOOT_ADDR, which will also cause a page fault because no mapping:

    idle->tss.eip = (unsigned int)KEXEC_BOOT_ADDR;

    /* stack starts at the end of the page */
    esp = (unsigned int *)(KEXEC_BOOT_ADDR + PAGE_SIZE - 4);

    /* space reserved for the cmdline string (256 bytes) */
    esp -= 256 / sizeof(unsigned int);

Thus, I am pretty sure this is the problem.

IMO, the special low memory identity mapping for kexec should be moved out of mem_init() since it is only executed once, and only during a kexec. This will also help cleanup the problem of having device driver and other "special" stuff being handled in mem_init, rather than their own respective init routines.

It appears, at least for the Linux kexec, that the kexec_linux() routine stomps all over memory has it loads the Linux kernel. Thus, a failed kexec_linux is irrecoverable. For kexec_multiboot(), it appears to only require KEXEC_BOOT_ADDR for two pages, and (now), some pages below that to write the multiboot header, etc.

Thus, perhaps when CONFIG_KEXEC is enabled, a few pages around KEXEC_BOOTADDR could be reserved (for multiboot), and then when kexec is executed, an identity mapping put in place at the start of the kexec routines. This appears to be able to work well, since it seems there is never a return from the kexec_xxx routines.

This approach would leave the low level identify mapping out of the page directory until kexec is actually called/used, with the exception that a small amount of pages would be reserved around KEXEC_BOOT_ADDR the CONFIG_EXEC is compiled in.

What are your thoughts?

ghaerr commented 5 months ago

This will definitely fail, as the low memory mapping has been removed!

Sorry - it didn't notice that for kexec_multiboot, paging has been disabled. Thus I'm totally incorrect, you can ignore most of my previous comment. In addition, the zeroing of memory occurs mostly for multiboot, not linux kexec.

I am unsure as to exactly why the kexec is failing. I will investigate further.

mikaku commented 5 months ago

You need some information key about the Kexec feature on Fiwix. The docs/kexec.txt document will help you to understand it better.

Keep in mind that by adding the parameters kexec_proto= ... kexec_size=... etc. nothing in the file kernel/kexec.c is being executed. The functions there are only executed when the system shuts down, which is when the Kexec feature executes.

What the Fiwix kernel does when you put these parameters in the command line is basically to prepare the /dev/ram device with the size specified in kexec_size to hold the kernel that you plan to Kexec. Hence, if the system reboots promptly is because something wrong is in mem_init() and/or during the memory reservation for the RAMdisk drive.

I hope that helped you.

mikaku commented 5 months ago

After reading the file docs/kexec.txt, you might also want to take a look to #65, to get more information on how Kexec using Linux protocol works, which is a bit different than with Multiboot1 protocol.

ghaerr commented 5 months ago

I learned how to use the KEXEC feature, and ended up finding the problem. I can now use your above instructions to boot another Fiwix kernel using the KEXEC feature with this PR's changes.

It seems that with these changes removing the low memory identity mapping, another kernel bug was found. What was happening is that when I compiled the kernel with #define CONFIG_KEXEC, the kernel would not boot at all! Chasing this down revealed the following code path being executed:

#ifdef CONFIG_KEXEC
    if(kexec_size > 0) {
        bios_map_reserve(KEXEC_BOOT_ADDR, KEXEC_BOOT_ADDR + (PAGE_SIZE * 2));
        _last_data_addr = map_kaddr(KEXEC_BOOT_ADDR, KEXEC_BOOT_ADDR + (PAGE_SIZE * 2), _last_data_addr, PAGE_PRESENT | PAGE_RW);
    }
#endif /* CONFIG_KEXEC */

Here is the bios_map_reserve function, called above with CONFIG_KEXEC enabled:

void bios_map_reserve(unsigned int from, unsigned int to)
{
    if(is_addr_in_bios_map(from)) {
        bios_map_add(from, to, MULTIBOOT_MEMORY_AVAILABLE, MULTIBOOT_MEMORY_RESERVED);
        reserve_pages(from, to);
    }
}

What was happening is that reserve_pages was called:

void reserve_pages(unsigned int from, unsigned int to)
{
    struct page *pg;

    while(from < to) {
        pg = &page_table[from >> PAGE_SHIFT];
        pg->data = NULL;
   ...

This function reserve_pages is being called early in mem_init() BEFORE the page_table variable has been initialized by page_init()! Without the low address mappings, it apparently somehow didn't crash, but without them, the kernel crashes and reboots.

I pushed a one-line fix, and now the KEXEC feature works:

void bios_map_reserve(unsigned int from, unsigned int to)
{
    if(is_addr_in_bios_map(from)) {
        bios_map_add(from, to, MULTIBOOT_MEMORY_AVAILABLE, MULTIBOOT_MEMORY_RESERVED);
        if (page_table_size)
            reserve_pages(from, to);
    }   
}

The reserve_pages call isn't strictly necessary in this code path, since page_init also reserves pages that aren't in the bios_map using:

        /*
         * Some memory addresses are reserved, like the memory between
         * 0xA0000 and 0x100000 and other addresses, mostly used by the
         * VGA graphics adapter and BIOS.
         */
        if(!is_addr_in_bios_map(addr)) {
            pg->flags = PAGE_RESERVED;
            kstat.physical_reserved++;
            continue;
        }

After getting everything working and testing quite a bit, I wanted to move the map_kaddr out of mem_init into kexec_multiboot1 (and kexec_linux). But for some reason, moving the map_kaddr there, even after reserving a page table for it similar to the way the video page tables are reserved, I could not get it to work. I'm not sure why. That means that Fiwix still reserves the addresses 0x9D000 - 0x9EFFF in the kernel page directory when CONFIG_KEXEC is set.

Thus this PR only removes low memory kernel VA mappings completely when CONFIG_KEXEC is not defined, and when defined, only a small address set is mapped, which is still an improvement.

ghaerr commented 5 months ago

Regarding moving map_kaddr out of mem_init into kexec_multiboot1: I’ve now realized I think the problem is that map_kaddr is updating the master kpage_dir, not the copies that need it (the current and idle processes).

This can likely be enhanced in another PR if you like the idea. That would free up all low memory mappings even with KEXEC enabled, if desired.

ghaerr commented 5 months ago

With the last commit, a number of enhancements, started with this PR, are completed. These include:

mikaku commented 5 months ago

It seems that with these changes removing the low memory identity mapping, another kernel bug was found.

Oh, that bug was well hidden. I think that this is a good case of when two errors compensate themselves.

Thus this PR only removes low memory kernel VA mappings completely when CONFIG_KEXEC is not defined, and when defined, only a small address set is mapped, which is still an improvement.

Agreed. Good work and thanks for the big effort you put here.

With the last commit, a number of enhancements, started with this PR, are completed. These include:

Removing all low memory kernel virtual address mappings, including when CONFIG_KEXEC set Enhance map_kaddr to allow kmalloc allocation of page tables for kernel mappings Enhance map_kaddr to allow kernel mappings for other processes, not just master kpage_dir Remove requirements for reserving page table space in mem_init for framebuffer and kexec Move most special handling of kernel mappings out of mem_init to kexec_multiboot and kexec_linux functions

This looks like an amazing PR. I always feel scared with so many critical changes because it's difficult to make sure that all will continue working. I need to do some testings to make sure all is working fine.

ghaerr commented 5 months ago

I always feel scared with so many critical changes

Actually, none of Fiwix memory management is changed in any fundamental way. The biggest change is that map_kaddr was rewritten to use virtual addresses for accessing the PDE and PTEs, and map_kaddr is only used/called by the FBCON and BGA drivers, along with KEXEC; that's it. The move to using only high-mapped virtual addresses in map_kaddr allowed the previous map_kaddr calls to map lower kernel address space to be removed.

I was pretty careful not to change anything actually significant, so I feel pretty good about the changes, especially since badly-mapped address spaces tend to produce page faults in application programs very quickly.

I need to do some testings to make sure all is working fine.

Since the PTE mappings for FBCON and BGA work slightly differently, it might be a good idea to test different console framebuffer variations as well as VGA console. I have also done that.

Also, none of the KEXEC code has changed - except performing two map_kaddr calls - the first maps KEXEC_BOOT_ADDR into the current process space so that the trampoline can be copied to low physical memory, and the second maps the same into the idle process, which is required since the TSS switch returns using EIP set to the same trampoline boot address. (Previously, only one map_kaddr was required since it was mapped into the master kpage_dir PDE table).

mikaku commented 5 months ago

The move to using only high-mapped virtual addresses in map_kaddr allowed the previous map_kaddr calls to map lower kernel address space to be removed.

Yes, now the name map_kaddr has even more sense.

Since the PTE mappings for FBCON and BGA work slightly differently, it might be a good idea to test different console framebuffer variations as well as VGA console. I have also done that.

Yes, all testings are doing good.

Also, none of the KEXEC code has changed - except performing two map_kaddr calls - the first maps KEXEC_BOOT_ADDR into the current process space so that the trampoline can be copied to low physical memory, and the second maps the same into the idle process, which is required since the TSS switch returns using EIP set to the same trampoline boot address. (Previously, only one map_kaddr was required since it was mapped into the master kpage_dir PDE table).

I'm surprised how quickly you feel comfortable with the code. You are a very skilled person. You look like a veteran C programmer. I'm very pleased to have you here around this project. :-)

mikaku commented 5 months ago

Thank you very much.