tb_flags and mmu_index - Githubissues

sorear commented 6 years ago

Two commits from my tree have gone missing during the change of administration:

sorear/riscv-qemu-system@a038a287 sorear/riscv-qemu-system@30c9f4fc

One of these is a correctness fix; basically, gen_intermediate_code needs to not use any information from the CPUState that isn't included in tb_flags. The current code is mixing up translations generated in different privilege and ISA states, which caused all Go executables to crash, and will cause more problems with mixed-ISA cores.

How do we proceed from here? @michaeljclark do you understand the issues involved in these patches well enough to proceed, or shall I try to redo them on the upstreaming branch? I don't think this needs to be an upstreaming blocker, although it was noticed.

michaeljclark commented 6 years ago

@sorear Thanks!

Yes I understand the issue. I'll review your commits and incorporate them very soon i.e. before the v4 spin of the upstream patch series.

That's the reason why I kept your pull request open. Is it safe to assume we can close your PR once the tb_flag changes are incorporated?

sorear commented 6 years ago

When we have:

PLIC and virtio support
Documented build instructions somewhere conspicious for producing a priv-1.10 vmlinux with virtio drivers and running it in qemu
WFI support
Privilege changes without tlb_flush
tb_flags usage working

then I'd be comfortable closing that PR. It sounds like we have 1 and 3, and 2 via sifive/freedom-u-sdk#21 ?

sorear commented 6 years ago

Fixed typo in link to the (4) patch.

michaeljclark commented 6 years ago

I think you might need to rebase your patches.

Regarding privilege mode. Generated code should be the same regardless of the privilege mode it was generated under. We shouldn't be imposing any architectural constraint that does not exist in the architecture itself. The processor allows the same piece of code to be run at multiple privilege levels. While this may very well be an attack, there is nothing the processor can do about it. Adding mode to tb_flags will just have the effect of causing the code to be re-translated (duplicated).

I can't see any invariant conditional code generation in translate.c besides misa which we can solve by disabling misa. Support for misa writes are optional and what I would consider "nice-to-have", but at the moment there is no hardware I know of that supports on-the-fly misa changes so it is a bit of a corner case. It is also not exercised by any present-day code e.g. riscv-linux does not write to misa. U-mode code does not have access to misa so this pariticlar issue should not affect riscv-go.

Privilege checks are performed by the CSR helpers which means the codegen for the different privilege modes (calls to helpers) should be correct as the generated code is exactly the same regardless of the privilege level when the code was generated. We can safely refer to any cpu state that is static i.e. does not change after initialization or is invarient to the generated code. The codegen for helpers is the same regardless of privilege level as the privilege level is checked in the helper routines.

It is correct that there is a misa write invariant bug with respect to cached translations. I've disabled misa writes in my branch as an expedient fix. A brute force approach would be to call tb_flush().

Regarding mmu_index, i'm indeed curious about how we can eliminate unnecessary soft tlb flushes when enabling and disabling SUM which is done frequently in copy_to_user and copy_from_user when entering and leaving the kernel. However, I tried a simple test by adding SUM to mmu_idx and masking it off to split privilege mode and the SUM flag but this simple change resulted in hangs.

Below is a simple change to pass SUM via mmu_index and it causes linux-kernel to hang (I don't know if the linux-kernel you tested was using SUM as support was not in earlier kernel e.g. if I pass MXR, linux-kernel is fine, because MXR is not actually used and is constant):

diff --git a/target/riscv/helper.c b/target/riscv/helper.c
index 9b029a0..7dc6ea1 100644
--- a/target/riscv/helper.c
+++ b/target/riscv/helper.c
@@ -30,22 +30,24 @@ int riscv_cpu_mmu_index(CPURISCVState *env, bool ifetch)
 #ifdef CONFIG_USER_ONLY
     return 0;
 #else
-    target_ulong mode = env->priv;
+    int mode = env->priv, sum;
     if (!ifetch) {
         if (get_field(env->mstatus, MSTATUS_MPRV)) {
             mode = get_field(env->mstatus, MSTATUS_MPP);
         }
     }
     if (env->priv_ver >= PRIV_VERSION_1_10_0) {
+        sum = get_field(env->mstatus, MSTATUS_SUM);
         if (get_field(env->satp, SATP_MODE) == VM_1_10_MBARE) {
             mode = PRV_M;
         }
     } else {
+        sum = !get_field(env->mstatus, MSTATUS_PUM);
         if (get_field(env->mstatus, MSTATUS_VM) == VM_1_09_MBARE) {
             mode = PRV_M;
         }
     }
-    return mode;
+    return (mode & 3) | (sum & 1) << 2;
 #endif
 }

@@ -123,7 +125,8 @@ static int get_physical_address(CPURISCVState *env, hwaddr *physical,
      * correct, but the value visible to the exception handler
      * (riscv_cpu_do_interrupt) is correct */

-    const int mode = mmu_idx;
+    const int mode = mmu_idx & 3;
+    const int sum = (mmu_idx >> 2) & 1;

     *prot = 0;
     CPUState *cs = CPU(riscv_env_get_cpu(env));
@@ -136,12 +139,11 @@ static int get_physical_address(CPURISCVState *env, hwaddr *physical,

     target_ulong base;

-    int levels, ptidxbits, ptesize, vm, sum;
+    int levels, ptidxbits, ptesize, vm;
     int mxr = get_field(env->mstatus, MSTATUS_MXR);

     if (env->priv_ver >= PRIV_VERSION_1_10_0) {
         base = get_field(env->satp, SATP_PPN) << PGSHIFT;
-        sum = get_field(env->mstatus, MSTATUS_SUM);
         vm = get_field(env->satp, SATP_MODE);
         switch (vm) {
         case VM_1_10_SV32:
@@ -159,7 +161,6 @@ static int get_physical_address(CPURISCVState *env, hwaddr *physical,
         }
     } else {
         base = env->sptbr << PGSHIFT;
-        sum = !get_field(env->mstatus, MSTATUS_PUM);
         vm = get_field(env->mstatus, MSTATUS_VM);
         switch (vm) {
         case VM_1_09_SV32:

If you rebase your patches against the current tree I can consider incorporating them (after testing). I have a large suite of test images... and folk at SiFive have been running the glibc test suite. It would also be nice to have a reproducer for the riscv-go issues. Disabling misa writes solves the first issue. The mmu_index issue is a performance issue and I'd be happy to merge code that works. If we can figure out why simply adding the SUM flag to mmu_idx causes hangs, then we can likely address the mmu_index issue. Adding MXR, of course, doesn't cause any issues because the flag is not actually used. I need to do some further digging to figure out the scope of mmu_index. I notice that SMAP on i386 appears to modify mmu_index however i'm not sure why it works there.

Here's my development branch which solves the misa issue in a different way (by disabling writes):

https://github.com/michaeljclark/riscv-qemu/tree/qemu-devel

I think i need help with the mmu_index performance improvement (to eliminate tlb flushes to mstatus flags such as SUM).

michaeljclark commented 6 years ago

I have to say I find it rather unusual that ctx->mem_idx (cpu_mmu_idx) is used in the codegen interface for loads and stores essentially generating different code for loads and stores depending on the privilege mode.

I note that the CSR* functions exit the translation block so that mstatus changes that result in changes to mmu_idx (namely riscv_set_mode) cause an exit from the translation block, such that the next translation block has the new value for mmu_idx.

In any case, the small patch in the comment above demonstrates that there is consistency issue withmmu_idx and the SUM flag. I don't know if you had tested your patch with a PUM/SUM enabled copy_to/from_user

I also don't know whether you have tested whether the dirty bit is being set after your changes. I've followed the approach of spike and rv8, only caching r or x permissions on loads so that successive stores will get a TLB miss, and re-walk the page table, as the dirty bit is only set on page table walks. See here:

https://github.com/michaeljclark/riscv-qemu/blob/1f3482605600f88592e018393f952987d722813b/target/riscv/helper.c#L217-L227

sorear commented 6 years ago

mem_idx in translate.c is generated as a function of the current privilege mode and every location where mem_idx is used is a location where generated code depends on the privilege mode at the time of translation. It would be possible to avoid this, but at the cost of requiring a tlb_flush on all privilege changes.

Regarding Go, at the time this was a problem an asm("rdtime a0") executed in user mode was sufficient to cause a crash; this depends on details of Linux and BBL and if it fails to repro it does not prove that qemu is clean.

Regarding disabling misa writes, flushing on misa write is insufficient, because you can have two cores with different misa values, and tb_flush is global.

I'll start working on the rebase. There were several bugs in that code that partially cancelled out; it's necessary to fix all of them.

sorear commented 6 years ago

I'm less interested in matching spike and rv8 than I am in matching qemu on other architectures (reduce maintenance and review burden).

Regarding the codegen interface for loads and stores, each mmu_idx selects a cache for address translations; supervisor code and user code cannot share a translation cache unless we flush it at privilege changes, which is expensive.

You are correct about the rationale for ending the translation buffer after CSR instructions.

michaeljclark commented 6 years ago

i386 does the same as spike and rv8. It removes PAGE_WRITE unless it is a write. It would be a bug to set all protection flags as we would break dirty bit handling, which relies on another page walk being triggered.

    if (!(pte & PG_DIRTY_MASK)) {
        /* only set write access if already dirty... otherwise wait
           for dirty access */
        assert(!is_write);
        prot &= ~PAGE_WRITE;
    }

michaeljclark commented 6 years ago

I believe it is what rocket does too. Andrew mentioned this to me. I was originally stashing a pointer to the PTE in the TLB so I didn't need to rewalk the page table to update the dirty bit, but it makes the TLB entry larger, and the PTE might point somewhere else next time it is walked. I changed rv8 to be consistent with how it tends to be done in hardware.

Although your code misses updating dirty bits altogether if the first access is a load or a fetch as the TLB now has write in its permission bits, so won't miss on subsequent stores.

michaeljclark commented 6 years ago

I understand how mem_idx works in principle. We currently have mode in mmu_idx (taking into account MPRV). I just don't understand why adding SUM breaks things. I was able to safely remove one tlb_flush on mode changes given mode is in mmu_idx and any instruction that changes mmu_idx (CSRW) causes a translation exit, making the next section take into account the new mmu_idx. There are comments to that effect.

It's just the TLB flushes on other mstatus bits which we are missing. SUM is the most important one for performance because of copy_to/from_user however MXR could also be done using the same mechanism assuming we can get SUM to work.

It would be nice if we can actually identify the bugs in this code, if any. There should be reduced page walks for loads or fetches on RX sections given I took a cue from your patch and set both R+X if they are both present in the PTE. I just didn't want to break dirty bit handling.

BTW - Linux kernel and bbl are currently mapped RWX as 2MiB pages hence the kernel memory protection warning message on boot. We need to adjust the bbl payload and linux sections to 4KiB boundaries so we can free initdata, mark rodata RO and text RX. That is probably higher priority than saving avoiding one TLB flush, given it is a relatively major security issue...

sorear commented 6 years ago

I do not believe that Linux's choice of how to construct its page tables has anything whatsoever to do with this thread, which is about QEMU's handling of CSR writes and translating code under different MPRV values.

michaeljclark commented 6 years ago

Regarding misa. I'm trying to avoid risky or large changes until we get upstream. Forbidding misa writes is valid given no known hardware actually supports it and it is specifically mentioned as optional in the spec. As far as I know, SiFive are deferring supporting it in hardware. It seems expedient to just disable misa writes for the time being. Someone can contribute the feature when we are upstream. The bug is solved until then. i.e. we no longer have variable state that affects codegen in translate.c because misa is now read-only. Simplicity and correctness until then...

To support MXL, SXL and UXL runtime changes (which are conceptually similar to misa changes and are supported by RISCVEMU) we actually need to remove all of the compile time #if defined (TARGET_RISCV32) #elif defined (TARGET_RISCV64) and replace with runtime checks, possibly folding qemu-system-riscv32 and qemu-system-riscv64 into qemu-system-riscv. Likewise this may also mean folding qemu-riscv32 and qemu-riscv64 into qemu-riscv. Or perhaps we could just support mode switches between RV32/RV64 in qemu-system-riscv64 and keep the same existing with multiple binaries.

We can log an issue "enable misa writes to misa" and we can keep your pull open...

sorear commented 6 years ago

There are correctness issues that aren't addressed by disabling writes to misa. Please wait for me to finish the rebase.

sorear commented 6 years ago

You have not solved the bug, because you have variable state (PRIV, MPP, MPRV, etc) which affects codegen. Please leave MXL out of this, it's a different issue and no known core supports writable MXL. Writable M/A/F/D/C is required to support some configurations of rocket-chip, it could be excluded for initial upstreaming but we absolutely need MPRV working.

michaeljclark commented 6 years ago

BTW - I spotted another bug in the existing get_physical_address code related to the accessed and dirty bits.

One can set the access and dirty bits to 1 to increase performance, i.e. have the page table walker elide the stores to update accessed and dirty bits. Setting the access and dirtied bits to 1 is also a technique that allows reading page table entries from ROM.

This is mentioned in the specification.

The A and D bits are never cleared by the implementation. If the supervisor software does not rely on accessed and/or dirty bits, e.g. if it does not swap memory pages to secondary storage or if the pages are being used to map I/O space, it should always set them to 1 in the PTE to improve performance.

The current code does the store unconditionally. I will fix this...

michaeljclark commented 6 years ago

Fixed MPRV by making a minimal change and added cpu_mmu_index to tb_flags:

This is performing about 2.5X faster than the earlier code on the dd test.

I've also fixed accessed and dirty updates (although they still need to be made atomic, however the current SMP support is multiplexed so this is not an issue): https://github.com/michaeljclark/riscv-qemu/blob/82012aef90e5c4500f926523b3b2ff621b0cd512/target/riscv/helper.c#L201-L207

It seems to be a fundamental design flaw that generated code is processor state dependent.

sorear commented 6 years ago

Please hold off on moving the baseline so that I can do the rebase like you asked me to.

michaeljclark commented 6 years ago

At this point, there is only the possibility to elide TLB flush on MPRV/SUM. i.e. performance improvement vs bug fix. Unless you can point out a bug in the existing logic. If you rebase, please don't break dirty bit handling. There were other bugs I have fixed in that code.

michaeljclark commented 6 years ago

@palmer-dabbelt @sorear if there is still a bug remaining in this code, then can we please focus on a minimal fix. I'm wary of making large changes to this code before upstreaming as it has had extensive testing. The patch should identify the faulty logic it is correcting.

e.g. https://github.com/michaeljclark/riscv-qemu/commit/82012aef90e5c4500f926523b3b2ff621b0cd512

RISC-V - Eliminate MPRV influence on mmu_index

There is a chance that M mode traces may be run with different
mstatus.MPRV setting meaning memory access would have an invalid
mmu_index, so we defer reading MPRV until get_physical_address
is called. This has a negliible impact on performance.

One alternative is to add cpu_mmu_index in tb->flags however
this seems like it should be fixed in generic code. i.e. traces
should implicitly include cpu_mmu_index. Note: helpers that can
modify MPRV terminate the current trace block so that cpu_mmu_index
is revalidated but this does not solve the problem where a trace
could be re-entered with a different MPRV setting.

To avoid TLB flushes from flags that affect virtual memory
resolution, the traces would be indexed by cpu_mmu_index.

We also need to get any proposed fix tested by @palmer-dabbelt

sorear commented 6 years ago

After reviewing the current tb_flags and mmu_index code I think we're good.

I'd like to see better support for heterogenous SoCs, which will require encoding some amount of ISA information in the tb_flags, but for the features that we have now the flags handling is fine.

riscvarchive / riscv-qemu

tb_flags and mmu_index #92