riscvarchive / riscv-qemu

QEMU with RISC-V (RV64G, RV32G) Emulation Support

RISC-V: mttcg is incompatible with atomic PTE updates #103

Closed sorear closed 6 years ago

sorear commented 6 years ago

Explanation in comments. It doesn't seem possible to test this with Linux (Linux does not create clean-but-writable or old-but-valid PTEs), so we'll need to find another way to test it before merging.

michaeljclark commented 6 years ago

While there is a race, it is a theoretical problem from the perspective of riscv-linux, as the kernel needs to do an SFENCE.VMA on other threads after updating the page tables. Unless we have a reproducible failure, I'd rather we just keep this code as it is.

The second, unrelated change will prevent TLB misses from happening for subsequent stores to a page that is initially accessed with a load. We can only set W permission if the dirty bit is already set or if the access is a store. I've discussed this with Richard Henderson.

I was curious whether QEMU's x86 PTE accessed and dirty bit updates are atomic. No, they are not. Linux would have the same issue here on x86; however, it is theoretical, as one would need to craft code to exercise the race without an SFENCE.VMA (page table updates are done on one thread under the mm lock in Linux).

https://github.com/riscv/riscv-qemu/blob/713f2c116481d568702759bcb1b7fed835a2d575/target/i386/excp_helper.c#L423-L431

The second change is not theoretical. It is likely frequent that we get a load to a page marked RW, and if we don't set the TLB to miss on subsequent write, we'll miss updating the dirty bit.

I'm not going to merge this PR in its current form.

I'd accept an expansion on the current comment...

michaeljclark commented 6 years ago

Or maybe x86_stl_phys_notdirty has some atomicity magic. I will look...

michaeljclark commented 6 years ago

Nope. It appears target/i386 PTE updates in qemu are also racy.

sorear commented 6 years ago

@michaeljclark I asked in #qemu a week ago, and the reason x86 works is that x86 does not enable mttcg. Also, I don't see how SFENCE.VMA protects you here. The race is:

  1. thread 1, in userspace, attempts to access a page, loads PTE
  2. thread 2 in the kernel unmaps the page, sets PTE.V = 0
  3. thread 1 rewrites the PTE with V=A=1
  4. thread 2 issues a SFENCE.VMA IPI
  5. thread 1 receives the IPI, but the PTE has the wrong value due to (3)
  6. thread 1 accesses the page after honoring the IPI because it refetched the wrong PTE
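The lost update in step 3 can be sketched in plain C (hypothetical helpers, not the actual QEMU code; `PTE_V`/`PTE_A` mirror the Sv39 flag layout, and the interleaving is replayed sequentially so the outcome is deterministic). A blind store of the previously walked value clobbers the concurrent invalidation, while a compare-and-swap detects it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits as in the Sv39 PTE layout. */
#define PTE_V (1u << 0)   /* valid */
#define PTE_A (1u << 6)   /* accessed */

/* What the walker effectively does today: write back the value it
 * loaded earlier with A set, ignoring any concurrent change (step 3). */
static void set_accessed_blind(uint64_t *pte_slot, uint64_t walked_pte)
{
    *pte_slot = walked_pte | PTE_A;
}

/* A CAS-based update notices that the PTE changed and fails instead. */
static bool set_accessed_cas(uint64_t *pte_slot, uint64_t walked_pte)
{
    return __atomic_compare_exchange_n(pte_slot, &walked_pte,
                                       walked_pte | PTE_A, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

Replaying steps 1-3 above: after `set_accessed_blind` the unmapped PTE is valid again (the bug), whereas `set_accessed_cas` fails and leaves the unmap intact.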

Regarding the final change, I think you have this backward? I removed a case where PAGE_WRITE is set, so it causes more re-walks, not fewer.

michaeljclark commented 6 years ago

This is what we have presently:

            /* only add write permission on stores or if the page
               is already dirty, so that we don't miss further
               page table walks to update the dirty bit */
            if ((pte & PTE_W) &&
                    (access_type == MMU_DATA_STORE || (pte & PTE_D))) {
                *prot |= PAGE_WRITE;
            }
            return TRANSLATE_SUCCESS;

It seems quite clear to me. What am I missing?

The old code worked because it only set the prot bits for the current access type, i.e. even more conservative.

michaeljclark commented 6 years ago

I think the correct fix for the atomicity issue is to use a compare and swap. If the PTE has changed, we can't update accessed or dirty, and probably need to return TRANSLATE_FAIL, as the PTE has changed underneath us while we were doing the walk.
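A minimal sketch of that fix, under stated assumptions: `update_ad_bits` and the in-memory slot are illustrative names, not QEMU's actual API, and the GCC/Clang `__atomic` builtin stands in for whatever intrinsic QEMU would provide for guest physical memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return codes mirroring the walker's conventions. */
enum { TRANSLATE_SUCCESS = 0, TRANSLATE_FAIL = 1 };

#define PTE_A (1u << 6)   /* accessed */
#define PTE_D (1u << 7)   /* dirty */

/* Try to set A (and D for stores) with a single compare-and-swap
 * against the value loaded during the walk; if the PTE changed
 * underneath us, fail the translation. */
static int update_ad_bits(uint64_t *pte_slot, uint64_t walked_pte,
                          bool is_store)
{
    uint64_t updated = walked_pte | PTE_A | (is_store ? PTE_D : 0);
    if (updated == walked_pte) {
        return TRANSLATE_SUCCESS;   /* bits already set, nothing to do */
    }
    if (__atomic_compare_exchange_n(pte_slot, &walked_pte, updated, false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        return TRANSLATE_SUCCESS;
    }
    return TRANSLATE_FAIL;          /* PTE changed underneath us */
}
```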

sorear commented 6 years ago

That's fine in the context of the current code but if you apply the first two hunks from my patch without the third it will result in lost dirty-bit updates, since MMU_DATA_STORE no longer implies that the page is now dirty.

sorear commented 6 years ago

I think the correct fix for the atomicity issue is to use a compare and swap. If the PTE has changed, we can't update accessed or dirty, and probably need to return TRANSLATE_FAIL, as the PTE has changed underneath us while we were doing the walk.

I agree completely, but we don't currently have a casq_phys. If you'd prefer I can try to get a casq_phys added to upstream qemu and then we can do this cleaner.

michaeljclark commented 6 years ago

I'd rather do a compare and swap, and if it fails, return TRANSLATE_FAIL, because the PTE has changed underneath us; i.e. we got new information when trying to update the AD bits, which means the translation is invalid.

sorear commented 6 years ago

https://github.com/riscv/riscv-qemu/pull/103#issuecomment-366079923 and https://github.com/riscv/riscv-qemu/pull/103#issuecomment-366081227 seem to be exactly the same.

michaeljclark commented 6 years ago

The particular logic and comment for setting PAGE_WRITE is IMHO much clearer than the way x86 does it (sets it, then later removes it). When one has a single clause, i.e. if PTE.W, only add PROT_WRITE if it is a store or if the page is already dirty, it's much easier to understand as one compact statement, without needing to consider context that obfuscates the reasoning for the code.

michaeljclark commented 6 years ago

I'm going to try and find a compare and swap intrinsic. We obviously have to return TRANSLATE_FAIL if we find that our PTE is no longer there.

There are second-order issues. What if the PTE space is self-mapped: do we update dirty bits for the mapping of the PTE page itself? (Probably not.) What about AD bits for non-leaf PTEs?

michaeljclark commented 6 years ago

PTE AD bit accesses are obviously physical, so one wouldn't expect pages that map them to have their dirty bits set, even when the page's underlying physical memory has changed. Detecting changes to pages holding PTEs is only really an issue for someone implementing shadow page tables.

As for non-leaf PTE AD bits, I'll have to read the spec... I wouldn't expect them to need to be set. The leaf PTEs, of whatever size, are the level at which the OS monitors pages.

michaeljclark commented 6 years ago

It seems we can use helper_atomic_cmpxchgl_le_mmu and helper_atomic_cmpxchgq_le_mmu to perform atomics on the guest address space.

michaeljclark commented 6 years ago

Actually, it would be best if we just restart the walk if the AD bits compare-exchange fails.
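That restart strategy might look like the following sketch; `walk_and_set_ad` is a hypothetical name, and the single PTE slot with a bare validity check stands in for the real multi-level walk and permission checks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V (1u << 0)   /* valid */
#define PTE_A (1u << 6)   /* accessed */
#define PTE_D (1u << 7)   /* dirty */

enum { TRANSLATE_SUCCESS = 0, TRANSLATE_FAIL = 1 };

/* Instead of failing the translation when the A/D-bit compare-and-swap
 * loses, reload the PTE and walk again. The loop body stands in for
 * a full re-walk of the page table. */
static int walk_and_set_ad(uint64_t *pte_slot, bool is_store)
{
    for (;;) {
        uint64_t pte = __atomic_load_n(pte_slot, __ATOMIC_SEQ_CST);
        if (!(pte & PTE_V)) {
            return TRANSLATE_FAIL;      /* PTE no longer valid */
        }
        uint64_t updated = pte | PTE_A | (is_store ? PTE_D : 0);
        if (updated == pte ||
            __atomic_compare_exchange_n(pte_slot, &pte, updated, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
            return TRANSLATE_SUCCESS;
        }
        /* CAS failed: someone changed the PTE; restart the walk */
    }
}
```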

sorear commented 6 years ago

It seems we can use helper_atomic_cmpxchgl_le_mmu and helper_atomic_cmpxchgq_le_mmu to perform atomics on the guest address space.

Those (a) take virtual addresses, and we need something that takes physical addresses; and (b) longjmp when an I/O region is encountered, and are therefore unsafe to call outside TCG-generated code.

Your requirement to use CAS is making this very complicated.

michaeljclark commented 6 years ago

In any case, it's a theoretical issue until we get a reproducer.

The prot flags for the dirty bit handling in the current form are correct. I can make a test case for dirty bit handling pretty easily, as I've written one that does most of this testing already.

Easiest approach is to leave the code as it is.

Correct approach is to make the PTE update atomic, and re-translate if the PTE is no longer valid.

Sure it won't work on IO space, but that is an unusual corner case.

michaeljclark commented 6 years ago

If you can provide a reproducer for the issue in riscv-linux, and we see the change fixes the issue, then fine... until then it's theoretical: a simulation accuracy issue that would require specially crafted code to expose. If there is a genuine bug showing up in riscv-linux, I'd like to be able to reproduce it on my side, i.e. email me a test image.

sorear commented 6 years ago

I'll see about adding a reproducer to riscv-tests. Will come back when I have one; may be a few weeks.

michaeljclark commented 6 years ago

Good. It might very well expose hardware bugs.

It reminds me of this race (although it's not a hardware bug; it's documented behaviour that misaligned atomics only work within a cache line on x86):

https://gist.github.com/michaeljclark/31fc67fe41d233a83e9ec8e3702398e8

michaeljclark commented 6 years ago

Sagar redid the load of the PTE before the update to the dirty bits, but that is kind of futile, and the previous code didn't check whether it got a different result on the second load. My rationale: if there is a race, adding several hundred picoseconds to it doesn't make much difference. It needs to be atomic.

michaeljclark commented 6 years ago

If I were to do it in hardware, I'd restart the walk, because one needs to use an AMO to update the PTE; but while one is walking, one is doing loads, not AMOs, until one finds a leaf, so there is a race in hardware too. In fact, you need to do LR/SC to compare and swap. I guess it needs to take a reservation on the load of the leaf PTE, but given the walker doesn't know it has a leaf in advance, it would have to take reservations on all page walk loads.

Interesting. Usually uncovered hardware bugs become the documented behaviour.

Also interesting to try this on x86.

michaeljclark commented 6 years ago

Here's another test. What if you have a PTE in ROM that doesn't have the AD bits set? Does the bus error from setting the AD bits cause the translation to fail?

Apple puts PTEs in ROM on Arm: iBoot does this for early boot of the secure enclave processor.

michaeljclark commented 6 years ago

i.e. PTEs in ROM are not theoretical, on Arm at least. The spec says to set the AD bits to suppress updates. I really don't know how they make it atomic in hardware? i.e. a potential spec bug (over-constraint). Races between loads and stores in the page walker would likely be visible to other threads on x86 if one tried to invalidate the PTE while it was being walked. OS developers tend to use locks around those bits of code.

michaeljclark commented 6 years ago

That's another bug: the code doesn't handle the case of misconfigured PTEs in ROM. What happens to stores to PTE AD bits in memory marked RO?

That actually hit me in the rv8 page walker, as I had PTEs in ROM without the AD bits set. I don't know what the documented behaviour is.

michaeljclark commented 6 years ago

I think we are going to have racy updates on i386. Maybe not (with some caveats). See https://github.com/riscv/riscv-qemu/pull/106

That said, the AD bits are in the low 32 bits of the PTE, so we can figure out the address and just update the low 32 bits of the PTE on a 32-bit host. It does mean we can only check 22 bits of the PPN (as there are 10 bits of flags).

In any case, that would be a corner case of a corner case: we'd have to have the combination of a relatively rare race, a 32-bit host, a physical address space bigger than 34 bits (22+12, or 16 GiB), and a PTE update that differs by more than 16 GiB. I can add an #ifdef so that it also works on i386, with the caveat that we're only atomic for a 16 GiB address space on 32-bit hosts.

michaeljclark commented 6 years ago

It's a third-order corner case. In any case, we will need to alter the code to work on the low word of the PTE for RV64 guests on 32-bit hosts, as it doesn't compile on i386.

sorear commented 6 years ago

We are not going to have racy updates for riscv64-on-i386 because mttcg is disabled then. Search the code for TCG_OVERSIZED_GUEST.

michaeljclark commented 6 years ago

For RV64 on a 32-bit host, in addition to losing a relatively rare race, the PPN must differ by a multiple of 16 GiB, i.e. the low 22 bits of the PPN and the flags need to match while the high-order bits differ. I think it's relatively safe to say that, probabilistically, this is very unlikely to occur. If we restrict the physical address space to 16 GiB on 32-bit hosts, then it is impossible.

sorear commented 6 years ago

#105 is a better patch than I expected, and I'd rather not discuss it in two places.

michaeljclark commented 6 years ago

Okay, that sounds better. I'll #ifdef the old non-atomic code path inside TCG_OVERSIZED_GUEST and try compiling it on i386...

sorear commented 6 years ago

@michaeljclark suggested https://github.com/riscv/riscv-qemu/pull/105/files#r168971263