Interaction between management instructions and dirty bit

jrtc27 commented 2 years ago

The prefetch and zeroing instructions are very clear about how they interact with the accessed and dirty bits, but there is no similar discussion for the management instructions, and the semantics of that are quite important. Am I missing something or is this totally unspecified at the moment? I can see scope for multiple options here. The fact that the instruction is allowed to invalidate even when you don't have store permission (which is rather frightening) makes it particularly unclear, as there is not an implicit write translation that means it can just unconditionally dirty the page like on other architectures. To be honest, this might be a justification for enforcing write permission on PTEs...

jrtc27 commented 2 years ago

Also, if you're trying to invalidate a read-only page whose accessed bit is clear on hardware that relies on software A+D emulation, the fact that you get a store/AMO fault for that is going to be a total PITA for software to deal with, as it can't just treat a store/AMO fault as requiring store permission on the PTE to not be a fatal fault any more.

jrtc27 commented 2 years ago

So, whilst in practice there may not be a security justification for mandating PTEs be writeable, I strongly believe that that relaxation makes it a complete nightmare to have any kind of sane behaviour when it comes to accessed and dirty bits, and that the extension should just follow other architectures like AArch64 and require write permission rather than trying to be clever about it.

dkruckemyer-ventana commented 2 years ago

A quick one to address the specification: For management CBOs, the spec does state what happens with the accessed bit and, by omission, what happens with the dirty bit (don't check and don't change). I agree that the latter should probably be made more explicit in the specification, especially since the dirty bit behaviors are explicit for the other two extensions.

The other issue of permissions is a longer topic that I have promised multiple people I would get back to (see issue #44 - we certainly owe the community a lengthy description of the rationale for posterity). Please be patient, and I promise I'll have something in the new year.

jrtc27 commented 2 years ago

Not checking and not changing sounds terrifying...

dkruckemyer-ventana commented 2 years ago

One perspective is that management CBOs are not explicit stores, and that the dirty bit is only set when an explicit store is executed. Flush and clean do not change the values visible to the harts, and although invalidate might, it's not actually "dirtying" memory. Certainly, those instructions have performed a memory access, so checking and setting the accessed bit seems appropriate.

In what way is the above "terrifying"?

jrtc27 commented 2 years ago

Because an invalidate is a programmer-initiated instruction to potentially change memory, which is a dirtying action in my book

jrtc27 commented 2 years ago

The result being that it's basically impossible to delegate CBO.INVAL permission to entities that sit above the layer doing swap, for example, even if you make sure to clean all pages before handing them out. Otherwise there's this potential sequence:

Userspace writes value 1 to page
Kernel marks page as dirty
Userspace writes 2 to page
Kernel swaps page out
Userspace reads 2 from page (kernel swaps page in but leaves page on disk)
Userspace invalidates page
Userspace reads back the original 1 because that never left the LLC (unlikely, but possible)
Kernel evicts page again, but doesn't re-write out to disk because still dirty
Userspace reads back the 2, despite having seen the 1 (kernel swaps page in)

Now, yes, that's a bit of a contrived sequence, but the point is to demonstrate the nature of the problem. I think the irony is that, by relaxing the preconditions of the instruction, you've actually constrained its potential use.

dkruckemyer-ventana commented 2 years ago

Because an invalidate is a programmer-initiated instruction to potentially change memory, which is a dirtying action in my book

Ah, but the semantic intent of invalidate is to discard stale cached data, either to expose data in memory that have been written by a "non-coherent" agent or to drop data written by a hart but are no longer useful. In the former case, the hart is a consumer (i.e. reader) of data produced by the non-coherent agent, and in the latter case, the hart is executing the op as a performance optimization to prevent consuming memory bandwidth to write back stale data. Although the latter case clearly requires write permission (and therefore it would be appropriate to require write permission for the invalidate), the system may not want to grant write permission to the consumer in the former case, so requiring write permission for the invalidate is not necessarily appropriate.

One could certainly argue that those two use-cases have different semantic intent, and therefore each requires a separate instruction, mostly to handle the permissions/security issues differently.

(What I think this highlights is that "write permission" is fundamentally different from "invalidate permission," but there's not sufficient justification to make this distinction in the page tables. Hence, we have a global switch in CSRs to map the invalidate instruction to a different, "safe" operation for the consumer case.)

Regardless, in both cases, the invalidate is not creating "new" data in memory, like an explicit store does, so the dirty bit is not checked or set on these accesses.

jrtc27 commented 2 years ago

You could equally view it as an uncached (cache-bypassing) load, cached store of the uncached data and clean. Look, I get the argument, but I do not agree with it, and I believe it's going to come back to bite. There is surely a reason that Arm switched from requiring read to requiring write permission when moving from AArch32 to AArch64.

dkruckemyer-ventana commented 2 years ago

Now, yes, that's a bit of a contrived sequence, but the point is to demonstrate the nature of the problem. I think the irony is that, by relaxing the preconditions of the instruction, you've actually constrained its potential use.

I like contrived sequences. They make architecture fun. :)

What I don't understand is how the old value of 1 remains in a cache after the second store of 2 (step 3) or the page has been swapped out (step 4) and then back in (step 5). Swapping out and in pages would require some sort of hardware coherent DMA or cache management plus non-coherent DMA, but in either case, the value of 2 should be stored to disk and all copies with a value of 1 should be invalidated, updated, or inaccessible by some other means.

jrtc27 commented 2 years ago

1 isn't in the cache, it's in DRAM. 2 is in the cache, until you invalidate it, at which point it's 1 again.

jrtc27 commented 2 years ago

and then reappears on subsequent swap-in (step 9) because it's re-written to the cache by the kernel

jrtc27 commented 2 years ago

In the former case, the hart is a consumer (i.e. reader) of data produced by the non-coherent agent

The sequence for this is:

invalidate
initiate + wait for DMA to finish
read data

The act of DMA'ing into the buffer is notionally a write. Yes, this is an external agent, so the MMU doesn't apply here, but whoever gave you access to that DMA-capable device is already granting you the ability to write to the pages, so I don't see why giving write permission from the MMU's perspective is a problem. Not to mention that in a "real" system you'll also have an IOMMU in the way that you'll have to ask the kernel to set up for you with write permission for those pages. But CBO.INVAL allows you to totally bypass that for any page you can read.

dkruckemyer-ventana commented 2 years ago

Still not following. Here's what I think is happening:

1. Userspace writes value 1 to page

The value 1 is either in a cache or in memory. Regardless, that's what's visible to a hart.

2. Kernel marks page as dirty
3. Userspace writes 2 to page

Now the value 2 is either in a cache or in memory. The value 1 could be in another cache or in memory (assuming the value 2 didn't get written to memory), but a hart will only ever see the value 2.

4. Kernel swaps page out

At this point, the value 2 must be written out to disk, since that is the most up-to-date value, and whatever coherence mechanism must ensure that property. Whether any copies remain with either value is system dependent.

5. Userspace reads 2 from page (kernel swaps page in but leaves page on disk)

So the value 2 should be in memory (and any cached copies) and on disk, and the page is clean. Swapping the page in entails invalidating/updating any cached copies, since between steps 4 and 5, the physical addresses could be used for a page from a different context. The value 1 should no longer be readable.

6. Userspace invalidates page
7. Userspace reads back the original 1 because that never left the LLC (unlikely, but possible)

I'm still not sure how the value 1 remains in a cached copy. The only value that the hart should be able to read is 2.

8. Kernel evicts page again, but doesn't re-write out to disk because still dirty

Do you mean still clean (or not dirty)?

9. Userspace reads back the 2, despite having seen the 1 (kernel swaps page in)

jrtc27 commented 2 years ago

5. Userspace reads 2 from page (kernel swaps page in but leaves page on disk)
So the value 2 should be in memory (and any cached copies) and on disk, and the page is clean

This is the point where you go wrong: there's no reason it needs to be in both memory and cache. Polled IO will put it into the cache, as will boring DMA that acts as an LLC client (it probably wouldn't on miss, but on hit sure it'll just update the line), but that line won't necessarily be written back to replace the 1 in memory. Yes, in the "I did a DMA from disk to the actual physical page via the interconnect behind the LLC" case the actual point of coherence will have 2, but that isn't the only case that can happen.

jrtc27 commented 2 years ago

8. Kernel evicts page again, but doesn't re-write out to disk because still dirty
Do you mean still clean (or not dirty)?

(yes, wording got butchered there along the way...)

dkruckemyer-ventana commented 2 years ago

5. Userspace reads 2 from page (kernel swaps page in but leaves page on disk)
So the value 2 should be in memory (and any cached copies) and on disk, and the page is clean
This is the point where you go wrong: there's no reason it needs to be in both memory and cache. Polled IO will put it into the cache, as will boring DMA that acts as an LLC client (it probably wouldn't on miss, but on hit sure it'll just update the line), but that line won't necessarily be written back to replace the 1 in memory. Yes, in the "I did a DMA from disk to the actual physical page via the interconnect behind the LLC" case the actual point of coherence will have 2, but that isn't the only case that can happen.

I get your point. In that case, the value 1 isn't the only value to be concerned about. It could be a bank account number, for example, since the swapped-in page could be located anywhere in the physical address space. At that point, I believe it's the responsibility of the privileged software to ensure one of the following:

The invalidate instruction is mapped to a flush
The copy in physical memory is consistent with all the cached copies (i.e. perform a clean or flush as part of a swap)
The software that can perform an invalidate is trusted enough to observe whatever values it might find

But, given the load value rules, the last one may be out of bounds, since the rules don't really allow random values to appear. (It does appear that the rules need to be tightened up to disallow the scenario where, once an invalidate "rolls back" data to an old value, previously newer values cannot be observed again.)

Ultimately, the execution environment bears some responsibility for ensuring that memory state is consistent and safe with respect to the invalidate. (There's even a non-normative statement to this effect in the current spec, though arguably it's not thorough enough, or even in the best location of the spec.)

dkruckemyer-ventana commented 2 years ago

In the former case, the hart is a consumer (i.e. reader) of data produced by the non-coherent agent

The sequence for this is:
1. invalidate

2. initiate + wait for DMA to finish

To be complete.... software needs to perform an invalidate or flush here, since speculation and/or hardware prefetching can cause soon-to-be-stale data to be cached before the non-coherent DMA writes the memory location.

3. read data
The act of DMA'ing into the buffer is notionally a write. Yes, this is an external agent, so the MMU doesn't apply here, but whoever gave you access to that DMA-capable device is already granting you the ability to write to the pages, so I don't see why giving write permission from the MMU's perspective is a problem. Not to mention that in a "real" system you'll also have an IOMMU in the way that you'll have to ask the kernel to set up for you with write permission for those pages. But CBO.INVAL allows you to totally bypass that for any page you can read.

I understand your argument, and now it's my turn for a contrived example.... :)

The consumer may not necessarily be the driver that controls the device. In this case, you have three agents: the consumer, which has only read permission and simply reads memory; the driver, which may actually have write permission and asks for write permission on behalf of the device; and the device, which writes the memory.

In reality, the general consensus has been that the architecture should guide software away from using invalidate for most use-cases, because the instruction has subtle, undesirable security properties. Note that, for non-coherent producer-consumer communication, a flush is sufficient and can be equally performant; however, the "discard stale write data" case has measurable benefits in some systems, so there were arguments for keeping invalidate in the set of management CBOs.

jrtc27 commented 2 years ago

Why wouldn't the driver do the invalidate then in that situation? It knows best whether the DMA is coherent, and is the one initiating it, so it should be the one to ensure coherence when the agent is non-coherent.

riscv / riscv-CMOs

Interaction between management instructions and dirty bit #45