Clarify effect of HFENCE.GVMA on cached VS-stage translations

JamesKenneyImperas commented 2 years ago

The Privileged Architecture specification does not say explicitly whether HFENCE.GVMA has any effect on cached VS-stage translations. Based on the discussion here https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/4LCb61llJUg/m/RLMvDUyfBwAJ?utm_medium=email&utm_source=footer, please could a simple clarification be added to section 8.3.2 (Hypervisor Memory-Management Fence Instructions) of the form:

In implementations with separate G and VS stage TLB caches, there is no requirement that execution of HFENCE.GVMA invalidate any VS-stage TLB entries.

Thanks.

aswaterman commented 2 years ago

I’m not yet convinced the spec supports the interpretation discussed on this thread. Until further guidance is given (hopefully soon but possibly not until January) I advise implementers to assume that HFENCE.GVMA must invalidate any VS-stage translations that would have been affected by the mappings affected by the HFENCE.GVMA.

JamesKenneyImperas commented 2 years ago

Hello Andrew,

If there is to be a delay here, please could you insert an interim note at this point in the specification to that effect, describing the two behaviors under consideration? As it stands, implementers will be making implicit assumptions about the required effect based on their previous experience - for instance, I assumed that the behavior here would correspond to the Arm architecture, in which invalidation of stage 2 entries does not cause stage 1 entries to be invalidated, and I don't think I am the only one who thought this. For those considering migrating from Arm, I think the behaviour you are suggesting might come as an unpleasant surprise.

I'm not a hardware designer, but it would seem to me that the only practical way to conform with a requirement to invalidate any VS-stage translations that would have been affected by the mappings affected by the HFENCE.GVMA is in fact to invalidate all VS-stage entries with matching VMID. Being more specific requires tagging each VS-stage entry with a list of G-stage page descriptions for each level of lookup and traversing these on each VS-stage entry when HFENCE.GVMA is executed, which sounds like a significant overhead in complexity. But, I repeat, I'm not a hardware designer, so I would be happy to be proven wrong.

Thanks.

aswaterman commented 2 years ago

If I could have been more definitive, I would’ve been. For now, I advise people to see my previous response.

gfavor commented 2 years ago

To the original question of this issue, the following statement from the H chapter - and its implications - seems to be pretty clear, at least to me (with my 'bold' highlight):

Executing an HFENCE.GVMA instruction guarantees that any previous stores already visible to the current hart are ordered before all subsequent implicit reads by that hart of guest-physical memory-management data structures done for instructions that follow the HFENCE.GVMA.

This means that caches of combined VS-stage and G-stage information (as well as caches of just G-stage information), must be affected by HFENCE.GVMA.

Practically speaking, most forms of caches of partial or full VS-stage translation information will fall into this "hybrid" category due to the implicit caching of G-stage info because of all the G-stage tablewalk's for translating the GPA's of VS-stage PTE reads.

As Andrew noted, one could do such a cache with a bunch of extra tag information (if one felt that cost/benefit was justified) so that one could do selective invalidation of entries for address-specific HFENCE.GVMA's. Otherwise the more straightforward approach is to invalidate all entries.

aswaterman commented 2 years ago

Greg has pretty much summarized my understanding of the situation. So I guess the question is, should we add some non-normative text to this effect?

aswaterman commented 2 years ago

I’ll also cc @jhauser-us since it appears Greg’s and my understanding is inconsistent with his (or I misunderstood his email).

JamesKenneyImperas commented 2 years ago

Hello Andrew,

I think it would be very helpful to add non-normative text now to explicitly state the implication of the specification as written, i.e. that in most practical implementations execution of HFENCE.GVMA will invalidate all cached VS-stage TLB entries (with matching VMID only, I think). This will deter people from making assumptions based on their past experience.

Having done this, I do think that the specified behavior should be reviewed, since it does not seem to be what some users expect or want.

Thanks.

aswaterman commented 2 years ago

https://github.com/riscv/riscv-isa-manual/pull/794/files seems like the first step in clarifying this. Because HFENCE.GVMA orders implicit accesses done for G-stage address translation, and because VS-stage address translation causes implicit accesses for G-stage address translation, it follows that HFENCE.GVMA must flush any VS-stage translation structures populated by relevant implicit accesses for G-stage address translation.

If y'all prefer, I can rewrite my previous sentence into a non-normative note.

JamesKenneyImperas commented 2 years ago

Hello Andrew,

Yes, I think it would be very helpful to have an explicit note about the required effect of HFENCE.GVMA on cached VS-stage translation structures, as what you are saying definitely isn't what was discussed on the linked email thread with John (though perhaps your discussion with him has moved on since then). Please can you make this as unambiguous as possible, because this is not what many people will be expecting from past experience.

Thanks.

aswaterman commented 2 years ago

I'd like @jhauser-us to agree before I proceed.

jhauser-us commented 2 years ago

I don't think there's any disagreement, but a misunderstanding in the conversation. James Kenney, if I heard you correctly before, you are thinking of a cache of only VS-stage translations, from guest virtual addresses to guest physical addresses, whereas I believe Andrew is thinking of a cache that combines VS-stage and G-stage translations. On the isa-dev mailing list, I wrote:

An HFENCE.GVMA need not affect an address translation cache (TLB) that caches only VS-stage translations.

Andrew can correct me, but I'm pretty sure he does not disagree with that statement. Andrew wrote above:

HFENCE.GVMA must flush any VS-stage translation structures populated by relevant implicit accesses for G-stage address translation.

Note how he's speaking of VS-stage and G-stage translation, because he's thinking of a cache of the combined translations. An HFENCE.GVMA must affect such a cache.

Does that clear up the misunderstanding, or am I the one who's confused? Did I misrepresent anybody's views?

JamesKenneyImperas commented 2 years ago

Hi John,

Yes, you've summarized exactly the behavior I would expect to see if VS-stage and G-stage address translations are cached separately.

I didn't pursue the situation when VS and G stage translation attributes are merged into a single TLB entry in the other thread, as this wasn't what our customers were disagreeing about. I don't like to upset the applecart further, but I think that in that implementation scenario in the Arm architecture, a stage 2 (G stage) TLB invalidation is permitted to have no effect at all on cached TLB entries. I'm not a Hypervisor writer and never understood the logic of that, so what you propose in this case seems sensible to me.

Thanks.

gfavor commented 2 years ago

I still feel like there may be some misunderstanding (although equally I may be wrong). In particular, most forms of VS-stage partial or full translation caching that I can imagine will intrinsically/unavoidably embed G-stage translation information within them. This is because each VS-stage PTE read must be translated by G-stage to a PA before that PTE can actually be read.

Hence most any form of VS-stage-only translation caching must be subject to HFENCE.GVMA.

jhauser-us commented 2 years ago

A hypervisor is normally responsible for providing a valid supervisor-level execution environment for a guest. Although a hypervisor may move some of a guest's main memory pages temporarily to/from disk, or just rearrange pages in physical memory as it chooses, it must not inexplicably modify the contents of main memory pages as seen by a guest, or the hypervisor is not providing a valid supervisor execution environment for that guest.

The RISC-V hypervisor extension exists to improve the performance of "legitimate" hypervisors that do what we say a hypervisor should. We can therefore assume the usual case will be that a hypervisor won't modify a guest's main memory pages without some reason known to the guest. If a guest has some reason to believe the contents of its VS-stage page tables may have been changed by other agents (which may actually be the hypervisor, though the guest may not know that), the guest should itself execute an SFENCE.VMA. But if a guest has no reason to expect the contents of its VS-stage page tables to be overwritten by others, then the hypervisor should not be changing them mysteriously, even if it is quietly moving those guest pages in physical memory. It follows in that case that any cached VS-stage-only address translations (mapping guest virtual addresses to guest physical addresses) should still be valid and do not need to be invalidated.

A hypervisor may act unusually and modify the contents of a guest's main memory unexpectedly (for debugging, for example), but if it does and there's any chance the modified memory is within a VS-stage page table of the guest, then the hypervisor may need to execute a generic HFENCE.VVMA to be safe. But that's separate from the purpose served by HFENCE.GVMA.

aswaterman commented 2 years ago

Agreed, John.

I've attempted to tackle James' request for non-normative clarification. It came out sounding a bit tutorial, but at least I think it's clear. Hope y'all agree. https://github.com/riscv/riscv-isa-manual/commit/2344b1575652f6a62286668678ebca323cc1ec39

JamesKenneyImperas commented 2 years ago

Thanks Andrew, that looks very clear. The only thing I still wonder about is in the merged-VS-G-stage case where you say:

For such implementations, any entry that was populated using a G-stage translation that is ordered by an HFENCE.GVMA instruction must be flushed.

Given John's clear description of the division of responsibilities between guest and hypervisor, it feels to me that the only entries that really need to be invalidated are those that map addresses using that G-stage page - entries that were created using page table walks that traversed it in the past can be preserved, because the hypervisor shouldn't be pulling the rug out from underneath the guest in this way, as John described. This might make the task of implementing more selective invalidation feasible because "all" that needs to be checked is one GPA range (of the original G-stage entry) against the implied GPA ranges of active merged TLB entries. However, this still sounds pretty difficult, given that VS-stage and G-stage entries can be different sizes - invalidating a single G-stage page may require deletion of multiple TLB entries, and information about the original G-stage entry GPA range may be unavailable, so many implementations may still end up invalidating all entries with matching VMID as you say. I wonder if this is why in the Arm architecture they say that in implementations that merge stage 1 and stage 2, there is no requirement that stage 2 tlbi instructions invalidate any merged entries? Is there some trick of Hypervisor design that allows this? Perhaps you and John can track down someone who knows more about this than I do in case an optimization of some kind is possible here.

Anyway, the case with separate VS-stage and G-stage cached entries is now quite clear: thank you again.

aswaterman commented 2 years ago

OK, I'll re-clarify and weaken the "impractical" comment. I agree that since only a GPA tag and page-size metadata need be kept, it's at least conceivable to filter out these flushes. It still seems like an unlikely optimization, since it's fairly costly and is optimizing a case that's hopefully fairly uncommon.

I'm not terribly interested in trying to chase down ARM's rationale for not flushing merged entries. If HFENCE.GVMA didn't invalidate merged entries, then most uses of HFENCE.GVMA would need to be followed up by an HFENCE.VVMA with an x0 address argument (since the hypervisor doesn't know what GVAs might use that GPA). This would seemingly be a de-optimization, as it would render useless the kind of optimization you're describing.

(And indeed, upon checking the KVM source code, they always follow up an address-specific HFENCE.GVMA with an HFENCE.VVMA x0. No bueno.)

JamesKenneyImperas commented 2 years ago

I think this is now clear in all respects, so closing this issue.

riscv / riscv-isa-manual

Clarify effect of HFENCE.GVMA on cached VS-stage translations #793