Reconsider ASID design - Githubissues

nbdd0121 commented 5 years ago

I already posted it to the group but I think it might be better to have an issue here for tracking as well. The current ASID design suggests that ASID are local to each harts, which could severely penalise designs with multi-level TLB designs, shared TLBs, hyper-threading, virtual cache, etc. In the post I already mentioned about disadvantage SBI/IPI/OS-wise about private ASIDs, I would add a few points here.

Also I see the current ASID is under-tested:

Spike has no ASID support and correctly exposes it to software.
QEMU has no ASID support but does not zero-out ASID bits, so software thinks it has 16-bit ASID support. The execution will not be wrong however because QEMU flushes TLB for each SATP write.
As far as I know no taped out chips have ASID support.
RISC-V Linux port has no ASID support.

It would be irrational to ratify the current privileged spec with a feature that's only on the paper, and is not even simulated.

I suggest the following minimal modification to the semantics of ASID. It would require no changes to hardware that does not want to support ASID or uses only single-level TLB, and it requires no changes to software that does not want to utilise ASID.

ASID 0 is local to each hart.
All other ASIDs are globally visible.

aswaterman commented 5 years ago

My main concern is the one I wrote over email, which wasn’t satisfacorily addressed. How do two OSes that don’t trust each other run in the same system, in LPAR-style fashion, isolated with PMPs? What prevents them from using each other’s ASIDs? As specified, this is a security hole, as it allows the OSes to alter each other’s TLB entries. This is not a hypothetical use case for existing RISC-V systems.

I don’t have a clear picture as to the other implications of cross-hart ASIDs. There is something worrisome about incoherent caches being updated by multiple harts, especially given that speculative TLB refills are allowed. I am thinking in particular about side-channel attacks. This is, of course, a gut feeling, not a technical argument.

The VM system design is deliberately conservative and incremental, and to my knowledge, hart-local ASIDs have been the norm until this ARM64 design you mentioned. MIPS also has support for sharing TLB resources between harts in a core, but it is much more limited - it’s not actually global - and it requires a mode switch, so the local option is still provided.

You’re of course right that this idea should be carefully considered prior to ratification. I’ll raise this matter before the Foundation as well.

nbdd0121 commented 5 years ago

You're correct that my proposed change does not consider LPAR use case.

The LPAR use case is indeed an issue that needs to be addressed. Hypothetically it can be addressed by adding a new CSR which contains a mask indicating ASID bits allowed for the supervisor, and preset value for the rest part of ASID. Then the effective ASID can be easily computed as (ASID & mask) ^ value. The default value of "value" could be set to hart ID, so software that does not want to use ASID at all will need to take care.

I understand that the VM design is conservative and incremental, however the current design is conservative to an extent that it might rule out many possible microarchitectures. Just like how RISC-V decides to use weak memory model, while sequentially consistency or TSO is more conservative.

avpatel commented 5 years ago

Even in ARM, the ASID space of one CPU will be exposed to another if we try to partition the system among multiple OSes without using ARM virt extension.

The best way to achieve clean partitioning of a system among multiple OSes is to use RISC-V hypervisor extension with each OSes having its own VMID. The TLBs will be tagged with both VMID and ASID so there will be no overlapping TLB entries possible.

Due above reasons, even Jail House partition manager uses arch support for virtualization.

nbdd0121 commented 5 years ago

Even in ARM, the ASID space of one CPU will be exposed to another if we try to partition the system among multiple OSes without using ARM virt extension.

Thanks for confirming the my understanding of ARM's ASID design.

The best way to achieve clean partitioning of a system among multiple OSes is to use RISC-V hypervisor extension with each OSes having its own VMID. The TLBs will be tagged with both VMID and ASID so there will be no overlapping TLB entries possible.

In RISC-V this actually could be done without VMID. In ARM ASID is either 8-bit or 16-bit, but in RISC-V it can vary, and it is determined by writing 1s into all possible locations and reading back to see which values are valid. Therefore an hypervisor can easily partition ASIDs by trapping SATP accesses and make guest OS think the ASIDLEN is shorter (effectively dividing ASID into a VMID and a guest ASID)

Due above reasons, even Jail House partition manager uses arch support for virtualization.

It seems that LPAR-style partition is one use case they definitely want to support, in which case global ASID space does not work at the moment without other tweaks.

nbdd0121 commented 5 years ago

How do two OSes that don’t trust each other run in the same system, in LPAR-style fashion, isolated with PMPs? What prevents them from using each other’s ASIDs? As specified, this is a security hole, as it allows the OSes to alter each other’s TLB entries.

Just realised that this is actually possible without HS-mode. We can simply set TVM=1 and trap SATP/SFENCE.VMA access, tagging ASID, and then we are done. There're actually no need of shadow page table because the protection is done in PMP already.

aswaterman commented 5 years ago

Yes, that's possible, but trapping those operations reduces performance nontrivially.

@avpatel The fact that RISC-V can do this without hypervisor extensions is a material advantage of RISC-V. People already make use of this.

nbdd0121 commented 5 years ago

Yes, that's possible, but trapping those operations reduces performance nontrivially.

I would doubt that. SATP/SFENCE.VMA are not really performance critical instructions. SATP is only executed on (inter-process) context-switch, which is already an expensive operation, and trapping to M-mode consumes only extra a few cycles. SFENCE.VMA is even rarer, only executed when page is demapped or ASID is reused.

The effect of having treating ASIDs as a global space can non-trivially improve performance, by, e.g. allow multi-level shared TLBs, simultaneous multiprocessing, etc. As RISC-V envisions from microcontroller to super-computer, I can only see the benefits of global ASIDs as it opens much more design choices of virtual memory subsystem up to chip designers.

I hope you agree about the benefit of global ASID space. If you have concerns about the performance drawback of trapping these, I might do some profiling with Linux + real life workload to get out some statistics about frequency of SATP/SFENCE.VMA and related performance impact.

aswaterman commented 5 years ago

It's not "a few extra cycles." After the trap, you need to spill a few registers, decode the trap cause, dispatch to the illegal-instruction trap handler, obtain the instruction word, decode the instruction, re-execute sfence or satp (the easy part), reload the spilled registers, and return. This could easily be as expensive as the register-context switch (both in terms of extra cache misses and instructions executed).

You also need to obtain the rs1 and/or rs2 arguments for sfence and the satp write, which involves a jump table and possibly another I$ miss.

nbdd0121 commented 5 years ago

I still think it's not really an issue. The frequency of SATP/SFENCE.VMA usage is like 100-200Hz, so it isn't really going to impact the performance that much. Most OSes will consider a memory address space switch as an expensive operation. The current Linux's RISC-V port has no ASID support whatsoever, so a context-switch is a complete flush of TLB, which could be much more expensive.

Even if that infrequent usage is really a concern, then we could easily add an extra CSR, e.g. MASIDMAP, which contains a mask and a value to be XOR-ed. If the hardware decides that it does not need global ASID or it does not want to support LPAR, it could simply not implement this CSR. If a hardware wants both it can just implement it. This CSR is very low-cost in terms of hardware resource (just a extra CSR plus an 16-bit wide AND and XOR), but it opens up a huge design space.

aswaterman commented 5 years ago

The SFENCE.VMA usage is far higher than that for some applications, because it's used not only for context switches but also for page-table modifications. For other applications, it is as you say.

I agree the ASID-mask CSR is another rational solution.

For now, I'm going to table this discussion, and will report back in a while with the task group's decision.

avpatel commented 5 years ago

Atleast from Linux perspective, having faster SATP/SFENCE.VMA is important for faster context-switching. Any workload which spawns large number of processes will see a reduction in performance if we trap SATP/SFENCE.VMA.

I agree with @aswaterman that it will very useful to have raw partitioning (i.e. partitioning without use of hypervisor extension) supported in RISC-V.

The ASID-mask CSR looks good to me but please consider role of this CSR in presence of RISC-V hypervisor extension as well. We should keep the VMID in GSATP even if we are going for separate ASID-mask CSR.

nbdd0121 commented 5 years ago

@avpatel I thought about the interaction between this CSR and hypervisor extension. The added CSR will interact properly with the hypervisor extension if VMID is hart-local. The ideal case for hardware devs is that VMID is a global shared space as well - in which case the current CSR proposal is inefficient. So perhaps a MVMIDMAP is going to be necessary?

I can think of two other ways to make this work:

Simply let the additional CSR act as a few extra bits of ASID/VMID that is set in M-mode. So the effective VMID is just {CSR, VMID}, and the effective ASID is {CSR, VMID, ASID}. Again, this CSR can be hard-wired to 0 and is not necessary for any hardware which decides that global ASID or LPAR is not needed.
For hardware that has no H-mode support, just let the additional CSR be the mask on ASID as described. If the hardware has H-mode support, instead let the additional CSR be the mask on VMID (even if H-mode is not used). As the hardware will have entries tagged by both VMID and ASID, this will still provide correct partition.

nbdd0121 commented 5 years ago

After thinking a bit more about this issue, and its relationship with global pages (also I've discovered #350 during the process), HS-mode, I think I've found a good approach to address the issue. I've write it down as a proposal:

Motivation:

Global ASIDs allow huge variety of design decisions, e.g. shared multi-level TLBs (particularly useful for multicore scenario; papers show that by having larger shared TLBs can reduce TLB miss rate by 80%), and virtual cache, etc.
For OSes, hart-local ASIDs make little sense because it adds extra complexity for mapping ASIDs between harts.
SBI-provided sbi_remote_sfence_vma_asid makes sense for shared ASID-space-only. Those who implemented global ASIDs can also provide architecture-dependent method of remote TLB shootdown and provides the feature as SBI, for better performance (because it reduces IPI).

Things to address:

LPAR-style separation does not work with global ASID at the moment.
We don't want software to be forced to use ASID.

Proposed solution:

Specify the ASID to be a global space.
Add an extra CSR, MASIDI (Machine ASID Isolation CSR), which is a WARL register
- A hart can share TLB entry with another hart if: MASIDIs are identical, VMIDs are identical, and ASIDs are identical (or if the page is global).
- The number of bits to implement is implementation-defined, let's say it is N
- Unimplemented bits can be hardwired constants, and can be non-zero.
- Least significant bits are implemented first.
- Number of implemented bits can be determined by writing 0 to MASIDI and read back, writing all 1s to MASIDI and read back, and XOR them together.
- An implementation that does not utilise global ASID space can hardwire MASIDI to hart ID (or any unique number), and it requires no further hardware change. This means the extra HW-cost is almost zero.
- An implementation that utilises global ASID space but does not want to support LPAR can hardwire MASIDI to 0. The extra HW-cost is also almost zero.
- Only the implementation that utilises global ASID and wants to support LPAR needs to implement this extra register (and a few extra bits in their TLB).
- Note that this design also allows some harts to share ASID space while some not.
A software that does not want to utilise ASID can set ASID to hart ID when setting SATP, which will always be correct.
An M-mode code that does not want to do LPAR can simply ignore about MASIDI.
An M-mode code that does LPAR can use MASIDI to provide ASID isolation.

aswaterman commented 5 years ago

As with previous arguments, this presupposes that global ASIDs are Pareto-superior, and understates the value of hart-private ASIDs. In systems without hardware TLB shootdown support, which RISC-V currently is--and will remain, for many systems--the cost of shootdowns is a dominant effect. Private ASIDs give software more flexibility to map them in a way that it can aggressively filter who needs to be shot down. (Yes, this might mean not using the SBI calls. That's fine; an OS is welcome to handle shootdowns itself.)

This effect is significantly reduced with hardware shootdown support, because shootdowns become fairly unintrusive to the other harts. Broadcast hardware shootdown support obviously relies upon ASIDs having consistent meaning across harts. Once you've gone down that road, I agree it makes sense to take advantage of global ASIDs in the form of shared translation caching.

In other words, broadcast hardware shootdown, global ASIDs, and shared translation caches work well together, and are not as effective individually. So, our current thinking is that we should propose an extension--not a change to this specification--which will address these issues in tandem. We can do this in an opt-in, compatible way.

aswaterman commented 5 years ago

By the way, I think your MASIDI proposal is solid and should be part of that extension.

nbdd0121 commented 5 years ago

@aswaterman I think the default should be other way around.

As I previously argued there are many cases which opt-in isn't realistic - e.g. many-core or virtual cache scenario. For virtual cache coherence cases ASID must have global semantics or the whole thing breaks. For many-core scenario they each can only hold very tiny TLBs due to limited energy budget and space, so for reasonable performance multi-level TLBs are needed.

I understand that in some cases we might want more flexible, hart-local ASIDs. But Linux isn't one of them. When adding ASID support to Linux I realised hart-local ASIDs are very expensive to manage, so I ended up sticking with a global ASID allocation even though the current semantics are hart-local. (To be more technical, if we use hart-local ASIDs then each hart needs its own ASID allocation table, and each mm_struct needs one ASID field per each CPU, where in global ASID allocation we only need one ASID allocation table and a single ASID field, and we can use Linux-builtin cpumask to track active CPUs and also SBIs calls for remote TLB invalidation).

Global ASID semantics is less restrictive for hardware and more restrictive for software - this is similar to the reason why we prefer weak memory models to TSO. In other words, private ASIDs should be an extension (support of which can be clearly indicated by hardwiring MASIDI). In those systems where global ASID space is a must, we can't really have an extension to remove hart-local ASID support requirement.

aswaterman commented 5 years ago

We can place a restriction in the specification on how software must use ASIDs to guarantee forward compatibility with future standards. Something along the lines of what you suggested is fine: for forward compatibility, OSes must only use nonzero ASIDs to mean the same thing across all harts.

This does not require specifying the new mechanisms, semantics for virtual caches, etc., and avoids changing the current hardware spec, but allows for those to be specified in the future while supporting existing OS. This allows us to have it both ways.

nbdd0121 commented 5 years ago

I agree that we can impose restriction on software first and leave the MASIDI thing as a potential future extension. This is perhaps better than abruptly introducing new CSRs before ratification. There are still some issues though:

Now all spec-compliant OS will work on a hardware with global ASID space, but can we claim such hardware is complaint to the spec?
Do we want to make ASID 0 the special one, or shall we treat all ASIDs as being equal? Shall the restriction be changed to "OSes must use ASIDs to mean the same thing across all harts provided that ASID is supported" (so they set ASID fields to hart ID for correctness and future-proofing)?

aswaterman commented 5 years ago

OK, I think we're making some progress.

1) Not just yet, but soon. We are not yet specifying that ASIDs are global; we are stating a constraint on software so that a future standard or standard extension can do so without breaking software compatibility. We should work on this new specification soon after ratification, so that such implementations can soon be considered compliant with respect to version 1.12 of the spec.

2) I'm sympathetic to avoiding corner cases, but it's more important to continue supporting existing OSes that use the "no-ASID" model on future hardware. As it happens, they all use ASID 0 in this case.

ingallsj commented 5 years ago

Even in ARM, the ASID space of one CPU will be exposed to another if we try to partition the system among multiple OSes without using ARM virt extension.

Tangential clarification: the description of ARMv8.2-TTCNP suggests otherwise: that the ASID space is local to each hart when TTBRn_ELx.CnP=0, which is the default value prior to the ARMv8.2-TTCNP extension.

nbdd0121 commented 5 years ago

Tangential clarification: the description of ARMv8.2-TTCNP suggests otherwise: that the ASID space is local to each hart when TTBRn_ELx.CnP=0, which is the default value prior to the ARMv8.2-TTCNP extension.

@johningalls-sifive Thanks for pointing out. It seems that prior to ARMv8.2 , ASID space is all local however after ARMv8.2 they make TTCNP mandatory, and their firmware also enable CnP if TTCNP is detected. My original understanding of ARM's ASID model comes from reading their Linux source code, so perhaps they're assuming an ARMv8.2 in source code already. See post below

dkruckemyer-ventana commented 5 years ago

Though in maddening ARM documentation fashion, the ARM ARM also states (Section D5.9.1 definition for ASID):

"For a symmetric multiprocessor cluster where a single operating system is running on the set of processing elements, the ARM architecture requires all ASID values to be assigned uniquely within any single Inner Shareable domain. In other words, each ASID value must have the same meaning to all processing elements in the system."

I expect this definition was mostly by convention to support hardware TLB maintenance (for both processors and SMMUs), i.e. ASIDs are not local prior to ARMv8.2. The ARMv8.2 extension seems to add more formal, differentiated behavior between when ASIDs can be treated as local vs. global, and the default simply does not break existing software. (Not quite sure, ARMv8.2 was after my time on ARM....)

aswaterman commented 5 years ago

Thanks for raising this issue @nbdd0121, and to everyone else who has chimed in. The privspec group has decided to levy a forward-compatibility constraint on software so that a future ISA extension can address this matter. We've written the constraint in such a way that avoids mandating, but leaves open the possibility of, ASID 0 being a special hart-local ASID.

riscv / riscv-isa-manual

Reconsider ASID design #348