riscv / riscv-CMOs

https://jira.riscv.org/browse/RVG-59
Creative Commons Attribution 4.0 International

FOURTH QUESTION: How do we specify which parts of the memory/cache hierarchy are affected by CMOs? #12

Open AndyGlew opened 3 years ago

AndyGlew commented 3 years ago

This fourth question is just to complete the set of questions in rough priority order: (1) address range vs per-cache-block-address? (2) microarchitecture cache index like (set,way) vs ???; (3) what CMO operations (CLEAN! EVICT! DISCARD! ZALLOC? ...); and now this issue: which caches are affected? Or other parts of the memory subsystem?

E.g. do we flush/evict/clean data from the L1 D$? The L2 D$? The shared L3$? The L4$ ...

To what level... all the way to DRAM? To NVRAM? To Iron Mountain?

Do we prefetch into the L1 D$, filling up all the other caches along the way from DRAM? Can we prefetch into the L2$ but not the L1$, to avoid thrashing? Some architectures have a small prefetch cache (e.g. the HP Assist Cache) that exists to reduce thrashing. Can we prefetch into the L1$ without putting the data into the L2$? Exactly what does "non-temporal prefetch" mean, anyway?
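For comparison, mainstream toolchains already dodge the "which level" question for prefetch by taking a temporal-locality hint instead of a cache-level name. A small sketch using GCC/Clang's __builtin_prefetch (the array names and the +16 prefetch distance are arbitrary, just for illustration):

```c
/* The third argument of __builtin_prefetch is a locality hint (0..3), not a
 * cache level: 3 means "expect heavy reuse, keep it close", 0 means
 * "non-temporal, streamed once". The implementation decides which caches
 * that actually touches. */
void scale(const double *reused, const double *streamed, double *out, long n)
{
    for (long i = 0; i < n; i++) {
        __builtin_prefetch(&reused[i + 16],   /*rw=*/0, /*locality=*/3);
        __builtin_prefetch(&streamed[i + 16], /*rw=*/0, /*locality=*/0);
        out[i] = reused[i] * streamed[i];
    }
}
```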

Can we prefetch into other processor caches?

How can any of this be portable code? If you say L1$, I (as a microarchitect in a former life) will propose an L0$, or an L0.5$, or an L1.5$ between the L1$ and the L2$.


While it is fun to write these questions with lots of ? and !, I don't mean to imply that this is a hard problem.

Cache microarchitectures change all the time. Any notion of cache level is quickly obsolete.

But we can enumerate use cases, and for any particular use case, on any given implementation, we can describe what caches need to be invalidated.

E.g. for non-coherent DMA I/O that writes into DRAM without changing the caches, we will need (a) EVICT operations to flush dirty data to DRAM before the DMA, or maybe DISCARD operations for some embedded systems like network devices; and (b) some sort of operation to flush stale data out of the caches after the DMA (whether that operation is another DISCARD, or EVICT, or possibly a new operation that I realized was better than EVICT but safer than DISCARD, which I don't have a name for yet).
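As a sketch of that non-coherent DMA sequence, assuming hypothetical cmo_clean()/cmo_invalidate() helpers and a fixed 64-byte block size (all of these names are placeholders; the real operations and discovery mechanism are exactly what this group is deciding):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK 64                   /* assumed; real code would discover this */

extern void cmo_clean(void *addr);       /* hypothetical: write back a dirty copy */
extern void cmo_invalidate(void *addr);  /* hypothetical: drop any (stale) copy   */

/* (a) before the device reads DRAM: push dirty data out of the caches */
void dma_prepare_outbound(void *buf, size_t len)
{
    for (uintptr_t p = (uintptr_t)buf; p < (uintptr_t)buf + len; p += CACHE_BLOCK)
        cmo_clean((void *)p);
    /* fence, then start the DMA ... */
}

/* (b) after the device has written DRAM: remove stale copies from the caches */
void dma_complete_inbound(void *buf, size_t len)
{
    for (uintptr_t p = (uintptr_t)buf; p < (uintptr_t)buf + len; p += CACHE_BLOCK)
        cmo_invalidate((void *)p);
}
```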

But forget about the EVICT/DISCARD/CLEAN/... operations - that is the earlier issue, THIRD PRIORITY question.

For non-coherent DMA I/O that updates memory without updating the caches, the CMOs take data out of any part of the cache hierarchy and flush it to DRAM.

But some I/O devices can inject data directly into the cache. If they do that coherently, great. However, this may be happening in a system where processors have L1$s that are not coherent with the L2$, in which case the CMOs operate FROM caches that are closer than the L2$, and TO the L2$. (Although there is no harm in flushing all the way to DRAM, except for performance.)

E.g. some systems do software-managed coherency. I am most used to nodes that might have CPUs, each with coherent D1$ and I1$, a coherent L2$, and possibly an L3$ that is coherent amongst the 4 to 8 CPUs in a node. However, there are many such nodes in a system, sharing DRAM; and often the DRAM has its own caches on the other side of the interconnect. In such a system you can use coherent shared memory techniques between processors in the same node without requiring CMOs, but between the non-coherent nodes you will require CMOs.

E.g. I was recently educated about embedded systems where CPUs have private I$ and D$ and share an L2$ incoherently, but there may be multiple such nodes with L2$s that are themselves incoherent. So producer/consumer between processors sharing the same L2 needs to flush to the L2; but producer/consumer between processors that do not share the same L2 needs to flush to whatever is beyond the L2, like DRAM.
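To make the scope difference concrete, here is a sketch in which the producer picks "how far" to clean based on where the consumer sits; the scope names and the cmo_clean_to() helper are invented purely for illustration:

```c
#include <stddef.h>

/* Hypothetical scopes: clean only as far as the L2 shared with the consumer,
 * or all the way to DRAM when the consumer is in another (incoherent) node. */
enum cmo_scope { CMO_TO_SHARED_L2, CMO_TO_DRAM };

extern void cmo_clean_to(void *addr, size_t len, enum cmo_scope scope); /* hypothetical */

void publish(void *buf, size_t len, int consumer_shares_my_l2)
{
    /* ... producer fills buf ... */
    cmo_clean_to(buf, len, consumer_shares_my_l2 ? CMO_TO_SHARED_L2 : CMO_TO_DRAM);
    /* ... then notify the consumer (flag, interrupt, etc.) ... */
}
```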

Lots of people will jump up and down at this and say "to the point of coherence" ... all that I wish to point out is that there are different points of coherence for processors within the same node and between nodes. I.e. the point of coherence is a function of the processors and caches involved.

E.g. hot unplug, and sending a processor to sleep, require CMOs that flush all caches on that processor to whatever maintains state - e.g. battery-powered laptop DRAM. You may not need to flush caches on processors that are not being sent to sleep or unplugged. But for non-coherent I/O or software-managed coherence you may need to flush all caches to which data may have migrated, or between which processes might have migrated. I.e. sometimes CMOs need to be local, sometimes global, sometimes of intermediate scope.

E.g. flushing to non-volatile storage for persistence, so that even the battery can be removed. Versus flushing to all RAID copies of non-volatile storage, or even to remote systems, like Tandem Non-stop.

I.e. how far you need to flush - to DRAM or beyond - may depend on whether you want to survive (a) removing power to a single CPU, (b) removing power to an entire system, (c) loss of one of the NVRAM units, or (d) loss of redundant systems on an entire continent (thermonuclear war, or perhaps just a sub-oceanic cable).

E.g. in security, for timing channels, you typically need to flush whatever cache levels are involved - both data you have modified and data that might belong to others - either all the way to whatever level you have partitioned storage/caches/memory, or all the way to whatever level you think the timing channels will be of low enough bandwidth that you can tolerate them. (Gernot Heiser implemented seL4, which required a full L1$ EVICT, but where the L2$ was partitioned by page coloring, so these security flushes only needed to go to the L2. However, many systems do not know how to partition the L2, so they would need to flush all the way to DRAM.)

But timing channel security differs from most of the other uses of CMOs in that it requires not just the caches to be EVICTed and left invalid, but also any branch predictors, prefetchers, etc. I.e. the domain of security-related flushes increases. There will be separate issues as to whether this should be included in CMOs or not.

E.g. for security, for remanence (preventing physical attacks on storage, like using liquid nitrogen to freeze the DRAM of a cell phone that you have acquired) ...

OK, enough... I am getting punchy. I just wanted to provide examples.


One more example: PERFORMANCE PROGRAMMERS, people who are doing low-level performance tuning on systems that range from embedded IoT to HPC, may want extremely precise control. They will argue that they would like to be able to prefetch data into, or flush data out of, any microarchitectural cache.

All of the other usage models are more abstract. PERFORMANCE PROGRAMMERS often want to bypass such abstractions.

Now, there is some possibility of abstraction even for performance programmers. E.g. the performance programmer may want to flush to the point of coherence between producer and consumer, or to flush to the first cache bigger than a particular data structure.

It is TBD whether we, the RISC-V CMOs task group, will want to provide all of the control that performance programmers would like. (GLEW OPINION: no, except via OS and tech-config discovery interaction, and a level of indirection in the CMOs.)

Many people argue that performance programmers usually fail; that they think they can do a better job than hardware, but they often waste time and don't achieve much performance. While this may be true, there are counterexamples. For a small company, a big customer win that is achieved only by a little bit of performance programming may mean the difference between success and bankruptcy.

===

APOLOGIES: I thought this was going to be a short issue; instead I wrote more than I did in the more important issues.

Anyway, I expect that people will talk in terms of things like ARM's Points of Coherence and Points of Persistence.

While these are great simplifications, as I have several times tried to point out, they are not a complete set. There are systems that may have different or multiple layers of points of coherence. The question is: what is a sufficiently good set for RISC-V's markets?

Some people say that we are creating an abstract memory/cache hierarchy. I think this is bogus: we are creating a simplified or restricted abstract memory/cache hierarchy. What good is an abstraction if it fails the moment you start doing anything new and interesting? However, such a simple or restricted abstract model is good enough for many systems. I think that it is possible to create a truly abstract model, but that it would require interaction between the operating system and configuration/discovery, and a level of indirection in specifying the scope of CMOs that many people are reluctant to provide.

--

OK, I have finally finished the first draft of this issue.

I just wanted to provide a placeholder for how we specify the scope of CMOs, to complete the quad-fecta of what I see as the most important top-level issues for CMO instruction set design.

I apologize for this verbal diarrhea and getting carried away. I apologize to anybody that I have insulted by saying that their abstract model is bogus.

Much of the text written above exists in other places, and probably shouldn't be in this issue; and I will probably remove it and provide references.

The verbosity is in part due to the lack of good terminology for this issue, and due to my rejection of the common terminology (e.g. Intel's, ARM's) of one or a few points of coherence as simplistic and unsatisfactory, although undoubtedly good enough for many systems.

brucehoult commented 3 years ago

There are so many possibilities here that you'd be crazy to try to encode them into individual CMO instructions. So ... some kind of CSR or other state that can be set and persists over the CMO loop.

Is a HART ID (or set of IDs) an appropriate thing to use to specify the point of coherence? Or even to specify that you want to push something to them? That seems to cover a lot of the cases, but not all.
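A minimal sketch of that "persistent state" idea, with every name invented (a hypothetical scope CSR written once, then inherited by an ordinary per-block CMO loop):

```c
#include <stddef.h>

extern void csr_write_cmo_scope(unsigned long scope); /* hypothetical CSR write    */
extern void cmo_clean(void *addr);                    /* hypothetical per-block CMO */

void clean_range_with_scope(void *buf, size_t len, unsigned long scope)
{
    csr_write_cmo_scope(scope);            /* set once: point of coherence, hart set, ... */
    for (size_t off = 0; off < len; off += 64)
        cmo_clean((char *)buf + off);      /* every CMO in the loop uses that scope */
}
```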

dkruckemyer-ventana commented 3 years ago

I propose devising an abstract cache model on which the CMOs operate. Without such, it's harder to come up with a consistent set of definitions.

The abstract model can be as simple or as complex as we want, with the corresponding complexity and expressibility in the instructions.

A super simple model compresses all the hierarchy into a single cache level, basically hart -> cache -> memory, where each hart has a cache and every hart shares memory. This captures the notion that we are working with copies of data distinct from the actual memory location. A related set of instructions operates on the "copies" in the caches, e.g. EVICT moves all copies from the cache to memory. I don't pretend this is a complete model, but hopefully it illustrates what I mean.
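As a toy, executable rendering of that "super simple" model (one cache per hart, shared memory, EVICT moves a location's copy from the cache to memory), purely illustrative rather than a proposal:

```c
#include <stdbool.h>
#include <stdint.h>

#define NLINES 8

struct line  { bool valid, dirty; uintptr_t addr; uint64_t data; };
struct cache { struct line lines[NLINES]; };              /* one per hart */

extern void memory_write(uintptr_t addr, uint64_t data);  /* the shared memory */

/* EVICT(addr): any copy in this hart's cache is written back and dropped. */
void evict(struct cache *c, uintptr_t addr)
{
    for (int i = 0; i < NLINES; i++) {
        struct line *l = &c->lines[i];
        if (l->valid && l->addr == addr) {
            if (l->dirty)
                memory_write(l->addr, l->data);  /* the copy moves to memory */
            l->valid = false;                    /* and leaves the cache     */
        }
    }
}
```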

dkruckemyer-ventana commented 3 years ago

@AndyGlew re: abstract models.... Not offended. Still going to push for one... :)

re: points of coherence.... There are "points of serialization" where request streams from different harts merge. I think this is the fundamental definition of levels, whether a cache exists there or not. In ARM's terminology, the PoC is the ultimate point of serialization. But even if we agree on definitions, discovering them and expressing them generically is challenging.

(FWIW, my super simple model consists of a single PoS/PoC at memory.)