AndyGlew opened this issue 4 years ago (status: Open)
Agreed. Iterating over the cache can sometimes be better than iterating over an address range. And this form provides flexibility in implementation.
Manufacturers of cores could if they wish document the encoding scheme from sets and ways or whatever they have into abstract indexes, thus allowing non-portable code to operate on a single way (or whatever).
I'm not a fan of including micro-architecture specific encodings or manufacturer-specific abstractions in the general-purpose ISA.
What is the use case, and what value would make it worthwhile for a manufacturer to make their micro-architecture-specific cache ops (set+way, if that's what they built) fit into an architecture-level abstraction?
Twist: I would be a fan of an "ALL" variant, instead of set/way/uarch-range.
If we're going to approach this, I see two issues that are at a conceptual level above instruction definition.
Second is protection. Instructions could be restricted to M-Mode with delegation capability to S-Mode. Another possibility is to use stores to MMIO space and have MMU/PMP control access, which gives more flexibility over the long run.
Bill
Just like an earlier issue discusses address range CMOs vs per-cache-line CMOs... but this time for operations that are typically used for things like "flush the entire I$ or D$".
Such "cache microarchitecture dependent CMOs" have been done in some earlier processors a cache line at a time, but this is less well established than per-cache-line-address-at-a-time operation. Quite a few RISC processors have "full cache flushes", etc.
First, if operating a cache line at a time, there must be a way of indicating which cache line is involved. Typically this is (set,way), but not all caches have sets and ways. Indeed, it is not really clear what the sets and ways are for something like a skewed associative cache.
But that's okay, we can abstract that as a "cache entry index number", which might be Set*Nways+Way for a traditional set associative cache, or whatever is appropriate.
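As a minimal sketch of the Set*Nways+Way encoding mentioned above, the mapping and its inverse can be modeled in a few lines of Python. The geometry constants and the function names `entry_index` and `set_and_way` are hypothetical, chosen only for illustration; a real core could use any encoding it likes.

```python
# Hypothetical cache geometry, for illustration only (not from any real core).
N_SETS = 64
N_WAYS = 4


def entry_index(set_idx, way, n_ways=N_WAYS):
    """Flatten (set, way) into an abstract cache entry index: Set*Nways+Way."""
    return set_idx * n_ways + way


def set_and_way(index, n_ways=N_WAYS):
    """Inverse mapping: recover (set, way) from the abstract index."""
    return divmod(index, n_ways)


if __name__ == "__main__":
    print(entry_index(3, 2))  # 3*4 + 2 = 14
    print(set_and_way(14))    # (3, 2)
```

Software that only ever counts through indexes never needs the inverse mapping; it matters only to non-portable code that wants to target a particular set or way.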
Then, a per-cache-index loop typically looks like
or
That's the traditional approach.
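The traditional per-cache-index loop described above might be modeled roughly as follows. This is a software sketch under assumptions, not a listing from the proposal: `Line`, `cmo_index`, and the geometry are hypothetical, and write-back is modeled simply by clearing the dirty flag. It shows two loop shapes, one over a flat entry index and one over nested (set, way).

```python
from dataclasses import dataclass

# Hypothetical geometry, for illustration only.
N_SETS, N_WAYS = 4, 2


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False


def make_cache():
    """Build a model cache in which every line is valid and half are dirty."""
    return [Line(valid=True, dirty=(i % 2 == 0)) for i in range(N_SETS * N_WAYS)]


def cmo_index(cache, index):
    """Hypothetical per-entry operation: write back (modeled) and invalidate."""
    line = cache[index]
    line.dirty = False  # stands in for writing dirty data back to memory
    line.valid = False


def flush_flat(cache):
    """Software counts through the abstract entry indexes itself."""
    for index in range(len(cache)):
        cmo_index(cache, index)


def flush_set_way(cache):
    """Same walk, expressed as nested loops over sets and ways."""
    for s in range(N_SETS):
        for w in range(N_WAYS):
            cmo_index(cache, s * N_WAYS + w)
```

In both shapes, software must know the number of entries (or sets and ways) and the increment, which is exactly the microarchitectural knowledge the abstract index is meant to hide.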
The draft proposal (by me, Andy Glew, TBD link here3) defines "microarchitecture range CMOs" that look like
which looks remarkably like the per-cache-index loop
except that, like in the CMO.AR proposal, the next cache index is returned by the CMO.UR instruction.
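To make the difference concrete, here is a minimal software model of a loop in which the operation itself returns the next cache index. The function `cmo_ur`, the `DONE` sentinel, and the termination convention are assumptions of this sketch; the actual encoding and stopping condition belong to the CMO.UR proposal and are not specified in this thread.

```python
# Hypothetical sentinel meaning "no more entries"; the real convention
# is defined by the CMO.UR proposal, not by this sketch.
DONE = -1


def cmo_ur(valid, index):
    """Model one CMO.UR step: handle entry `index`, return the next index.

    `valid` is a list of valid bits standing in for the cache
    (assumed non-empty).
    """
    valid[index] = False  # invalidate this entry
    nxt = index + 1
    return nxt if nxt < len(valid) else DONE


def invalidate_all(valid):
    """Loop until the operation reports completion; return the step count."""
    index = 0
    steps = 0
    while index != DONE:
        index = cmo_ur(valid, index)  # next index comes from the operation
        steps += 1
    return steps
```

The loop never computes `index + 1` itself, so software does not need to know how many entries exist or how far one step advances.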
This allows several implementations:
(1) per (set,way) cache line at a time - traditional
(2) trap to M-mode, done efficiently, with less overhead than trapping once per cache line
(3) state machines that iterate over the entire cache, e.g. for EVICT, to write out dirty data
also (3.1) non-state machine implementations, as in bulk invalidations that set all valid bits to 0 as a single operation.
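The implementation freedom listed above can be illustrated with a small model in which one unchanged software loop runs against three different step functions. Everything here is hypothetical (the `DONE` sentinel, the function names, the chunk size): `step_per_line` stands in for option (1), `make_step_chunked` loosely stands in for a trap handler or state machine that covers several entries per invocation, and `step_bulk` stands in for option (3.1).

```python
DONE = -1  # hypothetical "finished" sentinel, an assumption of this sketch


def step_per_line(valid, index):
    """Option (1): one entry per executed operation."""
    valid[index] = False
    nxt = index + 1
    return nxt if nxt < len(valid) else DONE


def make_step_chunked(chunk):
    """Several entries per invocation, e.g. a trap handler or state machine."""
    def step(valid, index):
        end = min(index + chunk, len(valid))
        for i in range(index, end):
            valid[i] = False
        return end if end < len(valid) else DONE
    return step


def step_bulk(valid, index):
    """Option (3.1): clear every remaining valid bit in a single operation."""
    valid[index:] = [False] * (len(valid) - index)
    return DONE


def run(step, n_entries=16):
    """The same software loop for every implementation; returns executions."""
    valid = [True] * n_entries
    index, executions = 0, 0
    while index != DONE:
        index = step(valid, index)
        executions += 1
    assert not any(valid)  # every entry ends up invalid in all cases
    return executions


if __name__ == "__main__":
    print(run(step_per_line))         # 16
    print(run(make_step_chunked(4)))  # 4
    print(run(step_bulk))             # 1
```

Only the number of loop iterations changes between implementations; the software itself is identical.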
I mark this as a SECONDARY QUESTION:
in the title, because I want it to be blaringly obvious
also because I am in a hurry, and will apply this issue tracker's priority scheme later
but mainly because I think there will be less discussion about this CMO.UR cache index range than there will be for the CMO.AR address range instruction.
since there are already quite a few implementations that are "full cache invalidations", and we want RISC-V to support such hardware when it is available.
--
again, this issue is not for the details of the CMO.UR. It is mostly for the idea of a microarchitecture or cache index range.