riscv / riscv-CMOs

https://jira.riscv.org/browse/RVG-59

FIRST BIG QUESTION: Address Range vs per-Cache-Line CMO instructions #9

Open AndyGlew opened 3 years ago

AndyGlew commented 3 years ago

In my opinion, this is the first big question that the CMO group needs to answer. Top priority because (a) it is a big source of disagreement, (b) the J-extension I/D proposal by Derek Williams wants to follow the CMO group, and (c) the decision has big implications for code portability, for legacy compatibility with RISC-V systems already built, and for building systems where the CPU IP is developed independently of the bus IP and external cache and memory IP - i.e. for system "mash-ups".

Should we provide traditional RISC cache-line-at-a-time instructions, like POWER's DCBF, DCBI, DCBZ, ...? Not just RISCs - CISCs like x86 have them too, e.g. CLFLUSH.

Basically, of the form CMO <memory address>. However, probably not of the form CMO rs1, Mem[rs2+imm12], because such 2reg+imm formats are quite expensive. If we were to do per-cache-line operations, they would probably be of the form CMO rs1:cacheline_address.

Or should we provide "address range" CMO operations?

The draft proposal (by me, Andy Glew - TBD link here) contains a proposal for address range CMOs. Actually, it is a proposal for an instruction that can be implemented in several different ways, as described below. This CMO.AR.* instruction (AR = address range) is intended to be used in a loop that looks like

   x1 := start_address_of_range
   x2 := end_address_of_range
loop:
   x1 := CMO.AR x1, x2
   BNE x1, x2, loop

(This is just an example, although IMHO the best. Other issues will discuss details like [start,end] vs [start,end) vs [start,start+length) vs ... But many if not most of the address range proposals have a loop like the above, varying in minor details like BNE vs BLT vs ...)
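
To make the usage pattern concrete, here is the same loop as a minimal C sketch. `cmo_ar` is a hypothetical intrinsic standing in for the proposed instruction, not part of the draft encoding; a per-cache-line model of it is sketched under implementation (1) below.

    #include <stdint.h>

    /* Hypothetical intrinsic standing in for CMO.AR: applies the CMO to
     * at least one cache line starting at 'start' (possibly the whole
     * range at once, in FSM implementations) and returns the address to
     * resume from; returns 'end' when the range is done. */
    uintptr_t cmo_ar(uintptr_t start, uintptr_t end);

    /* The software loop from the example above, written in C. */
    void cmo_range(uintptr_t start, uintptr_t end)
    {
        uintptr_t p = start;
        while (p != end)               /* plays the role of the BNE */
            p = cmo_ar(p, end);
    }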

It can be implemented in different ways:

(1) per-cache-line implementations, i.e. the traditional RISC way,

rs1 contains an address in the cache line to be invalidated; an address in the next cache line is returned in rd. (My proposal requires rs1=rd, in order to be restartable after exceptions like page-faults without requiring OS modifications, but that can be tweaked.)
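
A minimal C model of this per-cache-line behavior, assuming a 64-byte line and a hypothetical `flush_line` primitive (names and sizes are illustrative, not part of the proposal):

    #include <stdint.h>

    #define CACHE_LINE 64u                 /* assumed line size, illustrative */

    void flush_line(uintptr_t line_addr);  /* hypothetical per-line CMO */

    /* Per-cache-line model of CMO.AR (implementation (1)): operate on the
     * line containing 'start', then return the address of the next line,
     * clamped to 'end' so the caller's loop terminates. With rs1=rd, the
     * caller writes this return value back into its cursor register. */
    uintptr_t cmo_ar(uintptr_t start, uintptr_t end)
    {
        uintptr_t line = start & ~(uintptr_t)(CACHE_LINE - 1);
        flush_line(line);
        uintptr_t next = line + CACHE_LINE;
        return next < end ? next : end;
    }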

(2) trap to M-mode, so that the instruction can be emulated on systems where idiosyncratic MMIOs and CSRs invalidate caches that the CPU IP is not aware of;

KEY: the M-mode software can perform the CMO over the entire address range in a single trap, with much lower overhead than if it had to trap on every cache line or DCBF-style block.
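
As a sketch of that, assuming [start,end) semantics, a 64-byte line, and a made-up MMIO flush register (purely illustrative):

    #include <stdint.h>

    #define CACHE_LINE 64u                    /* assumed line size */

    /* Hypothetical platform-specific per-line flush via an MMIO register;
     * only M-mode needs to know this idiosyncratic interface. */
    static void platform_flush_line(uintptr_t addr)
    {
        volatile uintptr_t *flush_reg = (volatile uintptr_t *)0x02001000; /* made-up address */
        *flush_reg = addr;
    }

    /* One trap emulates the whole range; the handler then sets the
     * destination register to 'end' and advances past the instruction,
     * so the user-level loop exits on its next iteration. */
    void mmode_emulate_cmo_ar(uintptr_t start, uintptr_t end)
    {
        for (uintptr_t p = start & ~(uintptr_t)(CACHE_LINE - 1); p < end; p += CACHE_LINE)
            platform_flush_line(p);
    }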

(3) using state machines and block invalidations, i.e., using microarchitecture techniques that may be more efficient than a cache line at a time.

These can apply the CMO to the entire address region; but if they encounter something like a page-fault, they stop so the OS can handle it. I.e., they are restartable.


It is not the purpose of this issue to discuss all of the details about which register operand encodes which values, or whether the loop-closing test should be a BNE or a BLT, or whether the end address should be inclusive or exclusive. Those undoubtedly will be subsequent issues.

This issue is mainly for the overall question: should RISC-V CMOs be traditional per-cache-line operations, or should they be address ranges, using the approach above that allows per-cache-line implementations?

brucehoult commented 3 years ago

I totally agree with the basic principle of specifying the full address range you want to work on, and the hardware telling you how much of it it actually did.

Trapping if there is a problem is of course one option, but I'd like to see a way to allow the software loop to learn there is a problem and handle it. I've made a suggestion for this in #8 and on the mailing list.

strikerdw commented 3 years ago

I have a proposal (coming later this week) that runs pretty much counter to most of what is said and suggested here (it's 89 slides long and goes into great detail).

It'll be out sometime this week.

And the answer to the mainline question is that the baseline CMO operations (I call them Cache Block Operations, CBOs -- with emphasis on Block) should be one cache line at a time.

Where it's safe to do so and well defined, we can create something I call an MRO which has arbitrary byte ranges on it. I'd implement these as an instruction that always traps and runs a sequence of CBOs or stores. More details to come.

Derek

ingallsj commented 3 years ago

I wonder whether specifying address ranges and returning the number of bytes affected is too CISC-y. Some implementations may want it, but single-cache-block operations will get most of the implementations most of the way there.

Requiring rs1=rd seems like a waste of opcode space.

I look forward to @strikerdw 's write-up!

brucehoult commented 3 years ago

On Tue, Sep 15, 2020 at 6:28 PM John Ingalls notifications@github.com wrote:

> I wonder whether specifying address ranges and returning the number of bytes affected is too CISC-y.

I can't see how it's CISCy if it takes 1 clock cycle and affects 1 cache line. Arbitrarily complex 2R1W register-to-register logic is fine for a RISC instruction, as long as it's combinatorial, not sequential.

> Requiring rs1=rd seems like a waste of opcode space.

Maybe. It would mean that 97% of the room in that opcode space would be available for other instructions that never want rs1=rd.

ingallsj commented 3 years ago

> Arbitrarily complex 2R1W register-to-register logic is fine for a RISC instruction

That's fine for arithmetic and in-order processors which share the same register file for all instructions, but comes with added costs or complexity for memory instructions in larger designs. To illustrate, let's consider the BOOM core [1] (details are different in other pipeline arrangements / proprietary microarchitectures, but the themes still apply). BOOM has one physical register file read port [2] per load/store address calculation (x2 issue) [3]. A straightforward implementation of this new register+register memory addressing mode in BOOM would add another read port to the physical register file and another fanout to the data forwarding bypass network. Granted, adding this functionality is straightforward, and the larger cost is only imposed on larger designs, i.e. the cost is proportional to the overall size, and there are trade-off techniques to bring the cost down but those add more complexity. Again, this isn't free, and it doesn't "Just Work (TM)" with existing plumbing like fixed-block-size CMOs would do.

Do CMOs really need to drive us to introduce a new memory addressing mode (register+register)?

[1] https://github.com/riscv-boom/riscv-boom [2] https://docs.boom-core.org/en/latest/sections/reg-file-bypass-network.html [3] https://docs.boom-core.org/en/latest/sections/load-store-unit.html

brucehoult commented 3 years ago

"Do CMOs really need to drive us to introduce a new memory addressing mode (register+register)?"

I don't think so.

Furthermore, I don't see the connection you're making here.

Specifying a desired address range in two registers isn't register+register addressing. The address used for the cache block operation is the address in the first register.


ingallsj commented 3 years ago

You're correct, Bruce, and I apologize for my imprecision. I'll use the notation "register<=register" instead.

The connection is that the address used for the cache block operation in the first register is compared against the end address in the second register. This may either be done at added cost in the same pipeline as the CMO, or in different pipelines at added complexity cost.

brucehoult commented 3 years ago

I agree with that, and that a comparison of two 64 bit values (or even two 58 or 59 bit values after you mask off the low bits) is expensive.

That's one reason I prefer base+desired length as the arguments, and return achieved length (which is equal to the cache block size in the common case that the base is aligned, the desired size is >= the cache block size, and the machine operates on one cache block per execution of the instruction).

I haven't worked through the exact details, but I think it's not difficult to compute the achieved length in the cases where it is not the cache block size, and for sure it involves only 5 or 6 bit adders, not full address width.

The achieved length is then added to the start address to get the next start address, and subtracted from the desired length - by instructions the user writes (or, more likely, that their library author writes).

Returning the achieved length has the additional advantage that results such as 0 or negative values are available to tell the user's driver code that the hardware didn't do anything.
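
Concretely, a minimal sketch of that driver loop, with `cmo_clean` as a hypothetical stand-in for the base + desired-length instruction described above:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the base + desired-length instruction:
     * returns the achieved length (the cache block size in the common
     * aligned case), or 0 if the hardware did nothing. */
    size_t cmo_clean(uintptr_t base, size_t desired);

    /* Driver loop written by the user (or, more likely, a library author). */
    void clean_range(uintptr_t base, size_t len)
    {
        while (len > 0) {
            size_t done = cmo_clean(base, len);
            if (done == 0)       /* hardware did nothing: report or fall back */
                break;
            base += done;        /* add achieved length to the start address */
            len  -= done;        /* subtract it from the desired length */
        }
    }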

I won't preempt Derek's proposal, which would work also, at slightly lower hardware cost but with quite a few more instructions in the user program loop.

It's a continuum, and hard to say what is best.

The fast interrupts group have chosen to make the hardware just a little more complex in order to reduce the number of instructions and clock cycles to get to the user's handler. The Vector extension uses a "tell the machine the desired length and it tells you the achieved length" mechanism with the user code adding the (scaled) achieved length to all the data pointers and subtracting it from the desired length.


PhilNMcCoy commented 3 years ago

Still working my way through strikerdw's proposal & following discussion, but I wanted to mention this for the record:

There is real demand for the performance gain of using a hardware FSM to flush caches rather than a software loop. I've been involved first-hand in the following implementations in CPU IP products (which were all directly motivated by customer requests):

  • FSM to flush L1 cache, controlled by writing to the equivalent of CSRs
  • FSM to flush L2 cache, controlled by writing to memory-mapped registers
  • FSM to flush L1 cache, controlled by CMO instruction

None of these implementations were standardized per the ISA, and each was specific to a particular CPU implementation. Portable software cannot count on any of the above being available unless it knows exactly which CPU microarchitecture/pipeline it is running on.

I would much prefer to have this standardized so that portable software can use it and get whatever performance benefit a particular CPU implementation has to offer.

I agree that CMO instructions should be defined in a way that cleanly degenerates to a simple operation on one cache line in simple CPU pipelines. I really like the way that a range-based instruction can take just one trap to emulate a range-based operation (via instruction, CSR, memory-mapped) rather than trapping on each line.

Some other misc comments so-far from strikerdw's proposal (sorry if these have been mentioned previously; still catching up on that discussion):

  • I don't agree that DCBZ is necessarily the preferred way to delete ECC errors from caches; I'm not sure it's even desirable to try to define a standard/portable way to do this (what if you have a strange cache microarchitecture, or if the ECC error is from the TLB or branch prediction table or what if ... )

Cheers, Phil McCoy

gfavor commented 3 years ago

On Tue, Sep 22, 2020 at 11:08 AM PhilNMcCoy notifications@github.com wrote:

> There is real demand for the performance gain of using a hardware FSM to flush caches rather than a software loop. I've been involved first-hand in the following implementations in CPU IP products (which were all directly motivated by customer requests):
>
> • FSM to flush L1 cache, controlled by writing to the equivalent of CSRs
> • FSM to flush L2 cache, controlled by writing to memory-mapped registers
> • FSM to flush L1 cache, controlled by CMO instruction
>
> None of these implementations were standardized per the ISA, and each was specific to a particular CPU implementation. Portable software cannot count on any of the above being available unless it knows exactly which CPU microarchitecture/pipeline it is running on.
>
> I would much prefer to have this standardized so that portable software can use it and get whatever performance benefit a particular CPU implementation has to offer.

It seems like the preceding specifically do not want to be address or address-range based operations, but set/way-based operations that flush whatever addresses are found in the cache entries. Which then falls outside of what Derek's slides were trying to focus on.

A common use case for the above is power management (e.g. entering deeper sleep states in which caches need to be completely flushed) - for which a "flush all sets/ways" operation is desired. But this could be done efficiently in a careful software loop (using block set/way operations) by the hart who owns those caches (i.e. the total flush time should be limited more by all the cache flushing activity, than by the loop code itself). Only if some other entity needs to perform this operation would the argument for a hardware FSM be stronger.
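
As a rough illustration of such a loop (a minimal sketch with made-up geometry constants and a hypothetical `cbo_flush_sw` set/way operation - RISC-V defines no such instruction today):

    #include <stdint.h>

    #define SETS 128u   /* made-up cache geometry, for illustration */
    #define WAYS 8u

    void cbo_flush_sw(uint32_t set, uint32_t way); /* hypothetical set/way CMO */

    /* Flush-all by walking every set and way, run by the hart that owns
     * the cache; total time is dominated by writeback traffic, not by
     * this loop itself. */
    void flush_all_sets_ways(void)
    {
        for (uint32_t set = 0; set < SETS; set++)
            for (uint32_t way = 0; way < WAYS; way++)
                cbo_flush_sw(set, way);
    }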

But I'm not really trying to argue against hardware FSM approaches. Instead I would note that (I think) the use cases for "flush all sets/ways" operations are in platform-specific code and hence have a weaker need for ISA standardization. But this really gets into the broader topic of standardizing set/way CMOs - which can be separated from the topic of address-based CMOs.

> I don't agree that DCBZ is necessarily the preferred way to delete ECC errors from caches; I'm not sure it's even desirable to try to define a standard/portable way to do this (what if you have a strange cache microarchitecture, or if the ECC error is from the TLB or branch prediction table or what if ... )

I would have said that this IS the preferred way to remove a poisoned line from the coherence domain. (Note that this isn't simply about cleaning out an ECC error in a cache entry - a set/way operation on that cache would handle that.) The problem is that the only way to "un-poison" a poisoned line floating around the cache/memory hierarchy is to coherently overwrite the entire cache line in some reliable way. Simply doing ISA stores to the entire line is uarch-dependent and even then not necessarily reliable.

Greg

dkruckemyer-ventana commented 3 years ago

It seems the real performance gains to flushing an entire cache (or possibly even a range?) would be obtained from making the operation asynchronous to the rest of instruction execution. For full cache operations, you don't have to worry about address translation (and the possible translation traps that result), so that seems achievable.

In general, though, it's still not clear to me how a range instruction provides a performance benefit, especially if you're tying up cache resources and execution resources to execute it. (Granted, the trap & emulate case goes faster because you're passing the range to the handler once instead of trapping per op, but aren't there cheaper ways of obtaining the same result, e.g. an SBI call? And I'm curious about the kinds of designs where this method provides a real performance advantage, since this style of operation provides no benefit to the types of designs I work on.)

Anyway, I'm looking forward to a robust conversation on this topic in the near future.... :)

billhuffman commented 3 years ago

I agree with Phil McCoy as I also have experience with performance requirements that are not well met by block-at-a-time instructions. The ability to process blocks rapidly in complex cache structures helps performance (when there's an on-chip backing store).

I agree with dkruckemyer-ventana that there is more performance to be gained from flushing in the background, but then some construct needs to be added to know when the flushing is done, which tends to require more than just an instruction. Concepts also need to be developed such as looking up but not allocating cache lines. And if it's for a region only, there is a question of whether accesses to the region are stalled while the flush is ongoing.... Maybe we should approach the concept, but it is a bigger thing.

Another possibility is to do only non-required operations in the background. These might be done using rd=x0 as a marker that the instruction is a hint. If the operation is not required for correctness but as an optimization, it can be done in the background. The x0 result register says that, architecturally, the hart is not required to wait for the result or even to do the operation at all. Cache management for I/O purposes would not usually meet this requirement, but cache management for better performance often would. Simpler implementations do a single cache line - or nothing at all.

  Bill
PhilNMcCoy commented 3 years ago

> I would have said that this IS the preferred way to remove a poisoned line from the coherence domain.

What if the error is in the cache tag RAM - how do you know which address to target with your DCBZ, and how does the hardware know which way (if any) has that address? I don't want to digress into a big debate about ECC - I don't think it's part of the charter for this workgroup anyway and I don't want to delay ratification of useful CMOs. Suffice it to say that we're not in universal agreement that DCBZ is the One True Way to handle ECC.

> In general, though, it's still not clear to me how a range instruction provides a performance benefit

Running the FSM in the background while the CPU gets on with other work is part of it (especially with multithreaded CPUs, L2 caches, etc.). The other part is that you can pipeline the cache RAM accesses more tightly when the control logic knows it is working on a contiguous range of addresses.

Even an unrolled loop of CMO Clean operations would have pipeline stalls. For example, assuming 64B lines, the sequence

   CMO Clean 0
   CMO Clean 64
   CMO Clean 128
   CMO Clean 192

could look like this in the pipeline (each line is a clock cycle):

   Read tag 0
   wait-state
   compare tag 0
   write tag 0
   (probably several cycles of bubbles, depending on uArch pipeline interlocks, etc.)
   Read tag 64
   wait-state
   compare tag 64
   write tag 64
   etc.

With an FSM, it could be something like:

   Read tag 0
   Read tag 64
   Read tag 128 + compare tag 0
   Read tag 192 + compare tag 64
   write tag 0 + compare tag 128
   write tag 64 + compare tag 192
   write tag 128
   write tag 192

If software is trying to flush say 8KB from a 64KB cache, doing an address-range operation with an FSM can be much more efficient than either doing a software loop line by line or doing an FSM-based flush of the entire cache (which will create lots of extra cache misses later).

Cheers, Phil

gfavor commented 3 years ago

On Tue, Sep 22, 2020 at 12:45 PM PhilNMcCoy notifications@github.com wrote:

> Even an unrolled loop of CMO Clean operations would have pipeline stalls (example assuming 64B lines)
>
> CMO Clean 0
> CMO Clean 64
> CMO Clean 128
> CMO Clean 192
>
> ...
>
> If software is trying to flush say 8KB from a 64KB cache, doing an address-range operation with an FSM can be much more efficient than either doing a software loop line by line

Yes, but only if all the resultant writeback traffic isn't the limiter. Whether software or hardware scans through the 8KB address range, the operation causes 8KB's worth of cache line "cleans" (i.e. writebacks of dirty data to memory); unless the CPU and the system can perform a cache-line-sized block transfer every couple of CPU clocks on a sustained basis - including resolving coherency (e.g. doing any needed snoops) for each line being "cleaned" - software versus hardware won't really matter.

> or doing an FSM-based flush of the entire cache (which will create lots of extra cache misses later).

I agree that if you only want to operate across a range of addresses, then you don't want to be using a "whole cache" operation.

As a side note, and taking this 8KB / 64KB example and assuming 8-way set-associativity, I would observe that cleaning 8 KB out of this cache involves scanning through all sets of the cache in all these approaches (e.g. the FSM-based clean of the whole cache would do the same number of cache lookups as an address range-based FSM). Only with greater or lesser associativity would there be a difference (but not a factor of 8x).

Greg

billhuffman commented 3 years ago

Taking the same 8KB/64KB example Greg comments on, if the cache is a 4-way sectored cache, the FSM takes 1/4 the cycles to go through.

  Bill
gfavor commented 3 years ago

On Tue, Sep 22, 2020 at 12:45 PM PhilNMcCoy notifications@github.com wrote:

> > I would have said that this IS the preferred way to remove a poisoned line from the coherence domain.
>
> What if the error is in the cache tag RAM - how do you know which address to target with your DCBZ, and how does the hardware know which way (if any) has that address? I don't want to digress into a big debate about ECC - I don't think it's part of the charter for this workgroup anyway and I don't want to delay ratification of useful CMOs. Suffice it to say that we're not in universal agreement that DCBZ is the One True Way to handle ECC.

I wasn't trying to imply that DCBZ is the appropriate hammer for all ECC nails. I agree that it isn't.

If you have a poisoned line in the cache/memory system (whether it's in your cache or someone else's cache or out in DRAM), that generally means a line with a valid address but corrupted data. For coherently removing the poison on this address, a DCBZ is a very nice hammer (with not much in the way of other good ISA options).

If one has a cache line with a corrupted tag, then one has a much bigger problem since there now is an unknown address in the system that is potentially corrupted (if this line held dirty data). To clean up this cache entry, a DCBZ is useless. What you want is a set/way line invalidate operation. And in any case there is a bigger system-level issue to be dealt with.

I agree that we don't want to digress into a RAS discussion (especially since a new RAS TG will be forming soon). But removing poison from an address in the system is a notable use case for DCBZ (besides the popular block zero'ing of memory use cases), especially since there aren't any other good options in the current ISA or in any currently contemplated arch extensions.

Greg

AndyGlew commented 3 years ago

BTW, I am going to break my rule about trying to have discussions by email and not on the list, and I just want to make two points that are directly relevant:

The performance advantage in using a hardware FSM is NOT the biggest justification for address range.

IMHO one of the biggest reasons is dealing with "idiosyncratic and inconsistent systems", when you have a system that is assembled out of IP blocks from different vendors. Some of the cache IP blocks may not respond to the CMO bus transactions emitted by your CPU. Sometimes the bus bridge between your CPU's preferred bus and the busses used elsewhere in the system bridges ordinary loads and stores, but doesn't bridge CMO bus transactions. Etc. Each cache IP block probably has a mechanism to do CMOs - e.g. MMIOs - but they may be different for different vendors. Heterogeneous multiprocessor systems can be worse - a mixture of CPUs, GPUs, DSPs, multiple of each from different vendors.

We would prefer not to have user or even system code know about such idiosyncrasies.

It is straightforward to trap any CMO instructions to M-mode, and then have M-mode do whatever is needed to deal with your idiosyncratic system. That's the RISC way.

But trapping on every 64B cache line can be really expensive.

Whereas if you have address ranges, you can trap once for the entire range.


Same issue, different point of view: if you are a RISC-V vendor who has already shipped hardware using whatever cache flush mechanism you have - definitely not the RISC-V CMO standard, because that does not exist yet - would it not be nice to be able to ship a software patch so that systems that are already in the field can run new code using the CMO instructions that we will define soon?

The general solution to such compatibility, running new code on old systems, is trap and emulate.

But trap and emulate is slow. Unless you can handle multiple such operations in a single trap.


Put another way, compatibility and performance of the CMOs is one of the motivations for address ranges:

compatibility: interfacing to hardware that does not provide the bus support or the full system support needed to transparently implement the CMO instructions

performance: the performance of trap and emulate that provides that compatibility.

I.e. not the performance of a hardware FSM, but the performance of software emulation.


Companies that build the entire system have the benefit of having ensured that the entire system works together.

However, there are markets that do not have this luxury. I am tempted to say the "embedded" market, but that's not 100% true - there are some embedded product lines that don't have to live with such heterogeneous and idiosyncratic systems. However, there are some that do.

Moreover, even if the vendors of the different IP blocks - CPUs and GPUs and DSPs and caches and bus bridges - are willing to work together to make sure that CPUs' CMO instructions get properly put onto the bus, bridged to other buses, and interpreted by other cache IP blocks, sometimes it takes an extra six months to do so. In some markets, that makes a big difference.


There are other reasons, many of which are on the wiki and/or in the original proposal. But this is one of the biggest reasons.


gfavor commented 3 years ago

Leaving aside my own bias, one question is whether "idiosyncratic and inconsistent systems" are the tail wagging the dog. Is this the atypical design and 95%+ designs don't have this issue, or is this something that a significant fraction of systems have to deal with?

Picking on the point about one CPU needing to do a global CMO that covers other CPUs' non-coherent caches (aka "cache IP blocks that may not respond to the CMO bus transactions emitted by your CPU"), that sounds like an "interesting" system. This would be in contrast to a system where a CPU with a non-coherent cache software-manages its own cache?

Should a big goal of base CMOs be to support encapsulating all the system-specific vagaries of partially-coherent and non-coherent systems in trappable global CMO instructions?

Greg

