Closed AndyGlew closed 2 years ago
You need examples of what the code that uses this would look like in both cases. So, to be clear, instead of PREFETCH_ORDINARY rd, lower, upper, where rd indicates where the prefetch finished (or the first unprefetched byte...), which needs to be put into a loop:
LA lower, lower_Addr
LA upper, upper_Addr
loop:
CMO.VAR.PREFETCH-X lower, lower, upper
slt notdone, lower, upper
bnez notdone, loop
You want
LA lower, lower_Addr
LA upper, upper_Addr
loop:
PREFETCH_ORDINARY 0(lower)
PREFETCH_ORDINARY 64(lower)
PREFETCH_ORDINARY 128(lower)
PREFETCH_ORDINARY 192(lower)
addi lower, lower, 256 // four 64B lines were prefetched per iteration
slt notdone, lower, upper
bnez notdone, loop
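The difference between the two loops can be modeled in a few lines of Python. This is only a sketch of the semantics as described above, not of any ratified instruction: `range_prefetch`, `LINE_BYTES`, and the `max_lines` early-stop limit are hypothetical names and assumptions introduced for illustration.

```python
LINE_BYTES = 64  # assumed cache line size, matching the 64-byte stride above


def range_prefetch(lower, upper, max_lines, prefetched):
    """Model one range-prefetch instruction over [lower, upper).

    Prefetches at most max_lines cache lines (an assumed implementation
    limit) and returns the first unprefetched address, capped at upper.
    """
    addr = lower
    for _ in range(max_lines):
        if addr >= upper:
            break
        prefetched.add(addr // LINE_BYTES)
        addr = (addr // LINE_BYTES + 1) * LINE_BYTES  # start of next line
    return min(addr, upper)


def range_loop(lower, upper, max_lines=2):
    """Retry the range prefetch until the whole range is covered."""
    prefetched, instructions = set(), 0
    while lower < upper:
        lower = range_prefetch(lower, upper, max_lines, prefetched)
        instructions += 1
    return prefetched, instructions


def fixed_block_loop(lower, upper, unroll=4):
    """Unrolled loop of fixed-size (one cache line) prefetches."""
    prefetched, instructions = set(), 0
    while lower < upper:
        for i in range(unroll):
            prefetched.add((lower + i * LINE_BYTES) // LINE_BYTES)
            instructions += 1
        lower += unroll * LINE_BYTES
    return prefetched, instructions
```

In this model both loops cover the same lines when the range is a multiple of the unrolled stride. When it is not, the unrolled fixed-block loop touches lines past `upper`, while the range form stops exactly at `upper`.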
Imagine a very long straight-line loop:
loop:
inst1
...
instN
if ?? goto loop
Where the size of the loop body is very large, exceeding the size of a typical L0 or L1 I-cache.
(IIRC there was such a benchmark in an earlier version of SPEC.)
Add prefetches for code as below.
If you wish, start off by loading up the first I-cache worth of the loop
// range based prefetch
// using the [lwb,upb) approach
// similar if [lwb,lwb+count) or the like
x1 := top_of_loop
x2 := top_of_loop + ICACHE_SIZE // usually not full size, half or similar
address_range_loop:
prefetch-X.var x1,x1,x2
bne x1,x2, address_range_loop
top_of_loop:
inst1
inst2
...
inst16
// first 64B cache line of code done.
// prefetch ahead the cache size (or fraction thereof)
prefetch-X pc + n
// where n is the prefetch distance
// n > ordinary linear instruction streaming
// note: this example only works if the cache size can be expressed
// in the pc+offset addressing mode.
// the offset may be too small...
inst17
...
inst32
prefetch-X pc + n
// every I-cache line (16 instructions here), fetch cache line ICACHE_SIZE away in the linear flow.
...
end_of_first_ICACHE_SIZE_chunk_of_code:
...
...
instN
if ?? goto top_of_loop
// where the entire loop may be several times the size of the L0 or L1 ICACHE you are managing.
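To make the placement rule above concrete, the following Python sketch lists, for each cache line of a straight-line loop body, which line its embedded prefetch would target, and checks whether the prefetch distance fits in the instruction's offset field. The 12-bit signed offset is an assumption made for illustration (the thread does not fix an encoding), and all names are hypothetical.

```python
LINE_BYTES = 64   # 16 four-byte instructions per line, as in the example
OFFSET_BITS = 12  # assumed width of a signed pc-relative offset field


def offset_fits(distance_bytes, bits=OFFSET_BITS):
    """True if distance_bytes is representable as a signed bits-wide offset."""
    return -(1 << (bits - 1)) <= distance_bytes < (1 << (bits - 1))


def prefetch_schedule(body_bytes, distance_bytes):
    """Pair each line of the loop body with the line its prefetch targets.

    Lines are numbered from top_of_loop. Targets past the end of the
    body are dropped, since there is no loop code there to prefetch.
    """
    n_lines = -(-body_bytes // LINE_BYTES)  # ceiling division
    step = distance_bytes // LINE_BYTES
    return [(line, line + step) for line in range(n_lines)
            if line + step < n_lines]
```

Under these assumptions a 16 KiB prefetch distance does not fit (`offset_fits(16 * 1024)` is false), which is the limitation the comments in the example point out; a register-based form would be needed for such distances.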
This is for straight-line code. There was at least one real SPEC benchmark example that fit this pattern. Also, some RTL simulators emit single-assignment code that matches this pattern: all of the computations in a clock cycle, straight line, no branches. (Or at least emitted: compilers have been doing re-rolling optimizations for years to make such long straight-line code more I-cache friendly, i.e. reversing loop unrolling.)
Apart from these (real) examples, prefetching straight line code is not very interesting. Very long loops that have some internal branching - hammocks - can also use such prefetches.
It is hard to use such code prefetches on very branchy code. Which is why this is a "FTR (For The Record)" issue, rather than a serious proposal.
Address range prefetches for code have more obvious usage, whether implemented as address range PREFETCH-X.VAR instructions or via PREFETCH-X that uses a fixed-size block.
Closing due to lack of discussion.
The original proposal contains an address range code prefetch instruction, CMO.VAR.PREFETCH-X.
But this is not really friendly for use inside a pipelined loop.
Loop-friendly prefetch instructions in general use addressing modes similar to those of the code without any prefetches.
For instruction prefetch, this would look like
possibly
PREFETCH-X... pc+rs1+offset
There is the usual discussion about whether you want this or not. I just want to record this issue for now, rather than spending an hour writing it up.