riscv / riscv-CMOs

https://jira.riscv.org/browse/RVG-59
Creative Commons Attribution 4.0 International
Define possible loop friendly code prefetch instructions (not currently recommended for POR, but FTR) #4

Closed AndyGlew closed 2 years ago

AndyGlew commented 3 years ago

The original proposal contains an address-range code prefetch instruction, CMO.VAR.PREFTCH-X.

But this is not really friendly for use inside a pipelined loop.

Loop-friendly prefetch instructions in general use addressing modes similar to those of the code without any prefetches.

For instruction prefetch, this would look like

possibly

allenjbaum commented 3 years ago

You need examples of what the code that uses this would look like in both cases. So, to be clear, instead of PREFETCH_ORDINARY rd, lower, upper, where rd indicates where the prefetch finished (or the first unprefetched byte...), which needs to be put into a loop:

       LA     lower, lower_Addr
       LA     upper, upper_Addr
loop:
       CMO.VAR.PREFTCH-X lower, lower, upper
       slt    notdone, lower, upper
       bnez   notdone, loop

You want:

       LA     lower, lower_Addr
       LA     upper, upper_Addr
loop:
       PREFETCH_ORDINARY 0(lower)
       PREFETCH_ORDINARY 64(lower)
       PREFETCH_ORDINARY 128(lower)
       PREFETCH_ORDINARY 192(lower)
       addi   lower, lower, 64
       slt    notdone, lower, upper
       bnez   notdone, loop
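
As a rough illustration of the difference between the two styles above, here is a small Python model (not RISC-V code). It rests on assumptions that are not in the proposal: a 64-byte line, a range instruction that handles at most `max_lines` lines per execution and returns the first unprefetched address, and a per-line loop that advances its base by the whole unrolled span so each line is touched once. The names `range_prefetch`, `range_loop`, `per_line_loop`, and `max_lines` are hypothetical.

```python
LINE = 64  # assumed cache line size in bytes


def range_prefetch(lower, upper, max_lines=2):
    """Model one execution of a range prefetch over [lower, upper).

    Touches at most max_lines lines (an assumed implementation limit) and
    returns (lines_touched, first_unprefetched_address).
    """
    touched = []
    addr = lower - (lower % LINE)  # align down to the containing line
    while addr < upper and len(touched) < max_lines:
        touched.append(addr)
        addr += LINE
    return touched, min(addr, upper)


def range_loop(lower, upper, max_lines=2):
    """Software loop around the range instruction: repeat until rd reaches upper."""
    lines, iterations = [], 0
    while lower < upper:
        touched, lower = range_prefetch(lower, upper, max_lines)
        lines += touched
        iterations += 1
    return lines, iterations


def per_line_loop(lower, upper, unroll=4):
    """Fixed-block style: `unroll` base+offset prefetches per iteration.

    Assumption: the base advances by unroll * LINE so no line is prefetched twice.
    The bounds check is a modelling convenience; real prefetches are only hints.
    """
    lines, iterations = [], 0
    base = lower - (lower % LINE)
    while base < upper:
        for k in range(unroll):
            addr = base + k * LINE
            if addr < upper:
                lines.append(addr)
        base += unroll * LINE
        iterations += 1
    return lines, iterations
```

Under these assumptions both loops touch the same lines. They differ in how many loop iterations are needed and in whether the addressing looks like ordinary base+offset code.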

AndyGlew commented 3 years ago

Imagine a very long straight-line loop:

loop:
       inst1
       ...
       instN
       if ?? goto loop

Where the size of the loop body is very large, exceeding the size of a typical L0 or L1 I-cache.

(IIRC there was such a benchmark in an earlier version of SPEC.)

Add prefetches for code as below.
If you wish, start off by loading up the first I-cache's worth of the loop:

       // range based prefetch
       // using the [lwb,upb) approach
       // similar if [lwb,lwb+count) or the like
       x1 := top_of_loop
       x2 := top_of_loop + ICACHE_SIZE  // usually not full size, half or similar
address_range_loop:
       prefetch-X.var x1,x1,x2
       bne x1,x2, address_range_loop

top_of_loop:
       inst1
       inst2
       ...
       inst16

      // first 64B cache line of code done. 
      // prefetch ahead the cache size (or fraction thereof) 
      prefetch-X pc + n  
      // where n is the prefetch distance 
      // n > ordinary linear instruction streaming

      // note: this example only works if the cache size can be expressed 
      // in the pc+offset addressing mode.
      // the offset may be too small... 

      inst17
      ...
      inst32
      prefetch-X pc + n  

      // every I-cache line (16 instructions here), fetch cache line ICACHE_SIZE away in the linear flow.

       ...
end_of_first_ICACHE_SIZE_chunk_of_code:
       ...
       ...
      instN
      if ?? goto top_of_loop

// where the entire loop may be several times the size of the L0 or L1 ICACHE you are managing.
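
The placement rule in the listing (one prefetch per 64-byte line, targeting a fixed distance ahead) can be sketched in Python as below. This is only an illustration under stated assumptions: fixed 4-byte instructions, offsets measured from top_of_loop, and an optional `wrap` mode that sends targets past the end of the loop back to the top, which the listing itself does not specify. The 12-bit signed immediate in `fits_signed_imm` is also an assumed width, chosen to show the "offset may be too small" concern. The function names are hypothetical.

```python
LINE_BYTES = 64  # 16 four-byte instructions per line, as in the listing


def prefetch_schedule(loop_bytes, distance, wrap=False):
    """Return (line_offset, target_offset) pairs, one per cache line of loop body.

    Offsets are relative to top_of_loop. Each line issues one prefetch for the
    address `distance` bytes ahead. Targets past the end of the loop are
    dropped, or wrapped to the top when wrap=True (an assumption).
    """
    schedule = []
    for line_start in range(0, loop_bytes, LINE_BYTES):
        target = line_start + distance
        if target >= loop_bytes:
            if not wrap:
                continue
            target %= loop_bytes
        schedule.append((line_start, target))
    return schedule


def fits_signed_imm(value, bits=12):
    """True if value fits a signed immediate of `bits` bits (12 is an assumed width)."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


if __name__ == "__main__":
    for line, target in prefetch_schedule(512, 256, wrap=True):
        print(f"line at +{line}: prefetch +{target}")
    # Half of a 32 KiB cache as the prefetch distance does not fit 12 bits.
    print(fits_signed_imm(16 * 1024))
```

With a distance of half of a 32 KiB I-cache, the offset would not fit such an immediate, so the target would have to come from a register or a wider encoding.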

This is for straight-line code. There was at least one real SPEC benchmark example that fit this pattern. Also, some RTL simulators emit single-assignment code that matches this pattern: all of the computations in a clock cycle, straight line, no branches. (Or at least emitted: compilers have been doing re-rolling optimizations for years to make such long straight-line code more ICACHE friendly, reversing loop unrolling.)

Apart from these (real) examples, prefetching straight line code is not very interesting. Very long loops that have some internal branching - hammocks - can also use such prefetches.

It is hard to use such code prefetches on very branchy code. Which is why this is a "FTR (For The Record)" issue, rather than a serious proposal.

Address range prefetches for code have more obvious usage, whether implemented as address range PREFETCH-X.VAR instructions or via a PREFETCH-X that uses a fixed-size block.

dkruckemyer-ventana commented 2 years ago

Closing due to lack of discussion.