[Closed] sunnycase closed this issue 10 months ago
It's a good question, but it's not addressed by the ISA spec. Your question is more about optimizing code for a particular microarchitecture. I suggest filing your question with the designers of the processor you are targeting.
For what it's worth, SiFive's X280, an in-order superscalar processor, is able to effectively utilize memory bandwidth in this example --- with higher LMUL --- without unrolling. So if you were targeting X280, I would say "don't bother unrolling; increase LMUL instead."
Increasing LMUL should have a similar effect, while keeping the stripmining bookkeeping much simpler.
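To illustrate the point, here is a hedged sketch (not taken from the thread or any spec) of what a stripmined vvadd loop at LMUL=8 might look like — the bookkeeping is identical to an LMUL=1 version; only the vsetvli and the register numbering (multiples of 8) change:

```
# Sketch: stripmine loop at LMUL=8. Each pass processes up to
# 8*VLEN/32 elements; the loop structure is unchanged vs. m1.
# a0 = element count, a1/a2 = source pointers, a3 = dest pointer.
loop:
    vsetvli t0, a0, e32, m8, ta, ma  # t0 = elements this pass
    vle32.v v0,  (a1)                # load first operand group
    vle32.v v8,  (a2)                # load second operand group
    vadd.vv v16, v0, v8              # add
    vse32.v v16, (a3)                # store result group
    sub  a0, a0, t0                  # elements remaining
    slli t0, t0, 2                   # bytes consumed = t0 * 4
    add  a1, a1, t0                  # bump pointers
    add  a2, a2, t0
    add  a3, a3, t0
    bnez a0, loop
```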
In an in-order superscalar CPU, the vadd.vv needs to wait for the vle32.v to return. So while the CPU is executing vadd.vv, the LSU is idle and the memory bandwidth goes unused.
The LSU stays busy on any non-trivial uArch with chaining support.
@nick-knight @aswaterman @sequencer Thank you for your help :)
I'm using T-Head C908, maybe I should try SiFive :)
@sunnycase I've got a bunch of benchmarks that are meant to help answer such questions at: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html
Specifically, the utf8_count benchmark has unrolling, manual tail handling, and pointer aligning with different LMULs. As you can see, even with LMUL=8, unrolling can sometimes still gain a bit of performance on top of just LMUL=8. This is probably mostly due to it being an in-order core; the OoO C920 didn't gain anything from it.
For the C908 I'd generally recommend using the maximal LMUL when possible, as long as you aren't using the permutation instructions.
The C920 had a problem when doing LMUL=8 loads & stores with little in between, e.g. memcpy: it would be slower than using LMUL=4. Presumably that's because the core can issue one 512-bit load and one 512-bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the loads and stores. This isn't a problem for the C908, though, and I don't expect it to be a problem in future cores.
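For reference, a minimal RVV memcpy stripmine loop at LMUL=4 of the kind described above (a hedged sketch, not the benchmark's actual code):

```
# Sketch: byte-wise RVV memcpy at LMUL=4. On the core described
# above, m4 lets the load of one pass overlap the store of the
# previous pass, whereas m8 reportedly serializes them.
# a0 = dest, a1 = src, a2 = byte count.
memcpy_m4:
    mv   a3, a0                      # preserve dest for return value
1:
    vsetvli t0, a2, e8, m4, ta, ma   # t0 = bytes this pass
    vle8.v  v0, (a1)                 # load up to 4*VLEN/8 bytes
    add  a1, a1, t0
    vse8.v  v0, (a3)                 # store them
    add  a3, a3, t0
    sub  a2, a2, t0                  # bytes remaining
    bnez a2, 1b
    ret
```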
@nick-knight Do you know how vcompress.vm scales with LMUL on the X280? The llvm-mca entries for it seem very bad (e8,m1: 8 cycles, e8,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles), although that would also be understandable, given that the X280 was mostly designed for AI/ML.
On X280, vcompress.vm occupies the vector arithmetic sequencer for roughly vl cycles. Since VLEN = 512 on this core, you should expect (and I have measured) roughly 512 cycles for the e8,m8 case, for example. X280 has a special optimization in the case vl * SEW <= 256 (the datapath width): it only occupies the vector arithmetic sequencer for ~1 cycle. However, it is very tricky to leverage this optimization, because vectorization at mf2 greatly reduces the benefits of the decoupled microarchitecture, and unrolling to expose ILP (the original topic of this thread) often does become necessary. (So, good question :)

IIRC, an earlier cost model we had upstreamed did not accurately capture the details of X280's vcompress.vm implementation. Perhaps that's what you're seeing. (Although I suspect there's a typo in the text you've quoted.) Anyway, I believe it's been updated in our downstream dev toolchain, but I don't know the upstream status. I'll nudge the compiler team.
Thanks for the reply, that lined up with what I expected. That definitely was a copy-paste error; I meant (e64,m1: 8 cycles, e64,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles; see https://godbolt.org/z/Ybno1Eh7G).
This could be quite beneficial if the same thing exists for vrgather.vv, as many uses are 16- and 32-byte lookup tables. Although it would certainly be easier to make use of if dispatching happens based on vl instead of LMUL (you kinda implied it happens based on LMUL).
Yes, a similar optimization exists for vrgather.vv. Again, in the general case it has vl-cycle occupancy, but it has ~1-cycle occupancy when LMUL <= 1/2, or when vl * SEW <= 256 and all active indices are < 256/SEW. IIRC, there is a penalty in the latter case, due to performing the bounds checks on the indices, but I forget the details. (And note that the mf2 special case precludes SEW = 64.) We didn't implement a similar optimization for vrgatherei16.vv: that's always vl-cycle occupancy. Hopefully all of this is up-to-date in the upstream cost model.
While we're in the weeds of SiFive's implementations, I'll mention that our Mallard microarchitecture (P470/P670/P870/etc.) throws a lot more circuitry at vrgather{ei16}.vv and vcompress.vm. You can view it as having a superset of the aforementioned X280 optimizations. I'm not sure what details have been published yet, so I'll hold off for now.
Thank you for the benchmarks, they're very useful.
@camel-cdr I had a quick look at the source code, but it seems you didn't reorder the instructions to make the loads issue contiguously. Maybe the in-order superscalar CPU (C908) cannot benefit from the unrolling without reordering. https://github.com/camel-cdr/rvv-bench/blob/main/bench/utf8_count.S#L110
Ah, that's true. I'll try to fix this later today.
@sunnycase It looks like instruction reordering (rvv_4x_tail) actually slows it down compared to the simple unroll (rvv_4x): https://camel-cdr.github.io/rvv-bench-results/canmv_k230/utf8_count.html
I used codegen from gcc to schedule the instructions properly; I also tested clang's codegen, but that was slower.
Closing, as this issue isn't germane to the ISA spec.
The source code of vvadd is at https://github.com/plctlab/rvv-benchmark/blob/master/vvaddint32.s I pasted it here:
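For context, the linked vvaddint32.s mirrors the well-known stripmine example from the RVV spec; a sketch from memory (details may differ from the exact file):

```
# Sketch of the classic vvaddint32 stripmine loop: z[i] = x[i] + y[i].
# a0 = element count, a1 = x, a2 = y, a3 = z.
vvaddint32:
    vsetvli t0, a0, e32, m1, ta, ma # t0 = elements this pass
    vle32.v v0, (a1)                # load x block
    sub  a0, a0, t0                 # decrement element count
    slli t0, t0, 2                  # bytes consumed this pass
    add  a1, a1, t0                 # bump x pointer
    vle32.v v1, (a2)                # load y block
    add  a2, a2, t0                 # bump y pointer
    vadd.vv v2, v0, v1              # z = x + y
    vse32.v v2, (a3)                # store z block
    add  a3, a3, t0                 # bump z pointer
    bnez a0, vvaddint32             # any elements left?
    ret
```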
I have a question about the loop: in an in-order superscalar CPU, the vadd.vv needs to wait for the vle32.v to return. So while the CPU is executing vadd.vv, the LSU is idle and the memory bandwidth is not used. Should I unroll the loop so the CPU executes the third vle32.v while the first vadd.vv is being issued, to fully utilize the memory bandwidth?
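To make the question concrete, here is a hedged sketch (not from the thread) of a 2x unroll that issues the second pair of loads before the first vadd.vv; register and temporary choices are illustrative, and it assumes the element count is a multiple of 2*vlmax so no per-block stripmining is needed:

```
# Sketch: 2x-unrolled vvadd with loads hoisted ahead of the adds,
# so the LSU keeps fetching while the ALU works (absent chaining).
# a0 = element count (multiple of 2*vlmax), a1 = x, a2 = y, a3 = z.
vvadd_unroll2:
    vsetvli t1, zero, e32, m1, ta, ma  # t1 = vlmax per block
    slli t2, t1, 2                     # bytes per block
loop:
    vle32.v v0, (a1)                   # x block 0
    vle32.v v1, (a2)                   # y block 0
    add  t3, a1, t2
    vle32.v v4, (t3)                   # x block 1: issued before any add
    add  t4, a2, t2
    vle32.v v5, (t4)                   # y block 1
    vadd.vv v2, v0, v1                 # add block 0
    vadd.vv v6, v4, v5                 # add block 1
    vse32.v v2, (a3)                   # store block 0
    add  t5, a3, t2
    vse32.v v6, (t5)                   # store block 1
    slli t6, t2, 1                     # bytes per unrolled iteration
    add  a1, a1, t6                    # bump pointers by two blocks
    add  a2, a2, t6
    add  a3, a3, t6
    slli t5, t1, 1
    sub  a0, a0, t5                    # consumed 2*vlmax elements
    bnez a0, loop
    ret
```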