riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Should I unroll the loop in implementation of vvadd? #929

Closed sunnycase closed 10 months ago

sunnycase commented 10 months ago

The source code of vvadd is at https://github.com/plctlab/rvv-benchmark/blob/master/vvaddint32.s I pasted it here:

    .text
    .balign 4
    .global vvaddint32
    # vector-vector add routine of 32-bit integers
    # void vvaddint32(size_t n, const int*x, const int*y, int*z)
    # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
    #
    # a0 = n, a1 = x, a2 = y, a3 = z
    # Non-vector instructions are indented
vvaddint32:
    vsetvli t0, a0, e32, m1, ta,ma  # Set vector length based on 32-bit vectors
    vle32.v v0, (a1)         # Get first vector
      sub a0, a0, t0         # Decrement number done
      slli t0, t0, 2         # Multiply number done by 4 bytes
      add a1, a1, t0         # Bump pointer
    vle32.v v1, (a2)         # Get second vector
      add a2, a2, t0         # Bump pointer
    vadd.vv v2, v0, v1       # Sum vectors
    vse32.v v2, (a3)         # Store result
      add a3, a3, t0         # Bump pointer
      bnez a0, vvaddint32    # Loop back
      ret                    # Finished
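To make the stripmining bookkeeping explicit, the routine above can be modeled in plain C. The `VLMAX` constant here is a hypothetical stand-in for the vl that `vsetvli` grants; real hardware computes it from VLEN, SEW, and LMUL:

```c
#include <stddef.h>

/* Hypothetical per-strip vector length; real hardware returns
 * vl = min(n, VLMAX) from vsetvli, with VLMAX = LMUL * VLEN / SEW. */
#define VLMAX 4

/* Scalar model of the stripmined vvaddint32 loop above. */
void vvaddint32_model(size_t n, const int *x, const int *y, int *z) {
    while (n > 0) {
        size_t vl = n < VLMAX ? n : VLMAX;  /* vsetvli t0, a0, ...   */
        for (size_t i = 0; i < vl; i++)     /* vle32 / vadd / vse32  */
            z[i] = x[i] + y[i];
        n -= vl;                            /* sub a0, a0, t0        */
        x += vl; y += vl; z += vl;          /* bump pointers         */
    }
}
```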

I have a question about the loop: in an in-order superscalar CPU, the vadd.vv needs to wait for the vle32.v to return. So when the CPU is executing vadd.vv, the LSU isn't working and the memory bandwidth is not used. Should I unroll the loop so the CPU can execute the third vle32.v while the first vadd.vv is being issued, to fully utilize the memory bandwidth?

    vle32.v v0
    vle32.v v1
    vle32.v v2
    vle32.v v3
    vadd.vv v0, v0, v1
    vadd.vv v2, v2, v3
    vse32.v v0
    vse32.v v2

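A scalar C sketch of this two-way unroll (again with a hypothetical fixed `VLMAX`): each iteration consumes two strips, so on real hardware the second strip's loads could overlap the first strip's add:

```c
#include <stddef.h>

#define VLMAX 4  /* hypothetical vl granted per strip */

/* Scalar model of the 2x-unrolled stripmine loop sketched above.
 * Each iteration handles two strips; the second strip's loads could
 * (on real hardware) be issued while the first strip's add executes. */
void vvaddint32_unrolled_model(size_t n, const int *x, const int *y, int *z) {
    while (n > 0) {
        size_t vl0 = n < VLMAX ? n : VLMAX;   /* first strip  */
        n -= vl0;
        size_t vl1 = n < VLMAX ? n : VLMAX;   /* second strip (may be 0) */
        n -= vl1;
        for (size_t i = 0; i < vl0; i++)
            z[i] = x[i] + y[i];
        for (size_t i = 0; i < vl1; i++)
            z[vl0 + i] = x[vl0 + i] + y[vl0 + i];
        x += vl0 + vl1; y += vl0 + vl1; z += vl0 + vl1;
    }
}
```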
nick-knight commented 10 months ago

It's a good question, but it's not addressed by the ISA spec. Your question is more about optimizing code for a particular microarchitecture. I suggest filing your question with the designers of the processor you are targeting.

For what it's worth, SiFive's X280, an in-order superscalar processor, is able to effectively utilize memory bandwidth in this example --- with higher LMUL --- without unrolling. So if you were targeting X280, I would say "don't bother unrolling; increase LMUL instead."

aswaterman commented 10 months ago

Increasing LMUL should have similar effect, while keeping the stripmining bookkeeping much simpler.
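To see why: VLMAX scales linearly with LMUL, so the stripmine iteration count (and its bookkeeping overhead) drops proportionally. A small illustrative calculation, with all parameters hypothetical:

```c
#include <stddef.h>

/* Stripmine iteration count as a function of LMUL.
 * VLMAX = LMUL * VLEN / SEW, so raising LMUL shrinks the number of
 * loop iterations proportionally. Parameters are illustrative. */
size_t stripmine_iters(size_t n, size_t vlen, size_t sew, size_t lmul) {
    size_t vlmax = lmul * vlen / sew;
    return (n + vlmax - 1) / vlmax;  /* ceil(n / VLMAX) */
}
```

For example, with VLEN=512 and SEW=32, going from LMUL=1 to LMUL=8 cuts the iteration count for n=1024 from 64 to 8.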

sequencer commented 10 months ago

> In an in-order superscalar CPU, the vadd.vv needs to wait for the vle32.v to return. So when the CPU is executing vadd.vv, the LSU isn't working and the memory bandwidth is not used.

The LSU keeps working on any non-trivial uArch with chaining support.

sunnycase commented 10 months ago

@nick-knight @aswaterman @sequencer Thank you for your help :)

sunnycase commented 10 months ago

> It's a good question, but it's not addressed by the ISA spec. Your question is more about optimizing code for a particular microarchitecture. I suggest filing your question with the designers of the processor you are targeting.
>
> For what it's worth, SiFive's X280, an in-order superscalar processor, is able to effectively utilize memory bandwidth in this example --- with higher LMUL --- without unrolling. So if you were targeting X280, I would say "don't bother unrolling; increase LMUL instead."

I'm using T-Head C908, maybe I should try SiFive :)

camel-cdr commented 10 months ago

@sunnycase I've got a bunch of benchmarks that are meant to help answer such questions at: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html

Specifically, the utf8_count benchmark has unrolling, manual tail handling, and pointer aligning at different LMULs. As you can see, even at LMUL=8, unrolling can sometimes still gain a bit of performance on top of LMUL=8 alone. This is probably mostly due to it being an in-order core; the OoO C920 didn't gain anything from it.

For the C908 I'd generally recommend using the maximal LMUL when possible, as long as you aren't using the permutation instructions.

The C920 had a problem when doing LMUL=8 loads & stores with little in between (e.g. memcpy): it would be slower than using LMUL=4. Presumably this is because the core can issue one 512-bit load and one 512-bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the loads and stores. But this isn't a problem for the C908, and I don't expect it to be a problem in future cores.

@nick-knight Do you know how vcompress.vm scales with LMUL on the X280? The llvm-mca entries for it seem very bad (e8,m1: 8 cycles, e8,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles), although that would also be understandable, given that the X280 was mostly designed for AI/ML.

nick-knight commented 10 months ago

> @nick-knight Do you know how vcompress.vm scales with LMUL on the X280? The llvm-mca entries for it seem very bad (e8,m1: 8 cycles, e8,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles), although that would also be understandable, given that the X280 was mostly designed for AI/ML.

On X280, vcompress.vm occupies the vector arithmetic sequencer for roughly vl cycles. Since VLEN = 512 on this core, you should expect (and I have measured) roughly 512 cycles for the e8,m8 case, for example. X280 has a special optimization in the case vl * SEW <= 256 (the datapath width): it only occupies the vector arithmetic sequencer for ~1 cycle. However, it is very tricky to leverage this optimization because vectorization at mf2 greatly reduces the benefits of the decoupled microarchitecture, and unrolling to expose ILP (the original topic of this thread) often does become necessary. (So, good question :)
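A rough model of the occupancy rule described above (illustrative only, not SiFive documentation):

```c
/* Rough occupancy model for vcompress.vm on X280, per the description
 * above: ~vl cycles in general, ~1 cycle when vl * SEW fits within the
 * 256-bit datapath. Illustrative sketch, not vendor documentation. */
unsigned vcompress_occupancy(unsigned vl, unsigned sew) {
    return (vl * sew <= 256) ? 1 : vl;
}
```

With VLEN=512, the e8,m8 case gives vl=512, hence the ~512-cycle figure quoted above.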

IIRC, an earlier cost model we had upstreamed did not accurately capture the details of X280's vcompress.vm implementation. Perhaps that's what you're seeing. (Although I suspect there's a typo in the text you've quoted.) Anyway, I believe it's been updated in our downstream dev toolchain, but I don't know the upstream status. I'll nudge the compiler team.

camel-cdr commented 10 months ago

Thanks for the reply, that lined up with what I expected. That definitely was a copy-pasting error; I meant (e64,m1: 8 cycles, e64,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles, see https://godbolt.org/z/Ybno1Eh7G).

> X280 has a special optimization in the case vl * SEW <= 256 (the datapath width): it only occupies the vector arithmetic sequencer for ~1 cycle. However, it is very tricky to leverage this optimization because vectorization at mf2 greatly reduces the benefits of the decoupled microarchitecture, and unrolling to expose ILP (the original topic of this thread) often does become necessary. (So, good question :)

This could be quite beneficial if the same thing exists for vrgather.vv, as many uses are 16- and 32-byte lookup tables. Although it would certainly be easier to make use of if dispatching happened based on vl instead of LMUL (you kind of implied it happens based on LMUL).

nick-knight commented 10 months ago

> This could be quite beneficial if the same thing exists for vrgather.vv

Yes, a similar optimization exists for vrgather.vv. Again, in the general case, it has vl-cycle occupancy, but has a ~1-cycle occupancy when LMUL <= 1/2, or when vl * SEW <= 256 and all active indices are < 256/SEW. IIRC, there is a penalty in the latter case, due to performing the bounds checks on the indices, but I forget the details. (And note that the mf2 special case precludes SEW = 64.) We didn't implement a similar optimization for vrgatherei16.vv: that's always vl-cycle occupancy. Hopefully all of this is up-to-date in the upstream cost model.
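The fast-path condition as described can be summarized as a predicate (illustrative sketch; LMUL is expressed in eighths here so fractional values like 1/2 stay integral):

```c
#include <stdbool.h>

/* Fast-path condition for vrgather.vv on X280, per the description
 * above: ~1-cycle occupancy when LMUL <= 1/2, or when vl * SEW <= 256
 * and every active index is < 256/SEW. LMUL is given in eighths
 * (LMUL=1/2 -> 4, LMUL=1 -> 8, LMUL=8 -> 64). Illustrative only. */
bool vrgather_fast_path(unsigned vl, unsigned sew, unsigned lmul_eighths,
                        unsigned max_active_index) {
    if (lmul_eighths <= 4)                       /* LMUL <= 1/2 */
        return true;
    return vl * sew <= 256 && max_active_index < 256 / sew;
}
```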

While we're in the weeds of SiFive's implementations, I'll mention that our Mallard microarchitecture (P470/P670/P870/etc.) throws a lot more circuitry at vrgather{ei16}.vv and vcompress.vm. You can view it as having a superset of the aforementioned X280 optimizations. I'm not sure what details have been published yet so I'll hold off for now.

sunnycase commented 10 months ago

> @sunnycase I've got a bunch of benchmarks that are meant to help answer such questions at: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html
>
> Specifically the utf8_count benchmark, has unrolling, manual tail handling, and pointer aligning with different LMULs. As you can see, even with LMUL=8 unrolling can sometimes still gain a bit performance on top just LMUL=8. This is probably mostly due to it being an in-order core, the ooo C920 didn't gain anything from it.
>
> For the C908 I'd generally recommend using the maximal LMUL when possible, and you aren't using the permutation instructions.
>
> The C920 had a problem when doing LMUL=8 load & stores with little in between, e.g memcpy, it would be slower than using LMUL=4. Presumably, because the core can issue one 512 bit load and one 512 bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the load stores. But this isn't a problem for the C908, and I don't expect it to be a problem in future cores.
>
> @nick-knight Do you know how vcompress.vm scales with LMUL on the X280? The llvm-mca entries for it seem very bad (e8,m1: 8 cycles, e8,m8: 64 cycles, e8,m1: 64 cycles, e8,m8: 512 cycles), although that would also be understandable, given that the X280 was mostly design for AI/ML.

Thank you for the benchmarks, they're very useful.

sunnycase commented 10 months ago

@camel-cdr I had a quick look at the source code, but it seems you didn't reorder the instructions to make the loads issue contiguously. Maybe the in-order superscalar CPU (C908) cannot benefit from the unrolling without reordering. https://github.com/camel-cdr/rvv-bench/blob/main/bench/utf8_count.S#L110

camel-cdr commented 10 months ago

Ah, that's true. I'll try to fix this later today.

camel-cdr commented 10 months ago

@sunnycase It looks like instruction reordering (rvv_4x_tail) actually slows it down compared to the simple unroll (rvv_4x): https://camel-cdr.github.io/rvv-bench-results/canmv_k230/utf8_count.html

I used codegen from gcc to schedule the instructions properly; I also tested clang's codegen, but that was slower.

aswaterman commented 10 months ago

Closing, as this issue isn't germane to the ISA spec.