google / android-riscv64

Issues and discussions around RISC-V support in AOSP.
Apache License 2.0

Investigate the current state of Auto-vectorization for RISC-V targets #23

Open appujee opened 1 year ago

appujee commented 1 year ago
topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
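
For illustration (not from this thread), a plain scalar loop like the hypothetical vadd.c below needs no vector intrinsics; if the loop vectorizer picks it up under -march=rv64gcv, it is this insertion pass that materializes the vsetvli instructions seen in the emitted assembly:

```c
// vadd.c -- hypothetical example, not part of this issue.
// Build (sketch): clang -O3 -march=rv64gcv -S vadd.c
// The loop vectorizer emits the vector loads/stores/adds; the backend's
// RISCVInsertVSETVLI pass then inserts the vsetvli instructions that set up
// SEW/LMUL/policy for those operations.
#include <stddef.h>

void vadd(int *restrict dst, const int *restrict a,
          const int *restrict b, size_t n) {
  for (size_t i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}
```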

nikolaypanchenko commented 1 year ago

Regarding:

Compiling the TSVC benchmark would be a good way to find out if commonly found loop structures are getting vectorized.

What is the success criterion? Is it about comparing vectorizable loops (i.e. loops that would be vectorized if heuristics are disabled) against AArch64 and X86?

idbaev commented 1 year ago

It would be great to get examples from Android where auto-vectorization for RISC-V Vectors (RVV) is not performing as expected, e.g. compared with the X86 and AArch64 targets.

appujee commented 1 year ago

What is the success criterion? Is it about comparing vectorizable loops (i.e. loops that would be vectorized if heuristics are disabled) against AArch64 and X86?

It is usually a good exercise to compare the number of vectorizable loops; it helps tune the vectorizer and find more opportunities for vectorization. We should do this comparative analysis at least w.r.t. AArch64.

Once this is done, we can do a comparative analysis on a larger code base like Android.
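
One way to surface per-loop vectorization decisions for such a comparison (a sketch, not a workflow prescribed in this thread) is clang's optimization remarks. The file name and loop below are hypothetical, but the loop shape is typical of TSVC:

```c
// tsvc_like.c -- illustrative loop in the style of TSVC, not the benchmark itself.
// Build (sketch): clang -O3 -march=rv64gcv -Rpass=loop-vectorize \
//                 -Rpass-missed=loop-vectorize -c tsvc_like.c
// The remarks name each loop that was vectorized and explain missed ones,
// which makes per-target comparisons (rv64gcv vs. cortex-a55) easier.
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
  float s = 0.0f;
  for (size_t i = 0; i < n; ++i)
    s += a[i] * b[i];   // FP reduction: typically vectorized only when FP
                        // reassociation is allowed (e.g. -ffast-math or
                        // -ffp-model=fast)
  return s;
}
```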

appujee commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was wondering whether inlining might introduce redundant vsetvlis. This can also happen in functions with multiple loops.

topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors. vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was wondering whether inlining might introduce redundant vsetvlis.

vsetvli intrinsics are allowed to CSE as of last week.

This can also happen in functions with multiple loops.

This depends on what style of vector loop you write. If you're using vsetvli inside the loop to avoid tail iterations, then you'll need vsetvlis inside each loop, so none are redundant. If you're using vsetvlmax and operating on whole registers in the loop, then yes, there could be a redundant one for each loop.

The current loop vectorizer operates on whole registers but doesn't use vsetvli intrinsics. The vsetvlis are all inserted by the insertion pass which runs just before machine IR leaves SSA form.
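
A concrete sketch of the two styles, written with the RVV C intrinsics (illustrative only, not from the thread; the spellings follow the __riscv_-prefixed v1.0 intrinsics and may differ on older toolchains):

```c
#include <riscv_vector.h>
#include <stddef.h>

// Style 1: vsetvli inside the loop picks the VL for the remaining elements,
// so there is no separate tail and no per-iteration vsetvli is redundant.
void add_stripmined(int *dst, const int *a, const int *b, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = __riscv_vsetvl_e32m1(n - i);
    vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
    vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
    __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
    i += vl;
  }
}

// Style 2: vsetvlmax once, whole registers inside the loop; a vsetvli
// re-emitted for every iteration would be redundant here.
void add_whole_regs(int *dst, const int *a, const int *b, size_t n) {
  size_t vl = __riscv_vsetvlmax_e32m1();
  size_t i = 0;
  for (; i + vl <= n; i += vl) {
    vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
    vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
    __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
  }
  for (; i < n; ++i)   // scalar tail
    dst[i] = a[i] + b[i];
}
```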

appujee commented 1 year ago

vsetvli intrinsics are allowed to CSE as of last week.

Ah, ok. This should be sufficient. Thanks for clarifying.

nikolaypanchenko commented 1 year ago

@appujee could you please provide the options to use for ARM to compile the TSVC benchmark? Do you have specific options for RISC-V?

appujee commented 1 year ago

Try -mcpu=cortex-a55 for ARM.

For RISC-V, use rv64gcv; please share if you have a CPU flag that gives better vectorization.

nikolaypanchenko commented 1 year ago

The number of loops vectorized as-is using upstream LLVM 5c1b8de77d1c:

| Arch | Number of vectorized loops |
| --- | --- |
| -march=rv64gcv | 1299 |
| -mcpu=cortex-a55 | 904 |

Obviously, the performance of the vectorized loops is a different aspect, but that won't be easy to answer for RISC-V in general.

appujee commented 1 year ago

Nice! Is it possible to know how many loops we start with in both cases? Do we have something like 'number of loops analyzed'? It could be that inlining etc. resulted in a different number of loops to begin with.
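
For reference, the loop vectorizer does keep counters along these lines; a sketch of collecting them (assuming an LLVM build with statistics available, e.g. an assertions build, and a hypothetical input file):

```c
// count_loops.c -- trivial illustrative input, not from the thread.
// Build (sketch): clang -O3 -march=rv64gcv -mllvm -stats -c count_loops.c
// With statistics enabled, the loop-vectorize pass reports counters such as
// "Number of loops analyzed for vectorization" and "Number of loops vectorized".
#include <stddef.h>

void scale(float *x, float k, size_t n) {
  for (size_t i = 0; i < n; ++i)
    x[i] *= k;
}
```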

nikolaypanchenko commented 1 year ago

Updated: my original numbers didn't include loops from tsvc.c

| Arch | fp-model | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- | --- |
| -march=rv64gcv | default (strict for Clang) | 735 | 460 |
| -march=rv64gcv | strict (same as default) | 735 | 460 |
| -march=rv64gcv | fast | 735 | 667 |
| -mcpu=cortex-a55 | default (strict for Clang) | 736 | 176 |
| -mcpu=cortex-a55 | strict (same as default) | 736 | 176 |
| -mcpu=cortex-a55 | fast | 735 | 635 |

Details:

Default fp-model / fp-model=strict, excluding tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 555 | 373 |
| -mcpu=cortex-a55 | 555 | 176 |

Default fp-model / fp-model=strict, tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 180 | 87 |
| -mcpu=cortex-a55 | 181 | 0 |

fp-model=fast, excluding tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 555 | 553 |
| -mcpu=cortex-a55 | 555 | 552 |

fp-model=fast, tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 180 | 114 |
| -mcpu=cortex-a55 | 180 | 83 |

appujee commented 1 year ago

That's very promising, as RISC-V is ahead. I've marked the first item as done. Thanks for helping with this.

appujee commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per https://github.com/llvm/llvm-project/issues/58834, there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per llvm/llvm-project#58834, there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

The code in that ticket does not look like what the current upstream vectorizer or the proposed VP intrinsic vectorizer from our downstream generates, so I don't think it is directly relevant to auto-vectorization.

nikolaypanchenko commented 1 year ago

I believe @appujee refers to the third task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be read as covering any vset*vli generated by codegen within a vectorized loop. @topperc, do you know if anyone has started to look at that reported issue?

topperc commented 1 year ago

I believe @appujee refers to the third task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be read as covering any vset*vli generated by codegen within a vectorized loop.

What we currently have doesn't remove any vsetvlis generated by explicit vsetvli intrinsics. We do a reaching-def-like analysis to insert additional vsetvlis wherever we think they are needed to satisfy the SEW, LMUL, tail policy, and mask policy required by the vector load/store/arithmetic instructions.

@topperc, do you know if anyone has started to look at that reported issue?

I don't think anyone has looked at the issue. We have a reaching definition analysis. We detect a mismatch because the preheader edge sees the vsetvli in the preheader and the backedge sees the vsetvli from the previous iteration.
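
As an illustration of the shape being discussed (a hypothetical example, not the reproducer from llvm/llvm-project#58834): when the vector configuration never changes inside the loop, a vsetvli emitted at the top of the body is redundant along the backedge, and proving that requires reasoning about both the preheader edge and the backedge:

```c
#include <riscv_vector.h>
#include <stddef.h>

// The VL/SEW/LMUL state is set once before the loop and reused unchanged in
// every iteration, so any vsetvli re-emitted inside the loop body repeats the
// state already established by the preheader (and by the previous iteration).
void saxpy_fixed_vl(float *y, const float *x, float a, size_t n) {
  size_t vl = __riscv_vsetvlmax_e32m1();
  for (size_t i = 0; i + vl <= n; i += vl) {
    vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);
    vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);
    // vy += a * vx
    __riscv_vse32_v_f32m1(y + i, __riscv_vfmacc_vf_f32m1(vy, a, vx, vl), vl);
  }
}
```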

appujee commented 1 year ago

Pipeline cost model for vector instructions: D149495 posted by michaelmaitland. We can use -mcpu=sifive-x280 to try it out.
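
A minimal way to try it (a sketch; the file name is hypothetical) is to compare the assembly generated with the generic target and with the X280 model:

```c
// sched_try.c -- hypothetical file for experimentation.
// Build (sketch):
//   clang -O3 -march=rv64gcv    -S sched_try.c -o generic.s
//   clang -O3 -mcpu=sifive-x280 -S sched_try.c -o x280.s
// Differences between the two .s files reflect the X280 pipeline model's
// influence on vectorization decisions and instruction scheduling.
#include <stddef.h>

void mul_add(float *restrict d, const float *restrict a,
             const float *restrict b, size_t n) {
  for (size_t i = 0; i < n; ++i)
    d[i] = a[i] * b[i] + a[i];
}
```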