appujee opened this issue 1 year ago
There is no instruction scheduling for RISC-V vectors.
vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
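To make the insertion scheme concrete, here is a deliberately simplified Python model of what the pass does; this is a sketch, not real MIR, and the instruction names and configurations are illustrative: track the current (SEW, LMUL, policy) state and emit a vsetvli only when the next instruction demands something different.

```python
# Simplified model of the vsetvli insertion idea: walk a block's
# instructions, track the current (SEW, LMUL, policy) state, and emit a
# vsetvli only when an instruction demands a different configuration.
# Instruction names and configs below are illustrative, not real MIR.

def insert_vsetvlis(instrs):
    """instrs: list of (name, (sew, lmul, policy)) tuples."""
    out = []
    state = None  # state is unknown at block entry
    for name, cfg in instrs:
        if cfg != state:
            out.append(("vsetvli", cfg))  # configuration change: insert
            state = cfg
        out.append((name, cfg))
    return out

block = [
    ("vle32.v", (32, "m1", "tu,mu")),
    ("vadd.vv", (32, "m1", "tu,mu")),  # same config: no vsetvli needed
    ("vle64.v", (64, "m2", "tu,mu")),  # SEW/LMUL change: new vsetvli
    ("vse64.v", (64, "m2", "tu,mu")),
]
result = insert_vsetvlis(block)
print(sum(1 for name, _ in result if name == "vsetvli"))  # 2
```

The real pass also reasons across basic blocks (what state reaches a block from its predecessors), which this per-block sketch omits.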
Compiling the TSVC benchmark would be a good way to find out whether commonly found loop structures are getting vectorized.
What are the success criteria? Is it about comparing vectorizable loops (i.e. those that would be vectorized if heuristics were disabled) against AArch64 and X86?
It would be great to get examples from Android where auto-vectorization for RISC-V Vectors (RVV) is not performing as expected, e.g. compared with the X86 and AArch64 targets.
> What are the success criteria? Is it about comparing vectorizable loops (i.e. those that would be vectorized if heuristics were disabled) against AArch64 and X86?
It is usually a good exercise to compare the number of vectorizable loops; it helps tune the vectorizer and find more opportunities for vectorization. We should do this comparative analysis at least w.r.t. AArch64.
Once this is done, we can do a comparative analysis on a larger code base like Android.
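One way to gather such counts, assuming you compile with clang's optimization-remark flags (-Rpass=loop-vectorize and -Rpass-missed=loop-vectorize), is to tally the remark lines per target. The sample remark text below is illustrative, not captured from a real run.

```python
# Hypothetical sketch: count vectorized vs. missed loops from clang
# optimization-remark output. In practice you would capture stderr from
# clang -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize and diff
# the counts across -march=rv64gcv and -mcpu=cortex-a55 builds.

import re

remarks = """\
tsvc.c:101:5: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize]
tsvc.c:120:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
common.c:33:5: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
"""

vectorized = len(re.findall(r"\[-Rpass=loop-vectorize\]", remarks))
missed = len(re.findall(r"\[-Rpass-missed=loop-vectorize\]", remarks))
print(vectorized, missed)  # 2 1
```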
> There is no instruction scheduling for RISC-V vectors.
>
> vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
I was wondering whether inlining might surface redundant vsetvlis. In functions with multiple loops this can also happen.
> There is no instruction scheduling for RISC-V vectors. vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
> I was wondering whether inlining might surface redundant vsetvlis.
vsetvli intrinsics are allowed to CSE as of last week.
> In functions with multiple loops this can also happen.
This depends on what style of vector loop you write. If you're using vsetvli inside the loop to avoid tail iterations, then you'll need a vsetvli inside each loop, so none are redundant. If you're using vsetvlmax and operating on whole registers in the loop, then yes, there could be a redundant one for each loop.
The current loop vectorizer operates on whole registers but doesn't use vsetvli intrinsics. The vsetvlis are all inserted by the insertion pass which runs just before machine IR leaves SSA form.
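The two loop styles described above can be sketched with a scalar analogue; VLMAX here is an arbitrary stand-in for the hardware vector length, and the functions are illustrative, not how the vectorizer actually structures code.

```python
# Scalar analogue of the two RVV loop styles. Style 1 re-computes the
# active length each iteration, as a per-iteration vsetvli would, so
# there is no separate tail loop. Style 2 uses whole "registers" of
# VLMAX elements plus a separate tail, as a vsetvlmax-style loop would.

VLMAX = 4  # arbitrary stand-in for the real vector length

def style1_stripmine(n):
    """vsetvli-in-loop style: vl = min(remaining, VLMAX) each trip."""
    trips = []
    i = 0
    while i < n:
        vl = min(n - i, VLMAX)  # what vsetvli with AVL = n - i would pick
        trips.append(vl)
        i += vl
    return trips

def style2_whole_register(n):
    """vsetvlmax style: full VLMAX bodies, then a separate tail."""
    body_trips = [VLMAX] * (n // VLMAX)
    tail = n % VLMAX
    return body_trips, tail

print(style1_stripmine(10))       # [4, 4, 2] -- no separate tail loop
print(style2_whole_register(10))  # ([4, 4], 2) -- tail handled separately
```

In style 1 the per-trip vsetvli is doing real work every iteration; in style 2 the configuration never changes inside the loop, which is where a redundant vsetvli per loop can appear.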
> vsetvli intrinsics are allowed to CSE as of last week.
Ah ok, this should be sufficient. Thanks for clarifying.
@appujee could you please provide the options to use for ARM to compile the TSVC benchmark? Do you have specific options for RISC-V?
Try -mcpu=cortex-a55 for ARM. For RISC-V, -march=rv64gcv; please share if you have a cpu flag that gives better vectorization.
The number of loops vectorized as-is using upstream LLVM 5c1b8de77d1c:

Arch | Number of vectorized loops
---|---
-march=rv64gcv | 1299
-mcpu=cortex-a55 | 904
Obviously, the performance of the vectorized loops is a different aspect, and one that won't be easy to answer for RISC-V in general.
Nice! Is it possible to know how many loops we start with in both cases? Do we have something like a 'number of loops analyzed' statistic? It could be that inlining etc. resulted in a different number of loops to begin with.
Updated: my original numbers didn't include loops from tsvc.c
Arch | default fp-model (strict for Clang): #LoopsAnalyzed | #LoopsVectorized | fp-model=strict (same as default): #LoopsAnalyzed | #LoopsVectorized | fp-model=fast: #LoopsAnalyzed | #LoopsVectorized
---|---|---|---|---|---|---
-march=rv64gcv | 735 | 460 | 735 | 460 | 735 | 667
-mcpu=cortex-a55 | 736 | 176 | 736 | 176 | 735 | 635
Details:
fp-model=strict

common.c:

Arch | #LoopsAnalyzed | #LoopsVectorized
---|---|---
-march=rv64gcv | 555 | 373
-mcpu=cortex-a55 | 555 | 176

tsvc.c:

Arch | #LoopsAnalyzed | #LoopsVectorized
---|---|---
-march=rv64gcv | 180 | 87
-mcpu=cortex-a55 | 181 | 0
fp-model=fast

common.c:

Arch | #LoopsAnalyzed | #LoopsVectorized
---|---|---
-march=rv64gcv | 555 | 553
-mcpu=cortex-a55 | 555 | 552

tsvc.c:

Arch | #LoopsAnalyzed | #LoopsVectorized
---|---|---
-march=rv64gcv | 180 | 114
-mcpu=cortex-a55 | 180 | 83
That's very promising, as RISC-V is ahead. I've marked the first item as done. Thanks for helping with this.
> There is no instruction scheduling for RISC-V vectors.

> vsetvli intrinsics are allowed to CSE as of last week.

As per https://github.com/llvm/llvm-project/issues/58834, there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko
The code in that ticket does not look like how the current upstream vectorizer or the proposed VP intrinsic vectorizer from our downstream generate code so I don't think it is directly relevant to autovectorization.
I believe @appujee refers to the 3rd task:

> Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be taken to cover any vset*vli generated by codegen within a vectorized loop.
@topperc, do you know if anyone has started to look at that reported issue?
> I believe @appujee refers to the 3rd task:
>
> > Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.
>
> which may or may not be taken to cover any vset*vli generated by codegen within a vectorized loop.
What we currently have doesn't remove any vsetvlis generated by explicit vsetvli intrinsics. We do a reaching-def-like analysis to insert additional vsetvlis wherever we think they are needed to satisfy the SEW, LMUL, tail policy, and mask policy needed by the vector load/store/arithmetic instructions.
> @topperc, do you know if anyone has started to look at that reported issue?
I don't think anyone has looked at the issue. We have a reaching definition analysis. We detect a mismatch because the preheader edge sees the vsetvli in the preheader and the backedge sees the vsetvli from the previous iteration.
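The merge described above can be sketched as a toy reaching-definition-style model; the states and the meet function are made up for illustration and are much simpler than what the actual pass tracks.

```python
# Toy model of the loop-header merge for vsetvli state. The header has
# two predecessors: the preheader (whose vsetvli config reaches it) and
# the latch via the backedge (whatever the previous iteration left set).
# An in-loop vsetvli is provably redundant only if both incoming states
# already match what the loop body demands.

def merge(a, b):
    """Meet of two incoming vsetvli states; None means 'unknown'."""
    if a is None or b is None:
        return None
    return a if a == b else None

preheader_state = (32, "m1")  # config set by the vsetvli in the preheader
backedge_state = (32, "m1")   # config left by the previous iteration
demanded = (32, "m1")         # what the loop body's instructions need

header_state = merge(preheader_state, backedge_state)
redundant = header_state == demanded
print(redundant)  # True: both edges already provide the demanded config
```

On a first analysis pass the backedge state is still unknown, so proving this requires either optimistic iteration or a second pass, which is one reason such redundant vsetvlis can survive today.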
Pipeline cost model for vector instructions: D149495, posted by michaelmaitland. We can use -mcpu=sifive-x280 to try it out.