google / android-riscv64

Issues and discussions around RISC-V support in AOSP.
Apache License 2.0

Investigate the current state of Auto-vectorization for RISC-V targets #23

Open appujee opened 1 year ago

appujee commented 1 year ago
topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.
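
For illustration (not from this thread), a plain scalar loop like the hypothetical vadd.c below needs no vector intrinsics; if the loop vectorizer picks it up under -march=rv64gcv, it is this insertion pass that materializes the vsetvli instructions seen in the emitted assembly:

```c
// vadd.c -- hypothetical example, not part of this issue.
// Build (sketch): clang -O3 -march=rv64gcv -S vadd.c
// The loop vectorizer emits the vector loads/stores/adds; the backend's
// RISCVInsertVSETVLI pass then inserts the vsetvli instructions that set up
// SEW/LMUL/policy for those operations.
#include <stddef.h>

void vadd(int *restrict dst, const int *restrict a,
          const int *restrict b, size_t n) {
  for (size_t i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}
```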

nikolaypanchenko commented 1 year ago

Regarding:

Compiling the TSVC benchmark would be a good way to find out if commonly found loop structures are getting vectorized.

What is the success criterion? Is it about comparing vectorizable loops (i.e. loops that would be vectorized if heuristics are disabled) against AArch64 and X86?

idbaev commented 1 year ago

It would be great to get examples from Android where auto-vectorization for RISC-V Vectors (RVV) is not performing as expected, e.g. compared with the X86 and AArch64 targets.

appujee commented 1 year ago

What is the success criterion? Is it about comparing vectorizable loops (i.e. loops that would be vectorized if heuristics are disabled) against AArch64 and X86?

It is usually a good exercise to compare the number of vectorizable loops; it helps tune the vectorizer and find more opportunities for vectorization. We should do this comparative analysis at least w.r.t. AArch64.

Once this is done, we can do a comparative analysis on a larger code base like Android.
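
One way to surface per-loop vectorization decisions for such a comparison (a sketch, not a workflow prescribed in this thread) is clang's optimization remarks. The file name and loop below are hypothetical, but the loop shape is typical of TSVC:

```c
// tsvc_like.c -- illustrative loop in the style of TSVC, not the benchmark itself.
// Build (sketch): clang -O3 -march=rv64gcv -Rpass=loop-vectorize \
//                 -Rpass-missed=loop-vectorize -c tsvc_like.c
// The remarks name each loop that was vectorized and explain missed ones,
// which makes per-target comparisons (rv64gcv vs. cortex-a55) easier.
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
  float s = 0.0f;
  for (size_t i = 0; i < n; ++i)
    s += a[i] * b[i];   // FP reduction: typically vectorized only when FP
                        // reassociation is allowed (e.g. -ffast-math or
                        // -ffp-model=fast)
  return s;
}
```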

appujee commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was wondering whether inlining might introduce redundant vsetvlis. This can also happen in functions with multiple loops.

topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors. vsetvls that aren't explicitly from vsetvl/vsetvlmax intrinsics are inserted on demand. All instructions are created with extra operands holding their lmul, sew, and policy bits. This information is used to insert vsetvlis if they are different from what is available based on instructions or basic blocks preceding them. The code is in llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp.

I was wondering whether inlining might introduce redundant vsetvlis.

vsetvli intrinsics are allowed to CSE as of last week.

This can also happen in functions with multiple loops.

This depends on what style of vector loop you write. If you're using vsetvli inside the loop to avoid tail iterations, then you'll need vsetvlis inside each loop, so none are redundant. If you're using vsetvlmax and operating on whole registers in the loop, then yes, there could be a redundant one for each loop.

The current loop vectorizer operates on whole registers but doesn't use vsetvli intrinsics. The vsetvlis are all inserted by the insertion pass which runs just before machine IR leaves SSA form.
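
A concrete sketch of the two styles, written with the RVV C intrinsics (illustrative only, not from the thread; the spellings follow the __riscv_-prefixed v1.0 intrinsics and may differ on older toolchains):

```c
#include <riscv_vector.h>
#include <stddef.h>

// Style 1: vsetvli inside the loop picks the VL for the remaining elements,
// so there is no separate tail and no per-iteration vsetvli is redundant.
void add_stripmined(int *dst, const int *a, const int *b, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = __riscv_vsetvl_e32m1(n - i);
    vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
    vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
    __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
    i += vl;
  }
}

// Style 2: vsetvlmax once, whole registers inside the loop; a vsetvli
// re-emitted for every iteration would be redundant here.
void add_whole_regs(int *dst, const int *a, const int *b, size_t n) {
  size_t vl = __riscv_vsetvlmax_e32m1();
  size_t i = 0;
  for (; i + vl <= n; i += vl) {
    vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
    vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
    __riscv_vse32_v_i32m1(dst + i, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
  }
  for (; i < n; ++i)   // scalar tail
    dst[i] = a[i] + b[i];
}
```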

appujee commented 1 year ago

vsetvli intrinsics are allowed to CSE as of last week.

Ah, ok. This should be sufficient. Thanks for clarifying.

nikolaypanchenko commented 1 year ago

@appujee could you please provide the options to use for ARM to compile the TSVC benchmark? Do you have specific options for RISC-V?

appujee commented 1 year ago

Try -mcpu=cortex-a55 for ARM.

For RISC-V, use rv64gcv; please share if you have a CPU flag that gives better vectorization.

nikolaypanchenko commented 1 year ago

The number of loops vectorized as-is using upstream LLVM 5c1b8de77d1c:

| Arch | Number of vectorized loops |
| --- | --- |
| -march=rv64gcv | 1299 |
| -mcpu=cortex-a55 | 904 |

Obviously, the performance of the vectorized loops is a different aspect, but that won't be easy to answer for RISC-V in general.

appujee commented 1 year ago

Nice! Is it possible to know how many loops we start with in both cases? Do we have something like 'number of loops analyzed'? It could be that inlining etc. resulted in a different number of loops to begin with.
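
For reference, the loop vectorizer does keep counters along these lines; a sketch of collecting them (assuming an LLVM build with statistics available, e.g. an assertions build, and a hypothetical input file):

```c
// count_loops.c -- trivial illustrative input, not from the thread.
// Build (sketch): clang -O3 -march=rv64gcv -mllvm -stats -c count_loops.c
// With statistics enabled, the loop-vectorize pass reports counters such as
// "Number of loops analyzed for vectorization" and "Number of loops vectorized".
#include <stddef.h>

void scale(float *x, float k, size_t n) {
  for (size_t i = 0; i < n; ++i)
    x[i] *= k;
}
```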

nikolaypanchenko commented 1 year ago

Updated: my original numbers didn't include loops from tsvc.c

| Arch | fp-model | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- | --- |
| -march=rv64gcv | default (strict for Clang) | 735 | 460 |
| -march=rv64gcv | strict (same as default) | 735 | 460 |
| -march=rv64gcv | fast | 735 | 667 |
| -mcpu=cortex-a55 | default (strict for Clang) | 736 | 176 |
| -mcpu=cortex-a55 | strict (same as default) | 736 | 176 |
| -mcpu=cortex-a55 | fast | 735 | 635 |

Details:

Default fp-model / fp-model=strict, excluding tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 555 | 373 |
| -mcpu=cortex-a55 | 555 | 176 |

Default fp-model / fp-model=strict, tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 180 | 87 |
| -mcpu=cortex-a55 | 181 | 0 |

fp-model=fast, excluding tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 555 | 553 |
| -mcpu=cortex-a55 | 555 | 552 |

fp-model=fast, tsvc.c:

| Arch | #LoopsAnalyzed | #LoopsVectorized |
| --- | --- | --- |
| -march=rv64gcv | 180 | 114 |
| -mcpu=cortex-a55 | 180 | 83 |

appujee commented 1 year ago

That's very promising, as RISC-V is ahead. I've marked the first item as done. Thanks for helping with this.

appujee commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per https://github.com/llvm/llvm-project/issues/58834, there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

topperc commented 1 year ago

There is no instruction scheduling for RISC-V vectors.

vsetvli intrinsics are allowed to CSE as of last week.

As per llvm/llvm-project#58834, there is still room for removing redundant vsetvlis? cc: @topperc @nikolaypanchenko

The code in that ticket does not look like what the current upstream vectorizer or the proposed VP intrinsic vectorizer from our downstream generates, so I don't think it is directly relevant to auto-vectorization.

nikolaypanchenko commented 1 year ago

I believe @appujee refers to the third task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be read as covering any vset*vli generated by codegen within a vectorized loop. @topperc, do you know if anyone has started to look at that reported issue?

topperc commented 1 year ago

I believe @appujee refers to the third task:

Eliminate redundant vsetvl instructions. If we are not doing it already, a simple reaching definition analysis should accomplish this.

which may or may not be read as covering any vset*vli generated by codegen within a vectorized loop.

What we currently have doesn't remove any vsetvlis generated by explicit vsetvli intrinsics. We do a reaching-def-like analysis to insert additional vsetvlis wherever we think they are needed to satisfy the SEW, LMUL, tail policy, and mask policy required by the vector load/store/arithmetic instructions.

@topperc, do you know if anyone has started to look at that reported issue?

I don't think anyone has looked at the issue. We have a reaching definition analysis. We detect a mismatch because the preheader edge sees the vsetvli in the preheader and the backedge sees the vsetvli from the previous iteration.
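
As an illustration of the shape being discussed (a hypothetical example, not the reproducer from llvm/llvm-project#58834): when the vector configuration never changes inside the loop, a vsetvli emitted at the top of the body is redundant along the backedge, and proving that requires reasoning about both the preheader edge and the backedge:

```c
#include <riscv_vector.h>
#include <stddef.h>

// The VL/SEW/LMUL state is set once before the loop and reused unchanged in
// every iteration, so any vsetvli re-emitted inside the loop body repeats the
// state already established by the preheader (and by the previous iteration).
void saxpy_fixed_vl(float *y, const float *x, float a, size_t n) {
  size_t vl = __riscv_vsetvlmax_e32m1();
  for (size_t i = 0; i + vl <= n; i += vl) {
    vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);
    vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);
    // vy += a * vx
    __riscv_vse32_v_f32m1(y + i, __riscv_vfmacc_vf_f32m1(vy, a, vx, vl), vl);
  }
}
```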

appujee commented 1 year ago

Pipeline cost model for vector instructions: D149495 posted by michaelmaitland. We can use -mcpu=sifive-x280 to try it out.
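
A minimal way to try it (a sketch; the file name is hypothetical) is to compare the assembly generated with the generic target and with the X280 model:

```c
// sched_try.c -- hypothetical file for experimentation.
// Build (sketch):
//   clang -O3 -march=rv64gcv    -S sched_try.c -o generic.s
//   clang -O3 -mcpu=sifive-x280 -S sched_try.c -o x280.s
// Differences between the two .s files reflect the X280 pipeline model's
// influence on vectorization decisions and instruction scheduling.
#include <stddef.h>

void mul_add(float *restrict d, const float *restrict a,
             const float *restrict b, size_t n) {
  for (size_t i = 0; i < n; ++i)
    d[i] = a[i] * b[i] + a[i];
}
```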