llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.9k stars 11.51k forks source link

[InstCombine] Fails to combine three shufflevectors produced by LoopVectorizer #38140

Open JonPsson opened 6 years ago

JonPsson commented 6 years ago
Bugzilla Link 38792
Version trunk
OS Linux
Attachments reduced testcase
CC @topperc,@efriedma-quic,@hfinkel,@RKSimon,@JonPsson,@rotateright,@uweigand

Extended Description

The LoopVectorizer has interleaved loads and stores. Basically, the loaded elements should pairwise be reversed like

[0 1 2 3] -> [1 0 3 2]

The Vectorizer does not understand this but generates from two interleave groups first a result for the load group, and then makes another shuffle for the store group

%tmp6 = load <4 x i64>, <4 x i64> %tmp5, align 8 %tmp7 = shufflevector <4 x i64> %tmp6, <4 x i64> undef, <2 x i32> <i32 0, i32 2> %tmp8 = shufflevector <4 x i64> %tmp6, <4 x i64> undef, <2 x i32> <i32 1, i32 3> %tmp9 = shufflevector <2 x i64> %tmp8, <2 x i64> %tmp7, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %tmp10 = shufflevector <4 x i64> %tmp9, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3> store <4 x i64> %tmp10, <4 x i64> undef, align 8

This results in [1 0 3 2], and I would have hoped that this would become a single shufflevector after instcombine, but this does not happen.

There are comments in InstCombine that this is purposely done very conservatively. It is however clear that this does not give good code on SystemZ.

I wonder if anyone has any idea if InstCombiner should handle this case, or if not, where should this be done. A custom DAGCombine by the target?

bin/opt ./tc_instcombine.ll -mtriple=systemz-unknown -mcpu=z13 -S -o out.opt.ll -instcombine

rotateright commented 6 years ago

Thanks for the feedback!

I'm not seeing how to do this in a target-independent way in instcombine. As noted, we're not allowed to generate shuffle masks that are not obviously lower-able on any target.

SystemZ has a vector permute instruction that can take two (or one) input vectors and make any desired permutation of the elements. Therefore shuffles have the same cost regardless of the indices, as long as all the indices fit inside two vectors. If this is true for other targets also, we could add a method hasVectorPermute() or so and if that method returns yes, then allow InstCombine to choose indices freely. Or is this more complex than that?

Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask target and compare costs for different shuffles?

It seems cleaner to have InstCombine do this (since it is already claiming to do so) rather than having the last-to-run vectorizer (SLP) clean up for earlier vectorizations, or?

Is there a difference between shuffles produced by vectorizers and those in the input program, in regards to if Instcombine should preserve them? If that would be a general rule, then it seems each vectorizer would have to generate shuffles optimally since InstCombine can't improve them.

No, instcombine has no way to determine how the instruction was created AFAIK (metadata?), so that's why it is taking the safe/conservative path.

These questions lead back to llvm-dev discussions about the role of instcombine. Instcombine is not the place for target-specific optimizations (although it does have them for target-specific intrinsics, I think that's considered a mistake).

The general argument is that everyone wants to add functionality to instcombine because it's the easiest solution, but it's too costly in compile-time to do that. That's because instcombine is not particularly efficient, and it appears many (8?) times in the typical -O2 optimization pipeline.

I understand that this is extra muddy for shufflevector because we bypass obviously good IR transforms because we know the backend can't always recover. For simpler ops, we can just add some reversal logic. See for example this pair of patches: https://reviews.llvm.org/D50992 https://reviews.llvm.org/D51553

But that may not be possible for shufflevector.

We could have a separate IR pass specifically for vector op ( http://llvm.org/docs/LangRef.html#vector-operations ) cleanup. It would look similar to the existing instcombine for vector ops, but allow use of the cost model, and only run after the vectorizers or in some other limited way. We might also be able to move and improve InstCombiner::SimplifyDemandedVectorElts() by creating this new pass. If that sounds reasonable, it should be brought up for discussion on llvm-dev.

JonPsson commented 6 years ago

Thanks for the feedback!

I'm not seeing how to do this in a target-independent way in instcombine. As noted, we're not allowed to generate shuffle masks that are not obviously lower-able on any target.

SystemZ has a vector permute instruction that can take two (or one) input vectors and make any desired permutation of the elements. Therefore shuffles have the same cost regardless of the indices, as long as all the indices fit inside two vectors. If this is true for other targets also, we could add a method hasVectorPermute() or so and if that method returns yes, then allow InstCombine to choose indices freely. Or is this more complex than that?

Or perhaps InstCombine could use the TTI->getShuffleCost() method to ask target and compare costs for different shuffles?

It seems cleaner to have InstCombine do this (since it is already claiming to do so) rather than having the last-to-run vectorizer (SLP) clean up for earlier vectorizations, or?

Is there a difference between shuffles produced by vectorizers and those in the input program, in regards to if Instcombine should preserve them? If that would be a general rule, then it seems each vectorizer would have to generate shuffles optimally since InstCombine can't improve them.

rotateright commented 6 years ago

Another possibility: if the vectorizers are creating these sub-optimal sequences, can they be enhanced to squash them as part of their job? It used cost models to produce the shuffles in the 1st place, so it can't be much of a stretch to use those same models to simplify chains of shuffles? Maybe part of SLP's responsibility?

rotateright commented 6 years ago

If there was reason to believe that "shuffle [0 2 1 3]" is no more difficult than "shuffle [1 0 3 2]" on any target, then it would be possible.

But barring that, I'm not seeing how to do this in a target-independent way in instcombine. As noted, we're not allowed to generate shuffle masks that are not obviously lower-able on any target.

So yes, a DAGCombine should do it. It would be nice if we could make these semi-easy common cases available in DAGCombiner using a target-hook to enable them, but even that is probably hard to determine.

x86 gets this with -mattr=avx, so you should be able to replicate whatever it is doing here:

define <4 x i64> @​shufsimp(<4 x i64> %x) { %tmp7 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 0, i32 2> %tmp8 = shufflevector <4 x i64> %x, <4 x i64> undef, <2 x i32> <i32 1, i32 3> %tmp9 = shufflevector <2 x i64> %tmp8, <2 x i64> %tmp7, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %tmp10 = shufflevector <4 x i64> %tmp9, <4 x i64> undef, <4 x i32> <i32 0, i32 2, i32 1, i32 3> ret <4 x i64> %tmp10 }

$ llc -o - shuf.ll -mattr=avx ... vpermilpd $5, %ymm0, %ymm0 ## ymm0 = ymm0[1,0,3,2]