llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.46k stars 11.76k forks source link

vector ZERO_EXTEND gets scalarized #46476

Closed JonPsson closed 4 years ago

JonPsson commented 4 years ago
Bugzilla Link 47132
Resolution FIXED
Resolved on Sep 08, 2020 08:00
Version trunk
OS Linux
Attachments test case
CC @efriedma-quic,@JonPsson,@uweigand

Extended Description

The (first) DAGCombiner will replace a DAG like:

   t8: v2f64 = BUILD_VECTOR ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>
 t10: v2i1 = setcc t2, t8, setoeq:ch

t11: v2i32 = zero_extend t10

in visitZERO_EXTEND() to

      t8: v2f64 = BUILD_VECTOR ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>
    t16: v2i64 = setcc t2, t8, setoeq:ch
  t17: v2i32 = truncate t16
 t19: v2i32 = BUILD_VECTOR Constant:i32<1>, Constant:i32<1>

t20: v2i32 = and t17, t19

This works only with a SETCC operand (DAGCombiner.cpp:10482), so an OR of two SETCC results is currently unhandled, and the type legalizer will then scalarize the zero_extend:

t11: v2i32 = zero_extend t10 WidenVectorResult() WidenVecRes_Convert() -> scalarization

Zero extending a vector of i1 is not quite simple, as it involves first truncating / extending the vector select result, and then and:ing each element with 1. Perhaps it would a custom DAG combine (pre-legalize types) is the right approach given this? Or coudl the common-code that currently just handles a SETCC operand be extended to handle this type of case, maybe?

The DAGTypeLegalizer wants to either widen the InVT (v2i1) or have the InWidenVT to be legal (v4i1), or it will scalarize. I am not sure if that could be made to work, but at first glance it doesn't look right.

llc -O3 -mcpu=z14 -o - ./tc_or_i1.ll -debug-only="isel,legalize-types"

JonPsson commented 4 years ago

6dc3e22

uweigand commented 4 years ago

The <2 x i1> is first promoted to <2 x i64>, and then truncated/widened to

<4 x i32>: The v2i1 SETCC is type-legalized to v4i32 SETCC + v2i64 SIGN_EXTEND_VECTOR_INREG.DAGCombiner then replaces the SIGN_EXTEND_VECTOR_INREG with ANY_EXTEND_VECTOR_INREG. Vector-legalizer then replaces that with VECTOR_SHUFFLE, and then that is lowered to SystemZISD::MERGE_HIGH. So it's part of the promotion from v2i1 to v2i64 via v4i32.

So we end up with two SETCC + VECTOR_SHUFFLE pairs, followed by the OR, followed by the truncation, right? And the truncation is in fact still implemented elementwise (so a BUILD_VECTOR of two scalar TRUNCATEs of EXTRACT_VECTOR_ELT)?

That would make the earlier parts difficult to optimize in DAGCombine since this usually only handles single-use values, and the OR has the two EXTRACT_VECTOR_ELT uses ...

Maybe for such instances it would make sense to have something like a TRUNCATE_VECTOR_INREG after all? Then we would at some point see a sequence of TRUNCATE_VECTOR_INREG (OR (EXTEND_VECTOR_INREG) (EXTEND_VECTOR_INREG)), which could be optimized to a single OR in the inner mode. The equivalent transformation is already done for scalars, I think.

But that is certainly a separate issue.

JonPsson commented 4 years ago

Where does the merge high come from in the first place?

Sorry - this was the test case this time:

define <2 x i32> @​fun(<2 x i32> %Arg1, <2 x i32> %Arg2) { %i3 = icmp eq <2 x i32> %Arg1, zeroinitializer %i5 = icmp eq <2 x i32> %Arg2, zeroinitializer %i6 = or <2 x i1> %i3, %i5 %i7 = zext <2 x i1> %i6 to <2 x i32> ret <2 x i32> %i7 }

The <2 x i1> is first promoted to <2 x i64>, and then truncated/widened to <4 x i32>: The v2i1 SETCC is type-legalized to v4i32 SETCC + v2i64 SIGN_EXTEND_VECTOR_INREG.DAGCombiner then replaces the SIGN_EXTEND_VECTOR_INREG with ANY_EXTEND_VECTOR_INREG. Vector-legalizer then replaces that with VECTOR_SHUFFLE<u, 0, u 1>, and then that is lowered to SystemZISD::MERGE_HIGH. So it's part of the promotion from v2i1 to v2i64 via v4i32.

Elis patch proposed at https://reviews.llvm.org/D86268.

uweigand commented 4 years ago

The insert_vector_elt that fails during isel is:

t111: v4i32 = BUILD_VECTOR undef:i64, Constant:i64<0>, Constant:i64<0>, undef:i64 t78: i64 = sign_extend t100 t112: v4i32 = insert_vector_elt t111, t78, Constant:i32<0>

I thought this was a missing pattern for a truncating insert_vector_elt. But I >suppose this is not worth more attention now since we got rid of that problem...

Ah, I guess this is because the VLVG{B,H,F} patterns require the element to be in register class GR32? It should be possible to fix this.

I am not sure how to best address this, and perhaps this further improvement >belongs in a separate patch...

Agreed. At first glance, it looks like this something the back-end general shuffle / pack logic should be able to handle.

However, I'm not sure exactly how this code is being generated. Where does the merge high come from in the first place?

JonPsson commented 4 years ago

Hmmm... perhaps this promotion issue really just belongs to the test case at hand: with Elis patch I see no additional vector merge instructions on benchmarks...

JonPsson commented 4 years ago

You could argue there's a bug in the SystemZ backend where it's incorrectly marking nodes legal, but really, there isn't any reason for us to generate insert_vector_elt etc. with overwide scalar types.

The insert_vector_elt that fails during isel is:

t111: v4i32 = BUILD_VECTOR undef:i64, Constant:i64<0>, Constant:i64<0>, undef:i64 t78: i64 = sign_extend t100 t112: v4i32 = insert_vector_elt t111, t78, Constant:i32<0>

I thought this was a missing pattern for a truncating insert_vector_elt. But I suppose this is not worth more attention now since we got rid of that problem...

Anyway.. with your patch we now avoid element extraction which is expensive so this is a very nice improvement. However, since the SETCC of v2i32 is first promoted to v2i64, and then truncated back to v2i32, I see a room for further improvement:

trunk:

fun: # @​fun .cfi_startproc

%bb.0:

    vgbm    %v0, 0
    vceqf   %v1, %v24, %v0
    vceqf   %v0, %v26, %v0
    vuphf   %v1, %v1
    vuphf   %v0, %v0
    vo      %v0, %v1, %v0
    vlgvf   %r1, %v0, 3
    vlgvf   %r0, %v0, 1
    nilf    %r1, 1
    vlvgp   %v24, %r1, %r1
    nilf    %r0, 1
    vlvgf   %v24, %r0, 0
    br      %r14

With your patch:

fun: # @​fun .cfi_startproc

%bb.0:

    vgbm    %v0, 0
    vceqf   %v1, %v24, %v0
    vceqf   %v0, %v26, %v0
    vmrhf   %v1, %v1, %v1     // <-
    vmrhf   %v0, %v0, %v0     // <-
    vo      %v0, %v1, %v0
    vrepig  %v1, 1
    vn      %v0, %v0, %v1
    vpkg    %v24, %v0, %v0     // <-
    br      %r14

The vector merge-high and vector pack instrctions should not be needed it seems to me:

[ 0 1 2 3] = vceqf %v1 [ 4 5 6 7] = vceqf %v0 [ 0 0 1 1] = vmrhf %v1, %v1 [ 4 4 5 5] = vmrhf %v0, %v0 [ 8 9 a b] = vo; vn [ 9 b 9 b] = vpgk %v0, %v0

Without the extension to v2i64 and truncation back to v4i32:

[ 0 1 2 3] = vceqf %v1 [ 4 5 6 7] = vceqf %v0 [ 9 b - -] = vo; vn

fun: # @​fun .cfi_startproc

%bb.0:

    vgbm    %v0, 0
    vceqf   %v1, %v24, %v0
    vceqf   %v0, %v26, %v0
    vo      %v0, %v1, %v0
    vrepig  %v1, 1
    vn      %v0, %v0, %v1
    br      %r14

I am not sure how to best address this, and perhaps this further improvement belongs in a separate patch... Right now I see that PromoteIntRes_SETCC() handles the SETCC node by creating a new SETCC with the getSetCCResultType() result type, and then doing SExtOrTrunc to TLI.getTypeToTransfomrTo(v2i1).

While in the general case I would think that the promotion of a vector of i1s must follow the general model, it seems that in certain cases like this should not have to. More specifically, if the result is only used by a (root) ZERO_EXTEND the promotion should go directly to that result type.

What if the target could decide on the type to promote to? In this case the SystemZ target could analyze the DAG and decide that it can promote to v2i32 (and also widen to v4i32?).

Alternatively, maybe this could be done as a target DAGCombine before TypeLegalizer?

efriedma-quic commented 4 years ago

Oh, maybe

I hacked it together quickly as a proof of concept, so I'm not completely surprised I broke something.

Looking at the patch again, I think the second hunk is wrong; we shouldn't need to mess with the type of the BUILD_VECTOR operands. I think it was avoiding a crash somehow, but probably I didn't diagnose the root cause correctly.

My test case didn't seem to produce a bad DAG with that second hunk, which made the truncate become a no-op which is still correct since the BUILD_VECTOR performs an implicit truncation. However, without the second hunk the truncate node is created, which causes the DAG combiner to behave differently and the test case no longer fails.

You could argue there's a bug in the SystemZ backend where it's incorrectly marking nodes legal, but really, there isn't any reason for us to generate insert_vector_elt etc. with overwide scalar types.

It makes sense to me to remove the second hunk I think, but I have not seen any crash without it like you mentioned... I tried to make a test case that had a wider WidenVT (comparing v8i16 and zero extending the i1 vector to v2i64, but that ended up to be handled by splitting, so it didn't seem to reach here.

Okay, that seems fine.

JonPsson commented 4 years ago

I hacked it together quickly as a proof of concept, so I'm not completely surprised I broke something.

Looking at the patch again, I think the second hunk is wrong; we shouldn't need to mess with the type of the BUILD_VECTOR operands. I think it was avoiding a crash somehow, but probably I didn't diagnose the root cause correctly.

My test case didn't seem to produce a bad DAG with that second hunk, which made the truncate become a no-op which is still correct since the BUILD_VECTOR performs an implicit truncation. However, without the second hunk the truncate node is created, which causes the DAG combiner to behave differently and the test case no longer fails.

It makes sense to me to remove the second hunk I think, but I have not seen any crash without it like you mentioned... I tried to make a test case that had a wider WidenVT (comparing v8i16 and zero extending the i1 vector to v2i64, but that ended up to be handled by splitting, so it didn't seem to reach here.

I could now build SPEC (17) and it seems that there are some 25 cases where this triggers, which is kind of what I was expecting:

nilf : 10770 10723 -47 vlgvf : 7385 7338 -47 vn : 2091 2116 +25 vpkg : 914 939 +25 vlvgf : 2698 2673 -25 vlvgp : 9634 9611 -23 vrepig : 1503 1513 +10 ...

The SystemZ/CodeGen tests are also passing.

efriedma-quic commented 4 years ago

I hacked it together quickly as a proof of concept, so I'm not completely surprised I broke something.

Looking at the patch again, I think the second hunk is wrong; we shouldn't need to mess with the type of the BUILD_VECTOR operands. I think it was avoiding a crash somehow, but probably I didn't diagnose the root cause correctly.

JonPsson commented 4 years ago

reduced testcase (for Elis patch)

Jonas, how does your original test case perform with Eli's patch?

The patch seemed to work fine also on whole benchmark files with vectorized loops, but when I tried to build all of SPEC, I ran into a problem during isel:

llc -O3 -mcpu=z14 -o - ./tc_widenvecres_convert_FAIL.ll

LLVM ERROR: Cannot select: t131: v4i32 = insert_vector_elt t130, t84, Constant:i32<0> t130: v4i32 = BUILD_VECTOR undef:i64, Constant:i64<0>, Constant:i64<0>, undef:i64 t89: i64 = undef t3: i64 = Constant<0> t3: i64 = Constant<0> t89: i64 = undef t84: i64 = sign_extend t117 t117: i32 = sub Constant:i32<0>, t116 t41: i32 = Constant<0> t116: i32 = and t72, Constant:i32<1> t72: i32 = extract_vector_elt t70, Constant:i32<3> t70: v4i32 = bitcast t39 t39: v2i64 = xor t140, t33 t140: v2i64 = SystemZISD::REPLICATE t118 t118: i64 = any_extend t125 t125: i32 = SystemZISD::SELECT_CCMASK Constant:i32<1>, Constant:i32<0>, TargetConstant:i32<14>, TargetConstant:i32<6>, t122

          t33: v2i64 = BUILD_VECTOR Constant:i64<1>, Constant:i64<1>
            t32: i64 = Constant<1>
            t32: i64 = Constant<1>
      t71: i32 = Constant<3>
    t7: i32 = Constant<1>

t41: i32 = Constant<0> In function: fun

uweigand commented 4 years ago

Hmm. Looks like the main problem is that the zero_extend node involves two types that prefer different legalization strategies:

v2i1 would prefer promotion to v2i64 v2i32 would prefer widening to v4i32

However, the main loop in DAGTypeLegalizer::run initially only uses the result type to decide a strategy, which implies widening here.

What we really want here is to use the promoted v2i64 input, and transform it into a v4i32 output, which here means to perform a widened truncate.

And this is in fact pretty much what the proposed patch does, so that looks good to me. It is true that we still fall down into the BUILD_VECTOR case. We avoid this for the extension cases by making use of the *_EXTEND_VECTOR_INREG opcodes. I guess we could do the same for truncation if we had an analogous TRUNCATE_VECTOR_INREG operation.

On the other hand, there's probably not much semantic difference between a hypothetical TRUNCATE_VECTOR_INREG and open-coding the same via a BUILD_VECTOR of TRUNCATEs, which may be why those optimize well.

Jonas, how does your original test case perform with Eli's patch?

efriedma-quic commented 4 years ago

Maybe we could try to do something in WidenVecRes_Convert? It's not clear what, exactly, we would do. An arbitrary v2i1 will promote to v2i64; I guess we could perform that promotion in WidenVecRes_Convert? Something like the following; we still end up with a BUILD_VECTOR, but it seems to optimize more effectively.

diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp index 756a210..6d8833f 100644 --- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp @@ -3290,6 +3290,18 @@ SDValue DAGTypeLegalizer::WidenVecRes_Convert(SDNode *N) { unsigned Opcode = N->getOpcode(); unsigned InVTNumElts = InVT.getVectorNumElements(); const SDNodeFlags Flags = N->getFlags(); +