llvm / llvm-project


[InstCombine][X86] Failure to replace @llvm.x86.sse41.pblendvb with select #66513

Open · RKSimon opened this issue 1 year ago

RKSimon commented 1 year ago

https://godbolt.org/z/n8qYT9hvc

In many cases we can replace SSE pblendvb intrinsics with select instructions, by determining that the condition elements are sign-extended compare results (or a logical combination of them).

But in some circumstances the logic fails to simplify, and we end up stuck with the pblendvb intrinsics, which prevents further generic folds from occurring.
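
As a point of reference, the case that does fold looks something like this (a minimal hand-written sketch; the function name @simple and the values are hypothetical, not taken from the reproducer):

```ll
; Minimal sketch: the mask is visibly a sign-extended <16 x i1> compare,
; so InstCombine can rewrite the x86-specific intrinsic as a generic
; select on the boolean vector.
define <16 x i8> @simple(<16 x i8> %x, <16 x i8> %y) {
  %cmp = icmp sgt <16 x i8> %x, zeroinitializer
  %mask = sext <16 x i1> %cmp to <16 x i8>
  %blend = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %y, <16 x i8> %x, <16 x i8> %mask)
  ; ... which becomes: select <16 x i1> %cmp, <16 x i8> %x, <16 x i8> %y
  ret <16 x i8> %blend
}
declare <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8>, <16 x i8>, <16 x i8>)
```

The reproducer below, by contrast, gets stuck: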

```c
#include <x86intrin.h>
auto tricky(__m128i a, __m128i b, __m128i c, __m128i src) {
    // Valid (> 0) weights
    __m128i aValid = _mm_cmpgt_epi32( a, _mm_setzero_si128() );
    __m128i bValid = _mm_cmpgt_epi32( b, _mm_setzero_si128() );
    __m128i cValid = _mm_cmpgt_epi32( c, _mm_setzero_si128() );
    __m128i bothValid = _mm_and_si128( aValid, bValid );
    __m128i allValid = _mm_xor_si128( bothValid, cValid );

    // Force a / b
    __m128i forceA = _mm_and_si128( allValid, aValid );
    __m128i forceB = _mm_and_si128( allValid, bValid );

    // Determine output
    __m128i out = _mm_and_si128( src, bothValid );
    out = _mm_blendv_epi8( out, a, forceA );
    out = _mm_blendv_epi8( out, b, forceB );    
    return out;
}
```

```ll
define <2 x i64> @tricky(<2 x i64> noundef %a, <2 x i64> noundef %b, <2 x i64> noundef %c, <2 x i64> noundef %src) {
entry:
  %0 = bitcast <2 x i64> %a to <4 x i32>
  %cmp.i = icmp sgt <4 x i32> %0, zeroinitializer
  %sext.i = sext <4 x i1> %cmp.i to <4 x i32>
  %1 = bitcast <4 x i32> %sext.i to <2 x i64>
  %2 = bitcast <2 x i64> %b to <4 x i32>
  %cmp.i21 = icmp sgt <4 x i32> %2, zeroinitializer
  %sext.i22 = sext <4 x i1> %cmp.i21 to <4 x i32>
  %3 = bitcast <4 x i32> %sext.i22 to <2 x i64>
  %4 = bitcast <2 x i64> %c to <4 x i32>
  %cmp.i23 = icmp sgt <4 x i32> %4, zeroinitializer
  %sext.i24 = sext <4 x i1> %cmp.i23 to <4 x i32>
  %5 = bitcast <4 x i32> %sext.i24 to <2 x i64>
  %and.i = and <2 x i64> %3, %1
  %xor.i = xor <2 x i64> %and.i, %5
  %and.i25 = and <2 x i64> %xor.i, %1
  %and.i26 = and <2 x i64> %xor.i, %3
  %and.i27 = and <2 x i64> %and.i, %src
  %6 = bitcast <2 x i64> %and.i27 to <16 x i8>
  %7 = bitcast <2 x i64> %a to <16 x i8>
  %8 = bitcast <2 x i64> %and.i25 to <16 x i8>
  %9 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %6, <16 x i8> %7, <16 x i8> %8)
  %10 = bitcast <2 x i64> %b to <16 x i8>
  %11 = bitcast <2 x i64> %and.i26 to <16 x i8>
  %12 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %9, <16 x i8> %10, <16 x i8> %11)
  %13 = bitcast <16 x i8> %12 to <2 x i64>
  ret <2 x i64> %13
}
declare <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8>, <16 x i8>, <16 x i8>)
```

I don't know if it's the endless bitcasts to/from <2 x i64> due to the __m128i type, or if something else is going on.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

llvmbot commented 8 months ago

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request that it be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use `make check-clang-ast` to run only Clang's AST tests.
  4. Create a Git commit.
  5. Run `git clang-format HEAD~1` to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot commented 8 months ago

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

anematode commented 8 months ago

(This is my first time tinkering with LLVM, so please bear with me.)

The relevant pass is

https://github.com/llvm/llvm-project/blob/966b026785a09ec079e8b0ba79358892fcb958ad/llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp#L2697-L2701

which looks through bitcasts and checks whether the original mask is a sign-extended i1 vector. In the missed-optimization case the mask is instead a <2 x i64>, because the comparisons are sign-extended too early and then combined as wider vectors.
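
For context, the check in question looks roughly like this (paraphrased from the linked lines; treat the details as approximate rather than the exact upstream code):

```cpp
// Paraphrase of the blendv handling in X86InstCombineIntrinsic.cpp:
// look through bitcasts on the mask, then require a sign-extended
// <N x i1> value before rewriting the intrinsic as a select.
Value *BoolVec;
Mask = InstCombiner::peekThroughBitcast(Mask);
if (match(Mask, m_SExt(m_Value(BoolVec))) &&
    BoolVec->getType()->isVectorTy() &&
    BoolVec->getType()->getScalarSizeInBits() == 1) {
  // ... build: select BoolVec, Op1, Op0 ...
}
```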

If I just remove the one-use check at

https://github.com/llvm/llvm-project/blob/300425cea51ef566a4d38e57afd9a7ae8024a682/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1787-L1793

then everything is successfully converted to selects:

```ll
define noundef <2 x i64> @_Z6trickyDv2_xS_S_S_(<2 x i64> noundef %a, <2 x i64> noundef %b, <2 x i64> noundef %c, <2 x i64> noundef %src) local_unnamed_addr #0 {
entry:
  %0 = bitcast <2 x i64> %a to <4 x i32>
  %cmp.i23 = icmp sgt <4 x i32> %0, zeroinitializer
  %1 = bitcast <2 x i64> %b to <4 x i32>
  %cmp.i21 = icmp sgt <4 x i32> %1, zeroinitializer
  %2 = bitcast <2 x i64> %c to <4 x i32>
  %cmp.i = icmp sgt <4 x i32> %2, zeroinitializer
  %and.i272829 = and <4 x i1> %cmp.i21, %cmp.i23
  %xor.i3031 = xor <4 x i1> %and.i272829, %cmp.i
  %and.i263233 = and <4 x i1> %xor.i3031, %cmp.i23
  %and.i253435 = and <4 x i1> %xor.i3031, %cmp.i21
  %3 = bitcast <2 x i64> %src to <4 x i32>
  %4 = select <4 x i1> %and.i272829, <4 x i32> %3, <4 x i32> zeroinitializer
  %5 = select <4 x i1> %and.i263233, <4 x i32> %0, <4 x i32> %4
  %6 = select <4 x i1> %and.i253435, <4 x i32> %1, <4 x i32> %5
  %7 = bitcast <4 x i32> %6 to <2 x i64>
  ret <2 x i64> %7
}
```

This behavior makes sense: aValid and bValid are each used twice, so the transformation is skipped. But I'm uncertain whether this is a reasonable change, and it's fairly brittle, because a lot of combining transformations on i1 vectors apparently have this kind of check.

RKSimon commented 8 months ago

Yes, that was my concern when I reported it. I started looking at the "other end", i.e. whether we could delay the sexts/bitcasts further so the logic is performed on vXi1 types, but then got distracted (as usual...).

Alternatively, we could try a lot harder to remove all the bitcasts to/from vXi64 around the logic. SSE/AVX code suffers from this a lot, since the integer intrinsics (_mm_and_si128 etc.) all cast to/from v2i64.
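
To make the churn concrete, here is a hand-written illustration (not compiler output, and the function name @churn is invented) of the pattern the _mm_and_si128 casts produce:

```ll
; The AND happens on <2 x i64> even though both operands are really
; sign-extended <4 x i1> compare results.
define <2 x i64> @churn(<4 x i1> %cmp.a, <4 x i1> %cmp.b) {
  %sa = sext <4 x i1> %cmp.a to <4 x i32>
  %ba = bitcast <4 x i32> %sa to <2 x i64>
  %sb = sext <4 x i1> %cmp.b to <4 x i32>
  %bb = bitcast <4 x i32> %sb to <2 x i64>
  %and = and <2 x i64> %ba, %bb
  ; Ideally this would become logic on the i1 vector instead:
  ;   %and.i1 = and <4 x i1> %cmp.a, %cmp.b
  ;   %sext   = sext <4 x i1> %and.i1 to <4 x i32>
  ;   %and    = bitcast <4 x i32> %sext to <2 x i64>
  ret <2 x i64> %and
}
```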

It could be that this needs to move into VectorCombine instead, where we're allowed to do cost comparisons.

anematode commented 8 months ago

What if the one-use check were amended to

`Cast0Src->getType()->isVectorTy() || Cast0->hasOneUse() || Cast1->hasOneUse()`

or even included an additional check for an i1 element type, since that case is more likely than scalar code to benefit from this transformation? For reference, here is the commit where the check was added.
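
In context, the stricter i1-vector variant of the amended guard might read something like this (a sketch only; the variable names follow the linked code, but the exact structure upstream may differ):

```cpp
// Sketch of a relaxed guard in foldCastedBitwiseLogic: still allow the
// fold for multi-use casts when the cast source is an i1 vector, where
// narrowing the logic to the boolean vector is very likely profitable.
Type *Cast0SrcTy = Cast0->getOperand(0)->getType();
bool IsBoolVec = Cast0SrcTy->isVectorTy() &&
                 Cast0SrcTy->getScalarSizeInBits() == 1;
if (!IsBoolVec && !Cast0->hasOneUse() && !Cast1->hasOneUse())
  return nullptr; // keep the conservative behavior for everything else
```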

RKSimon commented 8 months ago

@anematode Removing one-use limits doesn't usually work out well.