[Open] RKSimon opened this issue 1 year ago
Hi!
This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, note that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests, and run git clang-format HEAD~1 to format your changes. If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
Author: Simon Pilgrim (RKSimon)
(This is my first time tinkering with LLVM, so please bear with me.)
The relevant fold ignores bitcasts and checks whether the original mask is a sign-extended i1 vector. In the missed-optimization case the mask is instead a <2 x i64>, because the comparisons are being sign-extended too early and are combined as wider vectors.
If I just remove the one-use check, then everything successfully becomes select:
define noundef <2 x i64> @_Z6trickyDv2_xS_S_S_(<2 x i64> noundef %a, <2 x i64> noundef %b, <2 x i64> noundef %c, <2 x i64> noundef %src) local_unnamed_addr #0 {
entry:
%0 = bitcast <2 x i64> %a to <4 x i32>
%cmp.i23 = icmp sgt <4 x i32> %0, zeroinitializer
%1 = bitcast <2 x i64> %b to <4 x i32>
%cmp.i21 = icmp sgt <4 x i32> %1, zeroinitializer
%2 = bitcast <2 x i64> %c to <4 x i32>
%cmp.i = icmp sgt <4 x i32> %2, zeroinitializer
%and.i272829 = and <4 x i1> %cmp.i21, %cmp.i23
%xor.i3031 = xor <4 x i1> %and.i272829, %cmp.i
%and.i263233 = and <4 x i1> %xor.i3031, %cmp.i23
%and.i253435 = and <4 x i1> %xor.i3031, %cmp.i21
%3 = bitcast <2 x i64> %src to <4 x i32>
%4 = select <4 x i1> %and.i272829, <4 x i32> %3, <4 x i32> zeroinitializer
%5 = select <4 x i1> %and.i263233, <4 x i32> %0, <4 x i32> %4
%6 = select <4 x i1> %and.i253435, <4 x i32> %1, <4 x i32> %5
%7 = bitcast <4 x i32> %6 to <2 x i64>
ret <2 x i64> %7
}
This behavior makes sense: aValid and bValid are both used twice, so the transformation is skipped. But I'm uncertain whether this is a reasonable change, and it's fairly brittle, because a lot of combining transformations on i1 vectors apparently have this kind of one-use check.
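To make the behavior concrete, here is a standalone model of that kind of one-use guard (purely illustrative; the struct and function names are assumptions, not the actual InstCombine code):

```cpp
#include <cassert>

// Simplified model of the one-use guard discussed above (hypothetical,
// not the real InstCombine code): the fold through the two casts is only
// attempted when at least one of the casts will die after the transform.
struct CastOp {
  int NumUses;
  bool hasOneUse() const { return NumUses == 1; }
};

bool foldIsAllowed(const CastOp &Cast0, const CastOp &Cast1) {
  // With aValid and bValid each feeding two users, both casts are
  // multi-use, so this returns false and the select fold is skipped.
  return Cast0.hasOneUse() || Cast1.hasOneUse();
}
```

In the reported IR both compare masks have two users, so under this model the fold never fires.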
Yes, that was my concern when I reported it - I started looking at the "other end", i.e. whether we could delay the sext/bitcasts further so the logic is performed on vXi1 types, but then got distracted (as usual...).
Or we try a lot harder to remove all the bitcasts to/from vXi64 around the logic - SSE/AVX code suffers from this a lot as the intrinsics (_mm_and_si128 etc.) all cast to/from v2i64.
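A minimal sketch of that casting churn (illustrative, not taken from the original report): each compare below naturally produces a <4 x i32> mask, but _mm_and_si128 is typed as __m128i, i.e. <2 x i64>, so the frontend has to bitcast into and out of the logic op:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical reduction of the pattern: the compares produce <4 x i32>
 * masks, but _mm_and_si128 operates on __m128i (<2 x i64>), so clang
 * emits bitcasts around the AND even though it is lane-width agnostic. */
static __m128i mask_and(__m128i a, __m128i b) {
  __m128i ca = _mm_cmpgt_epi32(a, _mm_setzero_si128()); /* <4 x i32> cmp */
  __m128i cb = _mm_cmpgt_epi32(b, _mm_setzero_si128());
  /* logically a <4 x i1> AND, but typed as <2 x i64> at the source level */
  return _mm_and_si128(ca, cb);
}

/* Helper for testing: run mask_and on fixed inputs, return 32-bit lane i. */
static uint32_t mask_and_lane(int i) {
  uint32_t out[4];
  __m128i r = mask_and(_mm_setr_epi32(1, -1, 2, 3),
                       _mm_setr_epi32(5, 6, -7, 8));
  _mm_storeu_si128((__m128i *)out, r);
  return out[i];
}
```

Lanes where both inputs are positive come back as all-ones (0xFFFFFFFF), the rest as zero.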
It could be that this needs to be put inside VectorCombine instead, where we're allowed to do cost comparisons.
What if the one-use check were amended to Cast0Src->getType()->isVectorTy() || Cast0->hasOneUse() || Cast1->hasOneUse(), or even given an additional check for an i1 vector width? (That case is more likely to benefit from this transformation than scalar code.) For reference, here is the commit where the check was added.
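As a boolean sketch of that amended predicate (purely illustrative; IsVectorSrc stands in for Cast0Src->getType()->isVectorTy(), and the function name is made up):

```cpp
#include <cassert>

// Hypothetical sketch of the amended guard: vector sources would be let
// through even when both casts are multi-use, on the assumption that
// i1-vector selects profit from the fold more often than scalar code.
bool amendedFoldIsAllowed(bool IsVectorSrc, bool Cast0OneUse,
                          bool Cast1OneUse) {
  return IsVectorSrc || Cast0OneUse || Cast1OneUse;
}
```

Scalar multi-use patterns would keep the existing behavior; only the vector case is widened.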
@anematode Removing one-use limits doesn't usually work out well
https://godbolt.org/z/n8qYT9hvc
In many cases we can replace SSE pblendvb intrinsics with select nodes, by determining that the condition element is a sign-extended compare result (or logic combination of them).
But in some circumstances the logic fails to simplify and we end up stuck with the pblendvb intrinsics, which prevents further generic folds from occurring.
I don't know if it's the endless bitcasts to/from <2 x i64> due to the __m128i type, or if something else is going on.
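For concreteness, here is a hypothetical C reconstruction of the kind of source that produces the IR posted earlier in the thread (the logic is inferred from that IR, not the original test case; the target attribute is only there so the blends compile to pblendvb without extra build flags):

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical reconstruction of the "tricky" pattern from the posted IR.
 * Each select in the IR corresponds to a _mm_blendv_epi8 here; the compare
 * masks are full-lane 0/-1, so byte-granular blendv acts as a lane select. */
__attribute__((target("sse4.1")))
static __m128i tricky(__m128i a, __m128i b, __m128i c, __m128i src) {
  __m128i z  = _mm_setzero_si128();
  __m128i ca = _mm_cmpgt_epi32(a, z);      /* aValid */
  __m128i cb = _mm_cmpgt_epi32(b, z);      /* bValid */
  __m128i cc = _mm_cmpgt_epi32(c, z);
  __m128i ab = _mm_and_si128(ca, cb);
  __m128i x  = _mm_xor_si128(ab, cc);
  __m128i r  = _mm_and_si128(src, ab);     /* select ab ? src : 0 */
  r = _mm_blendv_epi8(r, a, _mm_and_si128(x, ca));
  r = _mm_blendv_epi8(r, b, _mm_and_si128(x, cb));
  return r;
}

/* Helper for testing: run tricky() on fixed inputs, return 32-bit lane i. */
static int32_t tricky_lane(int i) {
  int32_t out[4];
  __m128i r = tricky(_mm_setr_epi32(1, 7, -1, 3),
                     _mm_setr_epi32(2, -1, 5, 4),
                     _mm_setr_epi32(3, 1, 1, -1),
                     _mm_setr_epi32(100, 200, 300, 400));
  _mm_storeu_si128((__m128i *)out, r);
  return out[i];
}
```

Per lane: when a, b, c are all positive the result is src; when exactly one of a/b is positive alongside c, that operand is picked; when a and b are positive but c is not, b wins via the last blend.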