RKSimon opened this issue 7 years ago
CC'ing @nikic who did something similar for scalar style patterns in D101232
Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

- The subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
- Run git clang-format HEAD~1 to format your changes.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@llvm/issue-subscribers-good-first-issue
Author: Simon Pilgrim (RKSimon)
I think the alt_cmpeq_epi64 case at least might be a good first issue to do in instcombine (or vectorcombine)?
define <2 x i64> @alt_cmpeq_epi64(<2 x i64> %a, <2 x i64> %b) {
entry:
%0 = bitcast <2 x i64> %a to <4 x i32>
%1 = bitcast <2 x i64> %b to <4 x i32>
%cmp.i = icmp eq <4 x i32> %0, %1
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
%and.i = bitcast <4 x i32> %and.i3 to <2 x i64>
ret <2 x i64> %and.i
}
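To see why this IR computes a 64-bit lane-wise equality, here is a scalar C model of the pattern above (plain C standing in for the vector operations; the function name and layout are illustrative, not LLVM code):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the alt_cmpeq_epi64 IR: compare the two 64-bit lanes
   as four 32-bit lanes, then AND each lane's result with its pair-swapped
   neighbour (the <1, 0, 3, 2> shufflevector). */
static void alt_cmpeq_epi64_model(const uint32_t a[4], const uint32_t b[4],
                                  uint32_t out[4]) {
    uint32_t sext[4], permil[4];
    static const int mask[4] = {1, 0, 3, 2};    /* shufflevector mask */
    for (int i = 0; i < 4; i++)
        sext[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0;  /* icmp eq + sext */
    for (int i = 0; i < 4; i++)
        permil[i] = sext[mask[i]];               /* pair swap */
    for (int i = 0; i < 4; i++)
        out[i] = sext[i] & permil[i];            /* the select/and */
}
```

Each pair of output lanes is all-ones exactly when both of its 32-bit halves match, i.e. exactly when the containing 64-bit lane compares equal.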
The SELECT might appear as an AND depending on how early we fold.
@RKSimon I would like to tackle this one as my first issue. Please let me know if you have any guidance on how/where to start. Thank you!
I'd probably start at trying to match this in VectorCombine:
define <4 x i32> @alt_cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
ret <4 x i32> %and.i3
}
folding to
define <4 x i32> @cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%bc.i = bitcast <4 x i32> %a to <2 x i64>
%bc.i1 = bitcast <4 x i32> %b to <2 x i64>
%cmp.i = icmp eq <2 x i64> %bc.i, %bc.i1
%sext.i = sext <2 x i1> %cmp.i to <2 x i64>
%and.i3 = bitcast <2 x i64> %sext.i to <4 x i32>
ret <4 x i32> %and.i3
}
@RKSimon I apologize for the delay. I needed to get permission from my work before I could contribute to LLVM. I will continue with my work email (this account).
If I am following correctly, some source code that wants to compare two <2 x i64> vectors for equality instead compares them as two <4 x i32>s, because the code predates SSE41/SSE42. This compiles to the following LLVM IR:
define <4 x i32> @alt_cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
ret <4 x i32> %and.i3
}
which then gets lowered for x86 to something of the form
__m128i alt_cmpeq_epi64(__m128i a, __m128i b) {
__m128i c = _mm_cmpeq_epi32(a, b);
return _mm_and_si128( c, _mm_shuffle_epi32( c, _MM_SHUFFLE(2,3,0,1) ) );
}
Am I only trying to make this specific case not use the v4i32 intrinsic cmpeq_epi32 before it hits codegen? So I should only fold exactly the following case?
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
Where did the (2,3,0,1) come from for a simple element-wise equality check of vectors?
Why does cmpeq_epi64 in the LLVM IR take and return <4 x i32> instead of <2 x i64>? I thought alt_cmpeq_epi64 is supposed to compare two v2i64s and return a v2i64.
How can I test this on the __m128i code to make sure it correctly doesn't use cmpeq_epi32?
Thank you for your help and guidance as I begin my LLVM journey. I am beyond grateful.
There isn't a "cmpeq_epi32" x86 intrinsic in IR - the headers convert it directly to the icmp eq + sext pair, so you just need to handle that. Occasionally the sext might have been removed by a previous optimization, in which case you will see <4 x i1> types - you can match both with m_SExtOrSelf.
The _MM_SHUFFLE(2,3,0,1) shuffle immediate translates to the <i32 1, i32 0, i32 3, i32 2> shuffle mask (x86 intrinsics list the elements in reverse order) - this is the typical way that coders would swap the odd/even pairs of elements to AND them together.
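The reversed ordering is easy to check by decoding the immediate: _MM_SHUFFLE packs its arguments from the high bit pair down, while pshufd reads two bits per destination element starting at element 0. A small C check (the macro is written out here so the snippet doesn't need the x86 headers; it matches xmmintrin.h's definition):

```c
#include <assert.h>

/* Same packing as xmmintrin.h's _MM_SHUFFLE: selectors packed
   high-to-low, two bits each. Renamed to avoid clashing with the header. */
#define MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Destination element i of pshufd takes source element (imm >> (2*i)) & 3. */
static int shuffle_selector(int imm, int i) { return (imm >> (2 * i)) & 3; }
```

Decoding MM_SHUFFLE(2,3,0,1) element by element yields the selectors 1, 0, 3, 2 - exactly the IR shuffle mask.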
Some of the bitcasts to/from the <2 x i64> type have been stripped from the test case as I was trying to avoid confusion - bear in mind that the __m128i SSE type is <2 x i64>, so you always see a lot of extra bitcasts to/from that type in IR that came from SSE intrinsics.
I'd use Godbolt execution to test the intrinsic patterns (a small loop that creates random numbers for comparison), then once you have your optimization in place build clang and run the test loop locally.
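A sketch of that test loop (a portable scalar model per 64-bit lane rather than real intrinsics, so it runs anywhere; names are illustrative). It fuzzes the equivalence that the fold relies on: the legacy 32-bit compare/swap/AND pattern agrees with a direct 64-bit compare on every input:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: direct 64-bit equality, what pcmpeqq computes per lane. */
static uint64_t cmpeq64(uint64_t x, uint64_t y) {
    return (x == y) ? ~0ull : 0;
}

/* Legacy pattern on one 64-bit lane: compare the two 32-bit halves,
   swap them (the <1,0,...> shuffle), and AND. */
static uint64_t alt_cmpeq64(uint64_t x, uint64_t y) {
    uint32_t lo = ((uint32_t)x == (uint32_t)y) ? 0xFFFFFFFFu : 0;
    uint32_t hi = ((uint32_t)(x >> 32) == (uint32_t)(y >> 32)) ? 0xFFFFFFFFu : 0;
    return ((uint64_t)(hi & lo) << 32) | (lo & hi);
}

/* Random loop in the spirit of the suggested Godbolt test; every other
   iteration forces the equal case so both branches get exercised. */
static int fuzz_equivalence(int iters) {
    for (int i = 0; i < iters; i++) {
        uint64_t x = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        uint64_t y = (i & 1) ? x : (((uint64_t)rand() << 32) ^ (uint64_t)rand());
        if (alt_cmpeq64(x, y) != cmpeq64(x, y))
            return 0;
    }
    return 1;
}
```

The same loop works unchanged as a harness around the real `_mm_cmpeq_epi32`-based intrinsics once you are testing the built compiler locally.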
Extended Description
Before SSE41/SSE42, the only way to compare eq/gt v2i64 vectors was to use the v4i32 intrinsics, resulting in quite a bit of legacy code that still uses this pattern (I've only seen this in __m128i code). We should be trying to simplify this where possible.