RKSimon opened this issue 7 years ago
CC'ing @nikic who did something similar for scalar style patterns in D101232
Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

- The subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
- Run git clang-format HEAD~1 to format your changes.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@llvm/issue-subscribers-good-first-issue
Author: Simon Pilgrim (RKSimon)
I think the alt_cmpeq_epi64 case at least might be a good first issue to do in instcombine (or vectorcombine)?
define <2 x i64> @alt_cmpeq_epi64(<2 x i64> %a, <2 x i64> %b) {
entry:
%0 = bitcast <2 x i64> %a to <4 x i32>
%1 = bitcast <2 x i64> %b to <4 x i32>
%cmp.i = icmp eq <4 x i32> %0, %1
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
%and.i = bitcast <4 x i32> %and.i3 to <2 x i64>
ret <2 x i64> %and.i
}
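To see why this IR computes a 64-bit lane-wise equality, here is a scalar C model of the pattern above (plain C standing in for the vector operations; the function name and layout are illustrative, not LLVM code):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the alt_cmpeq_epi64 IR: compare the two 64-bit lanes
   as four 32-bit lanes, then AND each lane's result with its pair-swapped
   neighbour (the <1, 0, 3, 2> shufflevector). */
static void alt_cmpeq_epi64_model(const uint32_t a[4], const uint32_t b[4],
                                  uint32_t out[4]) {
    uint32_t sext[4], permil[4];
    static const int mask[4] = {1, 0, 3, 2};    /* shufflevector mask */
    for (int i = 0; i < 4; i++)
        sext[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0;  /* icmp eq + sext */
    for (int i = 0; i < 4; i++)
        permil[i] = sext[mask[i]];               /* pair swap */
    for (int i = 0; i < 4; i++)
        out[i] = sext[i] & permil[i];            /* the select/and */
}
```

Each pair of output lanes is all-ones exactly when both of its 32-bit halves match, i.e. exactly when the containing 64-bit lane compares equal.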
The SELECT might appear as an AND depending on how early we fold.
@RKSimon I would like to tackle this one as my first issue. Please let me know if you have any guidance on how/where to start. Thank you!
I'd probably start at trying to match this in VectorCombine:
define <4 x i32> @alt_cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
ret <4 x i32> %and.i3
}
folding to
define <4 x i32> @cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%bc.i = bitcast <4 x i32> %a to <2 x i64>
%bc.i1 = bitcast <4 x i32> %b to <2 x i64>
%cmp.i = icmp eq <2 x i64> %bc.i, %bc.i1
%sext.i = sext <2 x i1> %cmp.i to <2 x i64>
%and.i3 = bitcast <2 x i64> %sext.i to <4 x i32>
ret <4 x i32> %and.i3
}
@RKSimon I apologize for the delay. I needed to get permission from my work before I could contribute to LLVM. I will continue with my work email (this account).
If I am following correctly, some source code that wants to compare two <2 x i64> vectors for equality instead compares them as two <4 x i32>s, because the code predates SSE41/SSE42. This compiles to the following LLVM IR:
define <4 x i32> @alt_cmpeq_epi64(<4 x i32> noundef %a, <4 x i32> noundef %b) {
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
ret <4 x i32> %and.i3
}
which then gets lowered for x86 to something of the form
__m128i alt_cmpeq_epi64(__m128i a, __m128i b) {
__m128i c = _mm_cmpeq_epi32(a, b);
return _mm_and_si128( c, _mm_shuffle_epi32( c, _MM_SHUFFLE(2,3,0,1) ) );
}
Am I only trying to make this specific case not use the v4i32 intrinsic cmpeq_epi32 before it hits codegen? So I should only fold exactly the following case?
%cmp.i = icmp eq <4 x i32> %a, %b
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%permil = shufflevector <4 x i32> %sext.i, <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%and.i3 = select <4 x i1> %cmp.i, <4 x i32> %permil, <4 x i32> zeroinitializer
Where did the (2,3,0,1) come from for a simple element-wise equality check of vectors?
Why does cmpeq_epi64 in the LLVM IR take and return <4 x i32> instead of <2 x i64>? I thought alt_cmpeq_epi64 is supposed to compare two v2i64s and return a v2i64.
How can I test this on the __m128i code to make sure it correctly doesn't use cmpeq_epi32?
Thank you for your help and guidance as I begin my LLVM journey. I am beyond grateful.
There isn't a "cmpeq_epi32" x86 intrinsic in IR - the headers convert it directly to the icmp eq + sext pair, so you just need to handle that. Occasionally the sext might have been removed by a previous optimization, in which case you will see <4 x i1> types - you can match both with m_SExtOrSelf.
The _MM_SHUFFLE(2,3,0,1) shuffle immediate translates to the <i32 1, i32 0, i32 3, i32 2> shuffle mask (x86 intrinsics list the elements in reverse order) - this is the typical way that coders would swap the odd/even pairs of elements to AND them together.
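The reversed ordering is easy to check by decoding the immediate: _MM_SHUFFLE packs its arguments from the high bit pair down, while pshufd reads two bits per destination element starting at element 0. A small C check (the macro is written out here so the snippet doesn't need the x86 headers; it matches xmmintrin.h's definition):

```c
#include <assert.h>

/* Same packing as xmmintrin.h's _MM_SHUFFLE: selectors packed
   high-to-low, two bits each. Renamed to avoid clashing with the header. */
#define MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Destination element i of pshufd takes source element (imm >> (2*i)) & 3. */
static int shuffle_selector(int imm, int i) { return (imm >> (2 * i)) & 3; }
```

Decoding MM_SHUFFLE(2,3,0,1) element by element yields the selectors 1, 0, 3, 2 - exactly the IR shuffle mask.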
Some of the bitcasts to/from the <2 x i64> type have been stripped from the test case as I was trying to avoid confusion - bear in mind that the __m128i SSE type is <2 x i64>, so you always see a lot of extra bitcasts to/from that type in IR that came from SSE intrinsics.
I'd use Godbolt execution to test the intrinsic patterns (a small loop that creates random numbers for comparison), then once you have your optimization in place build clang and run the test loop locally.
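A sketch of that test loop (a portable scalar model per 64-bit lane rather than real intrinsics, so it runs anywhere; names are illustrative). It fuzzes the equivalence that the fold relies on: the legacy 32-bit compare/swap/AND pattern agrees with a direct 64-bit compare on every input:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: direct 64-bit equality, what pcmpeqq computes per lane. */
static uint64_t cmpeq64(uint64_t x, uint64_t y) {
    return (x == y) ? ~0ull : 0;
}

/* Legacy pattern on one 64-bit lane: compare the two 32-bit halves,
   swap them (the <1,0,...> shuffle), and AND. */
static uint64_t alt_cmpeq64(uint64_t x, uint64_t y) {
    uint32_t lo = ((uint32_t)x == (uint32_t)y) ? 0xFFFFFFFFu : 0;
    uint32_t hi = ((uint32_t)(x >> 32) == (uint32_t)(y >> 32)) ? 0xFFFFFFFFu : 0;
    return ((uint64_t)(hi & lo) << 32) | (lo & hi);
}

/* Random loop in the spirit of the suggested Godbolt test; every other
   iteration forces the equal case so both branches get exercised. */
static int fuzz_equivalence(int iters) {
    for (int i = 0; i < iters; i++) {
        uint64_t x = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        uint64_t y = (i & 1) ? x : (((uint64_t)rand() << 32) ^ (uint64_t)rand());
        if (alt_cmpeq64(x, y) != cmpeq64(x, y))
            return 0;
    }
    return 1;
}
```

The same loop works unchanged as a harness around the real `_mm_cmpeq_epi32`-based intrinsics once you are testing the built compiler locally.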
Extended Description
Before SSE41/SSE42, the only way to compare eq/gt v2i64 vectors was to use the v4i32 intrinsics, resulting in quite a bit of legacy code that still uses this pattern (I've only seen this in __m128i code). We should be trying to simplify this where possible.