missed optimization: b &= ~a

edwintorok commented 16 years ago


Bugzilla Link	2191
Resolution	FIXED
Resolved on	Jan 21, 2009 00:23
Version	unspecified
OS	Linux
CC	@asl,@isanbard,@sunfishcode,@nlewycky

Extended Description

Using llvm-gcc SVN r49213 on the simple testcase below generates code with extra mov instructions.

unsigned sum(unsigned a, unsigned b) { do { a ^= b; b &= ~a; } while (b <<= 1); return a; }

This is the assembly generated, note the extra movl, and testl after the and:

sum: .align 16 .LBB1_1: # bb xorl %esi, %edi movl %edi, %eax xorl $2147483647, %eax andl %esi, %eax addl %eax, %eax testl %eax, %eax movl %eax, %esi jne .LBB1_1 # bb .LBB1_2: # bb12 movl %edi, %eax ret .size sum, .-sum

This should be a more optimized version (maybe xor could be changed to a not, not sure): sum: .align 16 .LBB1_1: # bb xorl %esi, %edi movl %edi, %eax xorl $2147483647, %eax andl %eax, %esi addl %esi, %esi jne .LBB1_1 # bb .LBB1_2: # bb12 movl %edi, %eax ret .size sum, .-sum

FWIW, gcc doesn't generate the optimized version either, here is what I get with gcc-4.3 -O3: sum: .align 16 .LBB1_1: # bb xorl %esi, %edi movl %edi, %eax xorl $2147483647, %eax andl %esi, %eax addl %eax, %eax testl %eax, %eax movl %eax, %esi jne .LBB1_1 # bb .LBB1_2: # bb12 movl %edi, %

lattner commented 15 years ago

Are you sure evan? I'm getting this code with -m64 -fomit-frame-pointer and a build from yesterday:

_sum: Leh_func_begin1: Llabel1: .align 4,0x90 LBB1_1: ## bb xorl %esi, %edi movl %edi, %eax notl %eax andl %esi, %eax addl %eax, %eax testl %eax, %eax movl %eax, %esi jne LBB1_1 ## bb LBB1_2: ## bb2 movl %edi, %eax ret

it has two copies in the loop.

llvmbot commented 15 years ago

This has been fixed a while back.

llvmbot commented 16 years ago

bb: 24 %ESI = MOV32rr %ESI 28 %EDI = MOV32rr %EDI 32 %EDI = MOV32rr %EDI 36 %EDI = XOR32rr %EDI, %ESI, %EFLAGS<imp-def,dead> 40 %reg1033 = MOV32rr %EDI 44 %reg1033 = NOT32r %reg1033 48 %reg1033 = MOV32rr %reg1033 52 %reg1033 = AND32rr %reg1033, %ESI, %EFLAGS<imp-def,dead> 56 %reg1033 = MOV32rr %reg1033 60 %reg1033 = ADD32rr %reg1033, %reg1033, %EFLAGS<imp-def,dead> 64 TEST32rr %reg1033, %reg1033, %EFLAGS 68 %ESI = MOV32rr %reg1033 72 %EDI = MOV32rr %EDI 76 JNE mbb<bb,0x1404780>, %EFLAGS<imp-use,kill>

Now we want to coalesce the copy instruction at 68. The def of this reg1033 is 60 %reg1033 = ADD32rr %reg1033, %reg1033, %EFLAGS<imp-def,dead>

Of course, commuting this ADD32rr is useless because its "other" operand is also reg1033. The coalescer should be able to see this and commute the following instruction instead: 52 %reg1033 = AND32rr %reg1033, %ESI, %EFLAGS<imp-def,dead>

llvmbot commented 16 years ago

The extra mov is due to a coalescer deficiency. This will require extending RemoveCopyByCommutingDef in SimpleRegisterCoalescing.cpp. It's not an entirely trivial change.

edwintorok commented 16 years ago

This patch looks good to me, plz commit.

committed in SVN r49280. Now these 2 issues remain: http://llvm.org/bugs/show_bug.cgi?id=2191#c3 http://llvm.org/bugs/show_bug.cgi?id=2191#c4

lattner commented 16 years ago

This patch looks good to me, plz commit.

edwintorok commented 16 years ago

5) to check to see if all bits are set, use Expanded.isAllOnesValue() instead of getActiveBits.

I used getActiveBits on purpose, because I wanted to handle 16-bit, and 8-bit case too. Or is the APInt already of correct width?

I added 8-bit, and 16-bit testcases and they work with isAllOnesValue() too. Thanks for the suggestion.

edwintorok commented 16 years ago

updated patch I have tested using make check, and llvm-test on Multisource/Benchmarks.

llvm-test fails on tramp3d-v4, employ, city, health, security-rijndael, smg2000.

make check has 5 failures (vec_shuffle-16.ll, sret2.c, sret.c, vec_shuffle-16.ll, 2008-01-25-ByValReadNone.c, 2007-05-07-NestedStructReturn.c).

The make check failures don't seem to be related, but i will test tomorrow with a build that doesn't have my patch to see if I get same failures.

lattner commented 16 years ago

Ok, since it should be XOR specific, please just inline it into the XOr handling code, thanks!

edwintorok commented 16 years ago

This patch looks fine to me, please apply with these changes:

1) make sure it fits in 80 cols. 2) Please use doxygen comments "///" consistently:

/// we can force undemanded bits in the constant to 1.

// If by doing so we get all bits set, force the bits, and return true.

3) Please put a space between if( and //comment:

//if we can expand it to have all bits set, do it

if(Expan

Ok.

4) strange indentation:

}

}

I don't know what happened there, I probably forgot to reindent the last line :)

5) to check to see if all bits are set, use Expanded.isAllOnesValue() instead of getActiveBits.

I used getActiveBits on purpose, because I wanted to handle 16-bit, and 8-bit case too. Or is the APInt already of correct width?

6) Funny indentation:

SDOperand New = DAG.getNode(Op.getOpcode(), VT, Op.getOperand(0),

DAG.getConstant(Expanded, VT));

will fix.

7) It's strange that ExpandOrShrinkDemandedConstant handles AND, OR, and XOR, but you only call it from the XOR path. Should it be called from the OR/AND path also?

Good point, it should only handle XOR. Expanding AND/OR operands would only harm, because the instruction would probably be larger (as no. of bytes), so shrinking is perfect for and/or.

Also, don't worry about the "not with memory" case: that code cannot be improved. The code the isel is generating is fine.

Ok, I will adjust my testcase accordingly. I'll also add a situation that must not be transformed into a not, and would be incorrect if that would happen (such as xor 55aa, ...)

Finally, please run tests before you commit it.

I will run tests on (part of) llvm-test.

Thanks Edwin!

Thanks for the feedback!

lattner commented 16 years ago

This patch looks fine to me, please apply with these changes:

1) make sure it fits in 80 cols. 2) Please use doxygen comments "///" consistently:

/// we can force undemanded bits in the constant to 1.
// If by doing so we get all bits set, force the bits, and return true.

3) Please put a space between if( and //comment:

//if we can expand it to have all bits set, do it
if(Expan

4) strange indentation:

}
}

5) to check to see if all bits are set, use Expanded.isAllOnesValue() instead of getActiveBits.

6) Funny indentation:

SDOperand New = DAG.getNode(Op.getOpcode(), VT, Op.getOperand(0),
DAG.getConstant(Expanded, VT));

7) It's strange that ExpandOrShrinkDemandedConstant handles AND, OR, and XOR, but you only call it from the XOR path. Should it be called from the OR/AND path also?

Also, don't worry about the "not with memory" case: that code cannot be improved. The code the isel is generating is fine. Finally, please run tests before you commit it.

Thanks Edwin!

edwintorok commented 16 years ago

transform xor into not if possible Currently works for 1/2 testcases in xor_not.ll. Suggestions on how to get the other one to work?

It won't recognize that this is -1, and can be turned into a not: movl $4294967295, %eax xorl 8(%esp), %eax

edwintorok commented 16 years ago

Here is a reduced testcase for the xor/not issue:

unsigned test(unsigned a, unsigned b) { return (a & ~b) >> 1 ; }

with -m64:

_test: xorl $4294967294, %esi ;; Should be "not %esi" andl %edi, %esi movl %esi, %eax shrl %eax ret

The problem is that simplify demanded bits shrinks the "~" mask, dropping bits. We have a solution for this for and/or, but not for xor yet. Adding this to tblgen should be really easy and will shrink codesize if nothing else. To see where this happens, look at "CheckAndMask" and "CheckOrMask" in the generated DAG isel.

Writing a CheckXorMask is hard, because in order for it to be useful, we need demandedBits info, which is not available in SelectionDAGISel.

However I fixed this in SimplifyDemandedBits itself, there was a comment: FIXME: for XOR, we prefer to force bits to 1 if they will make a -1

I introduced an ExpandOrShrinkDemandedConstant method, that tries to expand the constant to all bits 1, if it can't do that, it'll try to shrink.

Both testcases generate a notl now on x86-64, but on x86-32 only one generates a notl.

Attaching patch. To make it work on both testcases it will probably need a pattern in X86InstrInfo.td?

edwintorok commented 16 years ago

There is another missed optimization here too, a redundant movl.

xorl %esi, %edi movl %edi, %eax xorl $2147483647, %eax andl %esi, %eax addl %eax, %eax testl %eax, %eax movl %eax, %esi jne .LBB1_1 # bb

Look at these 2 instructions: andl %esi, %eax ... movl %eax, %esi

It can be written like this, to store directly where it is intended to: andl %eax, %esi

The instructions between them can have renamed registers, so in the end it will be like:

xorl %esi, %edi movl %edi, %eax xorl $2147483647, %eax andl %eax, %esi addl %esi, %esi testl %esi, %esi jne .LBB1_1 # bb

Optimizing xorl, and the redundant movl are the other 2 steps you already mentioned.

lattner commented 16 years ago

Also, the "redundant test" issue is well known, and already in lib/Target/X86/README.txt

lattner commented 16 years ago

Here is a reduced testcase for the xor/not issue:

unsigned test(unsigned a, unsigned b) { return (a & ~b) >> 1 ; }

with -m64:

_test: xorl $4294967294, %esi ;; Should be "not %esi" andl %edi, %esi movl %esi, %eax shrl %eax ret

The problem is that simplify demanded bits shrinks the "~" mask, dropping bits. We have a solution for this for and/or, but not for xor yet. Adding this to tblgen should be really easy and will shrink codesize if nothing else. To see where this happens, look at "CheckAndMask" and "CheckOrMask" in the generated DAG isel.

-Chris

edwintorok commented 16 years ago

Using llvm-gcc SVN r49213 on the simple testcase below generates code with extra mov instructions.

unsigned sum(unsigned a, unsigned b) { do { a ^= b; b &= ~a; } while (b <<= 1); return a; }

And of course all this can be written as 'return a+b' :).

llvm / llvm-project

missed optimization: b &= ~a #2563

Extended Description