Open jrmuizel opened 13 years ago
This looks like we still have duplicate zeros that should be merged:
xorps %xmm3, %xmm3 <- scalar f32 zero
xorps %xmm4, %xmm4 <- vector v4f32 zero
ucomiss %xmm2, %xmm4
seta %cl
cmpltps %xmm3, %xmm1
cmpltps %xmm3, %xmm0
This current code is different enough that this can probably be closed.
.LCPI0_0:
        .long 931135488                 # float 1.52587891E-5
        .long 931135488                 # float 1.52587891E-5
        .long 931135488                 # float 1.52587891E-5
        .long 931135488                 # float 1.52587891E-5
.LCPI0_1:
        .long 931135488                 # float 1.52587891E-5
qcms_profile_is_bogus(qcms_profile):    # @qcms_profile_is_bogus(qcms_profile)
        movsd (%rdi), %xmm0             # xmm0 = mem[0],zero
        movss 8(%rdi), %xmm1            # xmm1 = mem[0],zero,zero,zero
        movaps %xmm0, %xmm2
        shufps $132, %xmm1, %xmm2       # xmm2 = xmm2[0,1],xmm1[0,2]
        movss 24(%rdi), %xmm1           # xmm1 = mem[0],zero,zero,zero
        shufps $32, %xmm2, %xmm1        # xmm1 = xmm1[0,0],xmm2[2,0]
        shufps $36, %xmm1, %xmm0        # xmm0 = xmm0[0,1],xmm1[2,0]
        movsd 12(%rdi), %xmm1           # xmm1 = mem[0],zero
        movsd 28(%rdi), %xmm2           # xmm2 = mem[0],zero
        movlhps %xmm1, %xmm2            # xmm2 = xmm2[0],xmm1[0]
        cvtdq2ps %xmm2, %xmm1
        cvtdq2ps %xmm0, %xmm0
        movaps .LCPI0_0(%rip), %xmm2    # xmm2 = [1.52587891E-5,1.52587891E-5,1.52587891E-5,1.52587891E-5]
        mulps %xmm2, %xmm0
        mulps %xmm2, %xmm1
        xorps %xmm2, %xmm2
        cvtsi2ssl 20(%rdi), %xmm2
        mulss .LCPI0_1(%rip), %xmm2
        xorps %xmm3, %xmm3
        xorps %xmm4, %xmm4
        ucomiss %xmm2, %xmm4
        seta %cl
        cmpltps %xmm3, %xmm1
        cmpltps %xmm3, %xmm0
        packssdw %xmm1, %xmm0
        packsswb %xmm0, %xmm0
        pmovmskb %xmm0, %eax
        testb %al, %al
        setne %al
        orb %cl, %al
        retq
I ran the same test again and got:
25362432 2592768 0 25178112 53133312 32ac000 XUL.before
25358336 2592768 0 25178112 53129216 32ab000 XUL.after
So an improvement this time.
It seems like this change is probably safe to try more widely.
As additional data, I did a build of Firefox with the following code commented out:
// Heuristics #1: Don't CSE "cheap" computation if the def is not local or in
// an immediate predecessor. We don't want to increase register pressure and
// end up causing other computation to be spilled.
if (MI->getDesc().isAsCheapAsAMove()) {
  MachineBasicBlock *CSBB = CSMI->getParent();
  MachineBasicBlock *BB = MI->getParent();
  if (CSBB != BB && !CSBB->isSuccessor(BB))
    return false;
}
size before: 25227264
size after: 25235456
I haven't looked into why there's an increase in code size yet. On some smaller examples, like http://code.google.com/p/double-conversion/source/browse/src/fast-dtoa.cc, this change produced an improvement in code size.
Live range splitting is working with the greedy register allocator. It won't happen for linear scan.
LLVM 3.0 will ship with linear scan as an option, so I would prefer to avoid introducing known regressions until 3.0 has been branched.
If you want to measure the performance impact of CSE'ing cheap instructions with greedy regalloc, that would be welcome data.
I know a bunch of stuff has changed with the register allocator, but has live range splitting been enabled yet?
I believe most cheap-as-a-move instructions are rematerializable, so that would effectively disable the heuristic.
Evan added these heuristics in r98121 because CSE was causing regressions:
Note these are heuristics so machine cse doesn't make register allocator unhappy. Once we have proper live range splitting and re-materialization support in place, these should be evaluated again. Now machine cse is almost always a win on llvm nightly tests on x86 and x86_64.
I agree this should be revisited when live range splitting is enabled.
You are right that this instruction would be rematerialized, but only if it is selected for spilling. The CSE'ed live range would get a low spill weight initially, but I can imagine cases where coalescing would bump up the weight because of two-address instructions.
We can already remat this instruction IIUC, should the heuristic explicitly allow rematable instructions that are as cheap as a move?
From MachineCSE:
// Heuristics #1: Don't CSE "cheap" computation if the def is not local or in
// an immediate predecessor. We don't want to increase register pressure and
// end up causing other computation to be spilled.
if (MI->getDesc().isAsCheapAsAMove()) {
  MachineBasicBlock *CSBB = CSMI->getParent();
  MachineBasicBlock *BB = MI->getParent();
  if (CSBB != BB && !CSBB->isSuccessor(BB))
    return false;
}
Once the register allocator learns to undo CSE, we can enable this optimization.
Extended Description
The following code:
#include <stdint.h>
#include <stdbool.h>
struct colorant {
  int32_t X;
  int32_t Y;
  int32_t Z;
};

struct qcms_profile {
  struct colorant redColorant;
  struct colorant blueColorant;
  struct colorant greenColorant;
};

static inline float s15Fixed16Number_to_float(int32_t a) {
  return ((int32_t)a)/65536.f;
}

bool qcms_profile_is_bogus(struct qcms_profile *profile) {
  float sum[3], target[3], tolerance[3];
  float rX, rY, rZ, gX, gY, gZ, bX, bY, bZ;
  bool negative;
  unsigned i;
}
code compiles (clang -Os -S test.c -o test-clang.s) to:
qcms_profile_is_bogus:                  # @qcms_profile_is_bogus
        movss .LCPI0_0(%rip), %xmm0
        cvtsi2ss (%rdi), %xmm1
        divss %xmm0, %xmm1
        pxor %xmm2, %xmm2
        ucomiss %xmm1, %xmm2
        movb $1, %al
        ja .LBB0_9
        cvtsi2ss 4(%rdi), %xmm1
        divss %xmm0, %xmm1
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 8(%rdi), %xmm1
        divss %xmm0, %xmm1
        pxor %xmm2, %xmm2
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 24(%rdi), %xmm1
        divss %xmm0, %xmm1
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 28(%rdi), %xmm1
        divss %xmm0, %xmm1
        pxor %xmm2, %xmm2
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 32(%rdi), %xmm1
        divss %xmm0, %xmm1
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 12(%rdi), %xmm1
        divss %xmm0, %xmm1
        pxor %xmm2, %xmm2
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 16(%rdi), %xmm1
        divss %xmm0, %xmm1
        ucomiss %xmm1, %xmm2
        ja .LBB0_9
        cvtsi2ss 20(%rdi), %xmm0
        divss .LCPI0_0(%rip), %xmm0
        pxor %xmm1, %xmm1
        ucomiss %xmm0, %xmm1
        seta %al
.LBB0_9:                                # %lor.end
        movzbl %al, %eax
        ret
This repeatedly pxor's a register to get a 0. It seems like we should be able to preserve the 0 for the duration of the function.