llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.25k stars 11.67k forks source link

[regalloc remat] Repeated generation of 0 constant #9960

Open jrmuizel opened 13 years ago

jrmuizel commented 13 years ago
Bugzilla Link 9588
Version trunk
OS All
CC @lattner,@topperc,@RKSimon,@rotateright,@stoklund

Extended Description

The following code:

include

include

struct colorant { int32_t X; int32_t Y; int32_t Z; };

struct qcms_profile { struct colorant redColorant; struct colorant blueColorant; struct colorant greenColorant; };

static inline float s15Fixed16Number_to_float(int32_t a) { return ((int32_t)a)/65536.f; }

bool qcms_profile_is_bogus(struct qcms_profile *profile) { float sum[3], target[3], tolerance[3]; float rX, rY, rZ, gX, gY, gZ, bX, bY, bZ; bool negative; unsigned i;

   rX = s15Fixed16Number_to_float(profile->redColorant.X);
   rY = s15Fixed16Number_to_float(profile->redColorant.Y);
   rZ = s15Fixed16Number_to_float(profile->redColorant.Z);

   gX = s15Fixed16Number_to_float(profile->greenColorant.X);
   gY = s15Fixed16Number_to_float(profile->greenColorant.Y);
   gZ = s15Fixed16Number_to_float(profile->greenColorant.Z);

   bX = s15Fixed16Number_to_float(profile->blueColorant.X);
   bY = s15Fixed16Number_to_float(profile->blueColorant.Y);
   bZ = s15Fixed16Number_to_float(profile->blueColorant.Z);

   // Check if any of the XYZ values are negative (see mozilla bug 498245)
   // CIEXYZ tristimulus values cannot be negative according to the spec.
   negative =
       (rX < 0) || (rY < 0) || (rZ < 0) ||
       (gX < 0) || (gY < 0) || (gZ < 0) ||
       (bX < 0) || (bY < 0) || (bZ < 0);

   if (negative)
       return true;

   // All Good
   return false;

}

code compiles (clang -Os -S test.c -o test-clang.s) to:

qcms_profile_is_bogus: # @​qcms_profile_is_bogus movss .LCPI0_0(%rip), %xmm0 cvtsi2ss (%rdi), %xmm1 divss %xmm0, %xmm1 pxor %xmm2, %xmm2 ucomiss %xmm1, %xmm2 movb $1, %al ja .LBB0_9 cvtsi2ss 4(%rdi), %xmm1 divss %xmm0, %xmm1 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 8(%rdi), %xmm1 divss %xmm0, %xmm1 pxor %xmm2, %xmm2 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 24(%rdi), %xmm1 divss %xmm0, %xmm1 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 28(%rdi), %xmm1 divss %xmm0, %xmm1 pxor %xmm2, %xmm2 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 32(%rdi), %xmm1 divss %xmm0, %xmm1 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 12(%rdi), %xmm1 divss %xmm0, %xmm1 pxor %xmm2, %xmm2 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 16(%rdi), %xmm1 divss %xmm0, %xmm1 ucomiss %xmm1, %xmm2 ja .LBB0_9 cvtsi2ss 20(%rdi), %xmm0 divss .LCPI0_0(%rip), %xmm0 pxor %xmm1, %xmm1 ucomiss %xmm0, %xmm1 seta %al .LBB0_9: # %lor.end movzbl %al, %eax ret

This repeatedly pxor's a register to get a 0. It seems like we should be able to preserve the 0 for the duration of the function.

RKSimon commented 5 years ago

This looks like we still have duplicate zeros that should be merged:

    xorps   %xmm3, %xmm3  <- scalar f32 zero
    xorps   %xmm4, %xmm4  <- vector v4f32 zero
    ucomiss %xmm2, %xmm4
    seta    %cl
    cmpltps %xmm3, %xmm1
    cmpltps %xmm3, %xmm0
jrmuizel commented 5 years ago

This current code is different enough that this can probably be closed.

.LCPI0_0: .long 931135488 # float 1.52587891E-5 .long 931135488 # float 1.52587891E-5 .long 931135488 # float 1.52587891E-5 .long 931135488 # float 1.52587891E-5 .LCPI0_1: .long 931135488 # float 1.52587891E-5 qcms_profile_is_bogus(qcms_profile): # @​qcms_profile_is_bogus(qcms_profile) movsd (%rdi), %xmm0 # xmm0 = mem[0],zero movss 8(%rdi), %xmm1 # xmm1 = mem[0],zero,zero,zero movaps %xmm0, %xmm2 shufps $132, %xmm1, %xmm2 # xmm2 = xmm2[0,1],xmm1[0,2] movss 24(%rdi), %xmm1 # xmm1 = mem[0],zero,zero,zero shufps $32, %xmm2, %xmm1 # xmm1 = xmm1[0,0],xmm2[2,0] shufps $36, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2,0] movsd 12(%rdi), %xmm1 # xmm1 = mem[0],zero movsd 28(%rdi), %xmm2 # xmm2 = mem[0],zero movlhps %xmm1, %xmm2 # xmm2 = xmm2[0],xmm1[0] cvtdq2ps %xmm2, %xmm1 cvtdq2ps %xmm0, %xmm0 movaps .LCPI0_0(%rip), %xmm2 # xmm2 = [1.52587891E-5,1.52587891E-5,1.52587891E-5,1.52587891E-5] mulps %xmm2, %xmm0 mulps %xmm2, %xmm1 xorps %xmm2, %xmm2 cvtsi2ssl 20(%rdi), %xmm2 mulss .LCPI0_1(%rip), %xmm2 xorps %xmm3, %xmm3 xorps %xmm4, %xmm4 ucomiss %xmm2, %xmm4 seta %cl cmpltps %xmm3, %xmm1 cmpltps %xmm3, %xmm0 packssdw %xmm1, %xmm0 packsswb %xmm0, %xmm0 pmovmskb %xmm0, %eax testb %al, %al setne %al orb %cl, %al retq

jrmuizel commented 12 years ago

I ran the same test again and got:

25362432 2592768 0 25178112 53133312 32ac000 XUL.before 25358336 2592768 0 25178112 53129216 32ab000 XUL.after

So an improvement this time.

It seems like this change is probably safe to try more widely.

jrmuizel commented 12 years ago

As some additional data I did a build of firefox with the following code commented out:

// Heuristics #​1: Don't CSE "cheap" computation if the def is not local or in // an immediate predecessor. We don't want to increase register pressure and // end up causing other computation to be spilled. if (MI->getDesc().isAsCheapAsAMove()) { MachineBasicBlock CSBB = CSMI->getParent(); MachineBasicBlock BB = MI->getParent(); if (CSBB != BB && !CSBB->isSuccessor(BB)) return false; }

size before: 25227264 size after: 25235456

I haven't looked into why there's an increase in code size yet. On some smaller examples, like http://code.google.com/p/double-conversion/source/browse/src/fast-dtoa.cc this produced an improvement in code size.

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 13 years ago

Live range splitting is working with the greedy register allocator. It won't happen for linear scan.

LLVM 3.0 will ship with linear scan as an option, so I would prefer to avoid introducing known regressions until 3.0 has been branched.

If you want to measure the performance impact of CSE'ing cheap instructions with greedy regalloc, that would be welcome data.

jrmuizel commented 13 years ago

I know a bunch of stuff has changed with the register allocator, but has live range splitting been enabled yet?

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 13 years ago

I believe most cheap-as-a-move instructions are rematerializable, so that would effectively disable the heuristic.

Evan added these heuristics in r98121 because CSE was causing regressions:

Note these are heuristics so machine cse doesn't make register allocator unhappy. Once we have proper live range splitting and re-materialization support in place, these should be evaluated again. Now machine cse is almost always a win on llvm nightly tests on x86 and x86_64.

I agree this should be revisited when live range splitting is enabled.

You are right that this instruction would be rematerialized, but only if it is selected for spilling. The CSE'ed live range would get a low spill weight initially, but I can imagine cases where coalescing would bump up the weight because of two-address instructions.

lattner commented 13 years ago

We can already remat this instruction IIUC, should the heuristic explicitly allow rematable instructions that are as cheap as a move?

1ba3d143-a64b-4671-82b2-0b31cfb91709 commented 13 years ago

From MachineCSE:

// Heuristics #​1: Don't CSE "cheap" computation if the def is not local or in // an immediate predecessor. We don't want to increase register pressure and // end up causing other computation to be spilled. if (MI->getDesc().isAsCheapAsAMove()) { MachineBasicBlock CSBB = CSMI->getParent(); MachineBasicBlock BB = MI->getParent(); if (CSBB != BB && !CSBB->isSuccessor(BB)) return false; }

Once the register allocator learns to undo CSE, we can enable this optimization.