LebedevRI opened 5 years ago
SelectionDAGBuilder inserted a trunc before the shl during DAG construction. This caused DAGCombine to shrink the add -1 to 8 bits because isTypeDesirableForOp didn't tell it not to. Lowering for the icmp created the BT from the and/shl, but needed to extend the register type back to 32 bits to match the register size required for the BT count input.
I'm pretty sure that even though BT only needs 5 or 6 bits for the count, the partial-register handling in the front end and register renaming does track it as an access to the full 16/32/64-bit register. This means on Sandy Bridge and later, it will trigger an H-register merge if needed. I think the shift amounts for SHLX/SARX/SHRX are treated the same. The old shift instructions that always use CL will not trigger an H-register merge.
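A minimal C reproducer of this kind of pattern might look like the following (a sketch: the thread's exact testcase isn't shown here, so `bit_test` and its shape are my assumption based on the add -1 / shl / and / icmp chain described above):

```c
#include <stdbool.h>

// Hypothetical reproducer (assumption: not the thread's exact testcase).
// The `n - 1` feeds a shift count; DAGCombine shrinks the add to 8 bits,
// and lowering for the compare then needs a movzbl to widen the count
// back to 32 bits for BT's count input.
bool bit_test(unsigned x, unsigned n) {
    return x & (1u << (n - 1));
}
```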
Promote all the things is probably the best way to fix this.
Not sure if it's the same, but I was looking at a similar problem that led me to this bit of X86InstrCompiler.td:
    // anyext. Define these to do an explicit zero-extend to
    // avoid partial-register updates.
To avoid this, the big change that I think we want -- but probably requires many intermediate fixes -- is to promote all 8-bit ops to 32-bit.
> Promote all the things is probably the best way to fix this.
Yes, since narrow function args are (unofficially) required to be extended to 32-bit, reading ESI instead of SIL won't cause a partial-register stall. (Which is only a thing on obsolete P6-family anyway; later uarches have false dependencies when writing. And SnB itself has efficient merging).
Of course, if we have efficient SHRX, I think it's better not to go through FLAGS at all, like how we compile for ARM64 (https://godbolt.org/z/a9xxTPjWb) with SUB/LSR/AND:
    dec  %esi
    shrx %esi, %edi, %eax
    and  $1, %eax
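In C terms, that sequence computes the bit directly instead of materializing a mask and testing a FLAGS result; a sketch (the function name is mine, not from the thread):

```c
#include <stdbool.h>

// Equivalent of the dec/shrx/and sequence above: shift the tested bit
// down to position 0, then mask. No FLAGS result is consumed.
static bool bit_test_shrx(unsigned x, unsigned n) {
    return (x >> (n - 1)) & 1;
}
```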
Variable-count shifts without BMI2 are unfortunately more expensive on Intel SnB-family CPUs than bt, but without -mbmi2 and with -mtune=znver1 (or any earlier AMD CPU that doesn't have BMI2, like bdver1 or k10), we could use
    lea -1(%rsi), %ecx
    mov %edi, %eax
    shr %cl, %eax
    and $1, %eax
(Or, to take the mov %edi, %eax off the critical path, mov $1, %eax / and %edi, %eax, but that costs a few extra bytes of code size.)
There's also the possibility of shifting by 1 more than the count, to shift the bit into CF. But x86 masks the shift count, so a & (1<<(32 - 1)) is well-defined but not equivalent to CF after a >> 32, because that's the same as a >> 0, which leaves FLAGS unmodified. But if there's a wider operand-size available, we can use that:
    mov  %esi, %ecx
    shr  %cl, %rdi    # 64-bit operand-size, so a shift count of 31+1 = 32 still works
    setc %al          # partial-register write false dependency: we haven't written EAX as part of this dep chain
IDK if this could be useful; maybe on AMD. On P6-family, reading a FLAGS result after a variable-count shift stalls the front end until the shift retires, vs. being 3 uops on Sandy Bridge-family for FLAGS merging. (That's how they managed to make it 1 uop on P6 while still respecting the fact that shifts with count=0 must leave FLAGS unmodified; annoying CISC semantics that are basically part of the x86 tax for modern implementations.)
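A C model of the CF trick (my sketch, not from the thread): the carry flag after a shr with count >= 1 holds the last bit shifted out, and at 64-bit operand size a count of 32 is still in range because the hardware masks the count with 63 rather than 31:

```c
#include <stdint.h>
#include <stdbool.h>

// CF after `shr $count` (count >= 1) is the last bit shifted out,
// i.e. bit (count - 1) of the pre-shift value.
static bool cf_after_shr64(uint64_t x, unsigned count) {
    return (x >> (count - 1)) & 1;
}

// Bit (n - 1) of a 32-bit value via the wider operand size: shift the
// zero-extended 64-bit copy right by n. n == 32 is fine here because
// 64-bit shifts mask the count with 63, unlike 32-bit shifts (mask 31).
static bool bit_test_cf(uint32_t x, unsigned n) {
    return cf_after_shr64(x, n);
}
```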
Extended Description
results in:
That movzbl shouldn't be there.
I'm not quite sure where to look, though.