feature rq: track "cold" vector registers for use as don't-care sources to avoid false dependencies #32210

Open llvmbot opened 7 years ago

llvmbot commented 7 years ago
Bugzilla Link 32863
Version trunk
OS All
Attachments int-float-test-cases.c
Reporter LLVM Bugzilla Contributor
CC @RKSimon,@rotateright

Extended Description

(This is a summary / rewrite of what I wrote while having this idea on an old closed bug: llvm/llvm-project#22398 #c11. See that for some Haswell perf analysis of scalar int->FP conversion.)

x86 has several cases of inconvenient input-dependencies, either for scalar stuff in vector regs or for stuff like generating a vector of all-ones on CPUs that don't recognize PCMPEQD same,same as independent of its inputs. The usual solution is to break dependencies with pxor same,same before doing something, or to guess / hope that a register unused by this function is safe to use.
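
For concreteness, the conventional idioms look roughly like this (AT&T syntax; register and operand choices are just illustrative):

vpxor %xmm0, %xmm0, %xmm0      # dependency-breaking zeroing idiom
vcvtsi2sd %edi, %xmm0, %xmm0   # int->double merges into xmm0, which is now known to be ready
vpcmpeqd %xmm1, %xmm1, %xmm1   # all-ones; on CPUs that don't special-case this, it still
                               # depends on xmm1's old value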

But with AVX 3-operand instructions, we can use a different strategy: reuse a known-safe register as the don't-care input without destroying it.

Such a register doesn't have to have been xor-zeroed; it can be holding a loop-invariant constant. Or we can vpxor one such register and reuse it for the rest of the function (or until we make a function call, which could return with OOO execution still chewing through a long dep chain on that register).
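
A minimal sketch of that alternative, assuming %xmm7 happens to hold a loop-invariant constant (register and symbol names are illustrative):

vmovaps some_constant(%rip), %xmm7   # loop-invariant constant, written once before the loop
# ... inside the loop:
vcvtsi2sd %eax, %xmm7, %xmm0         # reuse the known-ready xmm7 as the don't-care merge
                                     # source; xmm7 is not modified and no vxorps is needed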

The use cases where having a safe read-only register helps include:

The most important / common one by far is int->float conversion, due to Intel's short-sighted design of SSE (the scalar convert instructions merge their result into the destination register instead of zero-extending it, so the destination's old value becomes an input) and the decision to keep that behaviour in the AVX versions (good for consistency, bad for performance). Anyway, hoisting a VXORPS out of a loop that contains a vcvtsi2sd is an obvious win.

clang already sort of does this for int->float conversions: AFAICT it picks a register that is unused in the function and gambles that it is cold. This is a reasonable strategy, but it falls apart under register pressure. (More sophisticated tracking could also avoid gambling that a caller or callee didn't leave that register at the end of a long dependency chain unrelated to the int->float conversion we're doing, e.g. one that includes a cache miss or a loop accumulator.) Still, perhaps the gamble is worth the code-size savings from leaving out a lot of vxorps instructions.

If you use up all the xmm regs with constants, clang will put a vxorps-zeroing instruction into the loop and then replace that zero with a constant. Better would be to simply use one of the constant-holding registers as the don't-care source operand for vcvtsi2sd. The x86-64 SysV ABI (and I assume other ABIs) allows passing scalar float/double args with garbage (not zeros) in the high bytes, and that already happens in practice, so there's no need to "clean up" the high elements before making function calls.

See this test-case on godbolt for clang trunk 301740, gcc8 20170429, icc17, and MSVC CL19 2017. With 16 constants needed, clang keeps two of them in memory since it uses two scratch regs for no reason/benefit. (And has a vxorps in the loop).

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571, which I reported for just the int->float part of this issue.


Related cases: non-read-only uses, where the false dependency is on the output register. Tracking not-recently-modified registers can let us pick one to clobber, and/or decide whether to use a vpxor-zeroing instruction depending on how long ago the register we picked was last modified. e.g. after one loop and before another, all the dead constant registers from the finished loop are safe to read (or clobber).
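
A sketch of that non-read-only case, assuming %xmm5 held a constant in a loop that has already finished (names are illustrative):

# %xmm5 was last written long before the finished loop ended, so it can't still be
# at the end of an in-flight dependency chain.
vsqrtss (%rdi), %xmm5, %xmm5    # memory-source sqrt merges into the cold register:
                                # safe to clobber it without a vpxor dep-break first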

vsqrtss with a register (not memory) source can use src,src,dest with AVX, avoiding the false dependency that src,dest,dest has. (clang 4.0 gets this right; 3.9.1 and earlier are like gcc and use vsqrtss %xmm1,%xmm0,%xmm0. ICC uses vsqrtss %xmm1,%xmm1,%xmm1 and then a vmovaps.) int->float conversion with vcvtsi2ss can't use this trick because its source operands aren't both vector regs.
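
For illustration, the two operand orders (AT&T syntax; square root of %xmm1, result in %xmm0):

vsqrtss %xmm1, %xmm0, %xmm0    # merge source is xmm0: false dependency on whatever
                               # last wrote xmm0
vsqrtss %xmm1, %xmm1, %xmm0    # merge source is xmm1, which has to be ready anyway:
                               # no false dependency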


If we had such a readiness/coldness/dep-chain tracking infrastructure, _mm_undefined_ps() could take advantage of it to make a good choice of which dead register to pick (and whether to dep-break it). This is useful for things like a horizontal-sum function that wants to use MOVHLPS to avoid extra MOVAPS instructions when extracting the high half of a vector with only SSE2. (With SSE3, MOVSHDUP is a great first step as an FP copy+shuffle. Then you can use the original __m128 C variable as the destination for MOVHLPS, since it's from earlier in the same dep chain but dead now.)
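
For example, the SSE3 horizontal-sum pattern described above, reusing the original (now dead) input register as the MOVHLPS destination (a sketch; assumes elements x0..x3 in %xmm0):

movshdup %xmm0, %xmm1   # FP copy+shuffle: xmm1 = { x1, x1, x3, x3 }
addps    %xmm0, %xmm1   # xmm1 = { x0+x1, _, x2+x3, _ }
movhlps  %xmm1, %xmm0   # xmm0 low = x2+x3; xmm0 is from earlier in this dep chain
                        # and dead now, so merging into it costs nothing
addss    %xmm1, %xmm0   # xmm0[0] = (x0+x1) + (x2+x3)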


That reminds me: for instructions that do have a real source (like sqrtss), an output dependency on a register that had to be ready earlier in the same dep chain is always safe.

e.g. if we want to keep both a*b and sqrtf(a*b) around, we can do this without any ill effect from sqrtss's dependency on its output register:

mulss %xmm0, %xmm1 # xmm1 = a*b
sqrtss %xmm1, %xmm0 # xmm1 being ready means xmm0 is also ready

llvmbot commented 7 years ago

Tracking dep chains would also let the compiler make better decisions about whether to use the copy produced by a MOV, or whether to modify the original. I forget how well clang does with this, but integer MOV reg,reg has non-zero latency on architectures before Intel IvB and AMD Zen. Vector MOVDQA xmm,xmm has zero latency on Bulldozer and IvB and later.

Anyway, making a copy for later use and then modifying the original may shorten latency chains in some cases for some CPUs. I definitely see compiler output that copies a register and then uses that as the input to something else, instead of using the original. I assume this behaviour was baked-in when Intel P6-family register-read port limitations made it desirable.
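
A sketch of the latency argument, assuming a CPU where MOVAPS has 1-cycle latency (pre-IvB Intel, pre-Zen AMD) and that we need to keep both x and x+y:

movaps %xmm0, %xmm2    # save a copy of x for later; the copy's latency is off the
                       # critical path
addps  %xmm1, %xmm0    # modify the original: x+y is ready one cycle sooner than if
                       # we had computed it into the freshly copied %xmm2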

I don't think there's any downside for the number of physical-register-file entries used. The CPU can free them once no future instructions need them, if they're not part of the architectural state (i.e. the state after the last retired instruction), so I don't think it's worse to make more architectural registers be "recently-written" for most designs.

llvmbot commented 7 years ago

A conservative default for how many instructions ago a register must have been written for it to count as cold would be the ROB size of whatever -mtune= is tuning for. That's the size of the out-of-order execution window: once an instruction has issued, anything more than a ROB's worth of instructions older than it has already retired (retirement happens in-order on all designs, to support precise exceptions).

The ROB size on Intel Skylake is 224 fused-domain uops. Every instruction decodes to at least one uop (except that a macro-fused compare-and-branch pair counts as one), even instructions that don't need an execution unit. So 224 instructions would be a reasonable approximation, too (except that rep stosb and other microcoded instructions should count as many more, at least 1 per 32B of data movement).

Low-power architectures have much smaller ROB sizes. Older uarches have smaller ROBs, too. Intel Sandybridge was 168 entries. (http://www.realworldtech.com/sandy-bridge/5/).

A reasonable default for -mtune=generic with some future-proofing might be 300. And in most cases, the register we pick won't be the one that has the most stalls. But unless we want to predict (or use profiling info to find) which loads will cache-miss often and which dependency chains will be bottlenecks, using a shorter default could cut into the effective out-of-order window for a given instruction. OTOH, if the compile-time cost scales with the number, we can probably pick a lower number (like 64) without serious impact in most cases. Or not get nearly as fancy as full dep-chain tracking, and just look for loop-invariant registers for code inside loops.


If anyone cares about Intel Nehalem and earlier, tracking register dependencies and how recently-written they are could allow code-gen to minimize register-read-port stalls. e.g. group together instructions that read the same cold register so they might issue in the same group. Or if given the choice to increment a loop counter before/after using it in an addressing mode, increment it before if tuning for those old architectures. (That kind of thing can hurt other CPUs by requiring more out-of-order resources to find as much ILP, though, so don't do this by default.)
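
A sketch of the loop-counter point, assuming the displacement can absorb the difference between the two orderings:

# (a) read, then increment: the addressing mode may have to read a "cold" %rcx
#     from the permanent register file, competing for a read port on P6-family
movss (%rsi,%rcx,4), %xmm0
inc   %rcx

# (b) increment first, adjust the displacement: %rcx was just written, so the
#     address calculation gets it from the bypass network instead
inc   %rcx
movss -4(%rsi,%rcx,4), %xmm0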

For register-read port stalls, we'd be worried about the best/average case of execution throughput, and Agner Fog says a register can become "cold" as few as 5 cycles after execution on Nehalem.

This is different from avoiding false dependencies, where we're worried about the worst-case of stalls or loop-carried dependencies on the register we want to use.