Closed: rotateright closed this issue 7 years ago
rL305551
rL302989 adds 'vcmptrueps 0, 0' support for 256-bit vectors.
Another ExecutionDepsFix patch is necessary for 'vcmptrueps undef, undef' for optminsize support.
Out-of-order execution hides any extra bypass-delay latency from integer domain to FP domain, as long as the idiom has no input dependencies (or can otherwise run early enough).
I assume that pcmpeqd -> addpd does have an extra 1 cycle of latency on Intel SnB-family CPUs to forward the integer-vector output to the input of an fp-vector add. But pcmpeqd(same,same) can run in a spare cycle on whatever execution port it was assigned to at issue time (e.g. p1 or p5 for Haswell, p015 for SKL), and have its result forwarded from ivec to fp before the addpd's other input is ready (unless it's also near the start of a new dependency chain, in which case latency isn't typically the most important thing).
One way this could fail, I think, is if a branch mispredict or i-cache miss stalled the front-end, and those two instructions were in the first couple of groups to be issued into the out-of-order scheduler (aka the Reservation Station on Intel, where not-yet-executed unfused-domain uops wait for their port to become available). In that case out-of-order execution doesn't have a window of future instructions yet, so ADDPD will have to wait if the OOO core only sees the PCMPEQD less than 2 or 3 cycles before the ADDPD.
Since the bypass-delay latency is only 1 or 2 extra cycles depending on CPU, and we only pay it in rare cases, we should definitely optimize for throughput. (i.e. pick instructions that don't need any execution ports, or run on multiple ports).
Haswell can only run CMPPS/PD/SS/SD on port1 (and it has an input dependency), but it can run PCMPEQD on port 1 or 5 with no input dependency.
AMD Steamroller can run CMPPS on P01, but PCMPEQD on P02.
If running a block of code that's consistently front-end bottlenecked (e.g. slow-to-decode and busts the uop cache because of switching to/from microcode, or just has a mix of uops that can execute at better than 4 per clock), then OOO execution can catch up with the front-end after getting behind due to extra latency.
I tested on Skylake: vcmptrueps %xmm0,%xmm0,%xmm0 has a dependency on its input register. (In a loop, it runs at one per 4 clocks, limited by latency).
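For reference, a sketch of the kind of dependency-carried loop that shows this (not necessarily the exact test harness; the counter register is just a placeholder):

    .loop:
        vcmptrueps %xmm0, %xmm0, %xmm0   # destination is also a source -> loop-carried dependency
        dec %ecx
        jnz .loop                        # ~1 iteration per 4 clocks on SKL, i.e. latency-bound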
On all CPUs, the only port(s) it can run on are the same one(s) that are used by FP-add/sub (because they share hardware), which makes it look worse in a microbenchmark that feeds ADDPD.
If the optimizer can reliably find a cold source register to avoid false dependencies, it should be a bigger win for the AVX1 256b case if the surrounding code doesn't bottleneck on that port. Even with xor-zeroing that can't be hoisted, it makes sense that it's a win over pcmpeqd + vinsertf128.
If we do have to vxorps to zero, using vxorps %xmm0,%xmm0,%xmm0 may have an advantage over vxorps %ymm0,%ymm0,%ymm0 (letting the usual AVX zeroing out to VLMAX take care of zeroing the full %ymm or %zmm). It definitely has a code-size advantage over vxorps %zmm0,%zmm0,%zmm0, because VEX is shorter than EVEX. (I mentioned this in my xor-zeroing SO post: http://stackoverflow.com/a/33668295/224132)
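To put numbers on the code-size point (byte counts follow from the VEX/EVEX encoding rules; I haven't pasted an exact assembler dump here):

    vxorps %xmm0, %xmm0, %xmm0    # 4 bytes with the 2-byte VEX prefix; zeroes the full ymm0/zmm0 out to VLMAX
    vxorps %ymm0, %ymm0, %ymm0    # also 4 bytes (still the 2-byte VEX prefix)
    vxorps %zmm0, %zmm0, %zmm0    # 6 bytes: the EVEX prefix is always 4 bytes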
IDK if any CPUs actually run faster with 128b xor-zeroing instead of 256b, but it's not worse. The xor-zeroing case is handled specially anyway, so I assume that Jaguar and other CPUs that crack 256b ops in half avoid that for xor-zeroing. Unless cracking happens at decode, and independence-detection happens later? I can't test this, since I only have Intel hardware.
Keeping track of registers that can be used as don't-care sources while avoiding false dependencies is useful for other things, too (e.g. scalar int->fp conversion, see llvm/llvm-project#22398 #c11). It doesn't have to be xor-zeroed; a register holding a constant works too. TODO: open a new bug for this.
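To sketch the scalar int->fp case (register numbers here are placeholders, not what the compiler currently picks):

    # vcvtsi2sd merges into the upper elements of its first xmm source, so this has a
    # false dependency on whatever last wrote xmm0:
    vcvtsi2sd %edi, %xmm0, %xmm0

    # Current workaround: xor-zero the destination first, costing an extra front-end slot:
    vxorps    %xmm0, %xmm0, %xmm0
    vcvtsi2sd %edi, %xmm0, %xmm0

    # With cold-register tracking: merge from a register already known to be cold or holding
    # a constant (xmm7 is a stand-in), so no extra zeroing instruction is needed:
    vcvtsi2sd %edi, %xmm7, %xmm0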
Other than the AVX1-without-AVX2 ymm case, I think pcmpeqd is the way to go. It's the standard idiom, so support for it should continue to improve in future CPUs.
However, Agner Fog says that Intel Silvermont and KNL don't recognize PCMPEQD as being independent of its inputs. (They recognize xor-zeroing, but no all-ones idiom). So that's another use-case for tracking cold registers. (And for AVX512 VPTERNLOG, I guess. Interesting point that compare-into-mask means we need a new all-ones idiom).
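To spell out the compare-into-mask point (plain AVX-512F instructions, nothing exotic):

    vpcmpeqd   %zmm0, %zmm0, %k1             # EVEX compares write a mask register, not a vector,
                                             # so the old pcmpeq all-ones trick lands in k1
    vpternlogd $0xff, %zmm0, %zmm0, %zmm0    # the vector all-ones idiom becomes vpternlog with imm 0xff,
                                             # which writes all-ones regardless of its inputs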
The out-of-order window on KNL is fairly small, and it reportedly is usually bottlenecked on front-end throughput, so using more instructions to break dependencies is probably not worth it for KNL / silvermont. (And Silvermont has basically no out-of-order execution for vectors, only integer).
What I haven't yet done is compare the perf of pcmpeq vs. vxorps+vcmpps for 128-bit (AVX) or 256-bit (AVX2) across the different domains; this possibly needs support in the domain-switching code as well.
I can confirm that pcmpeq (what we have now) is consistently faster for all 128-bit types - regardless of domain - tested on both Jaguar (~5%) and SandyBridge (7%).
Similarly for AVX2, Carrizo (~6%) prefers pcmpeq for all 256-bit types - I don't have access to an Intel AVX2 machine right now but I'd expect that to be true as well.
I'd recommend that for AVX1-only machines we use the cmpps(xorps()) pattern for the 256-bit 'all ones' case but keep our existing patterns for everything else.
I wonder if we should be inserting a dependency breaking XOR before the VPTERNLOG that we use for 512-bit all ones.
Possibly, although that needs hardware to test ;-)
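For concreteness, the kind of sequence being discussed (a sketch; as noted above, I haven't been able to measure this on AVX-512 hardware):

    # Today's 512-bit all-ones: vpternlogd with imm 0xff writes all-ones regardless of the
    # inputs, but the hardware may still see a dependency on zmm0's previous value:
    vpternlogd $0xff, %zmm0, %zmm0, %zmm0

    # With a dependency-breaking XOR first, the vpternlogd reads a register with no
    # outstanding producer:
    vpxor      %xmm0, %xmm0, %xmm0           # zeroing idiom; zeroes the full zmm0 out to VLMAX
    vpternlogd $0xff, %zmm0, %zmm0, %zmm0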
Further perf tests on Jaguar indicate that:
    vxorps %ymm0, %ymm0, %ymm0
    vcmpps $15, %ymm0, %ymm0, %ymm0
is consistently faster (by about 9%) than:
    vpcmpeqd %xmm0, %xmm0, %xmm0
    vinsertf128 $1, %xmm0, %ymm0, %ymm0
Testing equivalent code on a SandyBridge (E5-2640) puts it slightly (~3%) faster as well.
Zeroing the register beforehand seems to be the key for fast-path handling - neither CPU can do much to fast-path a subvector insert.
According to Intel optimization guides, vpcmpeq with the same source has special handling to avoid a dependency on the last producer of the source register. I don't think vcmptrue has the same special treatment.
How does this match up for 256-bit vectors on AVX1-only targets such as SandyBridge? Is VPCMPEQD+VINSERTF128 still the best option?
AMD SOGs don't seem to refer to CMPEQ patterns as being fast; I think they're limited to move elimination and XOR for zeroing vectors.
This would also mean AVX1 targets don't have to waste time concatenating xmm all-bits vectors:
define <4 x double> @cmp256_domain(<4 x double> %a) {
  %cmp = fcmp oeq <4 x double> zeroinitializer, zeroinitializer
  %sext = sext <4 x i1> %cmp to <4 x i64>
  %mask = bitcast <4 x i64> %sext to <4 x double>
  %add = fadd <4 x double> %a, %mask
  ret <4 x double> %add
}
_cmp256_domain:
    vpcmpeqd %xmm1, %xmm1, %xmm1
    vinsertf128 $1, %xmm1, %ymm1, %ymm1
    vaddpd %ymm1, %ymm0, %ymm0
    retq
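With a vcmptrueps-style all-ones idiom the same function could instead come out as something like this (hypothetical output for illustration, not what llc produced at the time):

_cmp256_domain:
    vcmptrueps %ymm1, %ymm1, %ymm1    # all-ones in a single FP-domain instruction, no vinsertf128
    vaddpd %ymm1, %ymm0, %ymm0
    retq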
I believe, according to Intel's optimization guide, pcmpeq with the same register for both operands is recognized as an idiom and avoids the dependency on that register, but the instruction does still execute.
I don't think there is an all-ones-creating instruction in the fp domain that is guaranteed to always produce all-ones given NaN inputs. Maybe one of the AVX cmpps encodings? But AMD doesn't support those.
Thanks, Craig.
Would "VCMPTRUEP[D/S]" (vcmpps 15, %xmm0, %xmm0, %xmm0) do it?
Re: AMD AVX - it may just be a documentation problem. See bug 28110.
Just tested this "vcmpps 15, %xmm0, %xmm0, %xmm0" (non-signalling true) and it worked fine on Jaguar and Carrizo; we should even be able to do this with UNDEF inputs.
Particularly useful as a way to generate an all-bits vector in a ymm on an AVX1-only machine (whether for the float or integer domain).
Simpler test case:

define <2 x double> @cmp_domain(<2 x double> %a) {
  %mask = bitcast <2 x i64> <i64 -1, i64 -1> to <2 x double>
  %add = fadd <2 x double> %a, %mask
  ret <2 x double> %add
}
assigned to @dtemirbulatov
Extended Description
Disregard that we're adding NaN values here to just focus on the isel. :)
define <2 x double> @cmp_domain(<2 x double> %a) {
  %cmp = fcmp oeq <2 x double> zeroinitializer, zeroinitializer
  %sext = sext <2 x i1> %cmp to <2 x i64>
  %mask = bitcast <2 x i64> %sext to <2 x double>
  %add = fadd <2 x double> %a, %mask
  ret <2 x double> %add
}
$ ./llc -o - cmp_domain.ll
    pcmpeqd %xmm1, %xmm1
    addpd %xmm1, %xmm0
    retq
The splat of -1 (NaN) is using an integer domain instruction (pcmpeqd), and that's getting used by an FP domain instruction.
I'm not sure if this is an actual problem for any CPU...because the splat-ones-creating-instruction should be recognized as an idiom and not actually require any execution resources?
But I noticed this as part of: http://reviews.llvm.org/D21269
If we do want to fix this, I think we need to see what happens (in ExeDepsFix?) to X86::V_SETALLONES versus X86::V_SET0.