Closed: IJzerbaard closed this pull request 4 years ago
Hey @IJzerbaard, thanks for the PR!
I've actually been investigating the xor vs mul question myself. In a number of cases, I've found that the optimizer decides on its own to replace the mul with an xor(-0, _); there are a couple of places where I'd used an xor before checking the assembly and realizing this. I believe xor has a 1-cycle latency, but as you point out, it can reduce opportunities for FMA contraction.
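For reference, here is a rough intrinsics-level sketch of the two negation idioms being discussed (not code from this repo; the function names are just for illustration):

```c
#include <immintrin.h>

// Negate four packed floats by multiplying by -1.0 (compiles to mulps,
// or may get folded into an FMA such as vfnmadd when contraction applies).
static inline __m128 negate_mul(__m128 v) {
    return _mm_mul_ps(v, _mm_set1_ps(-1.0f));
}

// Negate by flipping the sign bit: xorps against -0.0 (0x80000000 in each lane).
// This is what optimizers often emit for a plain `-x` on a vector.
static inline __m128 negate_xor(__m128 v) {
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}
```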
That said, the benchmark should be the source of truth, so I'll try running your PR with the benchmark later today to see what effect there is, so thanks so much!
This also reminds me: I have a few scripts for setting up a local Compiler Explorer (godbolt) instance to easily check the generated disassembly. I should clean those up to make this easier for contributors such as yourself.
FYI @IJzerbaard, I ended up checking the assembly and noticed that not all compilers found the optimization as I expected, so I think you were onto something. I was actually in the middle of a big memory-layout refactor (which let me eliminate some instructions), and I incorporated your changes along with a bunch of others here. Thanks again for the PR!
Using mulps to negate can be replaced with xorps, which can execute on more ports (SKL and newer) and has a lower latency.
Using mulss by zero to zero out just the lowest lane can be replaced by blendps, which can execute on more ports (Haswell and newer) and has a lower latency.
I couldn't get the benchmark project to work, otherwise I would have shown those results. Usually these changes help, but there is a slight danger that changing multiply-by-negative-one to xor-by-negative-zero reduces opportunities for FMA contraction. As far as I know, changing mulss to blendps should just be universally good.
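For illustration (not code from the PR; names and helpers are mine), zeroing just the lowest lane with a blend instead of a multiply-by-zero looks roughly like this, assuming SSE4.1 is available:

```c
#include <immintrin.h>

// Zero the lowest lane by multiplying it by 0.0 (mulss); upper lanes pass through.
static inline __m128 zero_low_mul(__m128 v) {
    return _mm_mul_ss(v, _mm_set_ss(0.0f));
}

// Same effect with blendps (SSE4.1): imm8 bit 0 = 1 takes lane 0 from the
// zero vector, while lanes 1-3 come from v.
static inline __m128 zero_low_blend(__m128 v) {
    return _mm_blend_ps(v, _mm_setzero_ps(), 0x1);
}
```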