llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.84k stars 11.46k forks source link

BTS/BTR are preferable to MOVABS+AND #35259

Open davezarzycki opened 6 years ago

davezarzycki commented 6 years ago
Bugzilla Link 35911
Version trunk
OS All
CC @topperc,@davezarzycki,@preames,@RKSimon,@rotateright,@ZviRackover

Extended Description

On both high-end and low-end Intel processors, using BTS/BTR to set/clear any of the high 32 bits of a 64 bit register is preferable to the current MOVABS followed by an AND instruction. BTS/BTR have latency of 1 cycle, but MOVABS+AND has a combined latency of 2 cycles. The code density of BTS/BTR is more than twice as good as MOVABS+AND and therefore decoder throughput is much better too.

davezarzycki commented 6 years ago

For your consideration, please consider the following counterfactual argument:

Imagine a parallel timeline where LLVM emitted BT instructions for all single bit operations. Instead of discussing whether the BT instructions are worth it, in that timeline we'd be discussing whether MOVABS plus AND/OR/XOR can be profitable in some scenarios. I think that answer is "sometimes": when spare registers are available AND when the result of a given MOVABS can be reused.

preames commented 6 years ago

cc'ing Philip and Zvi from D37418.

David makes good points about decode bandwidth and code size here. I think we agreed (or at least there was no objection) that D37418 would always make sense with -Os (optsize), so start there, extend it to regular optimization mode, and see if there are any perf wins/losses/complaints/regressions? Starting w/O0 definitely makes sense.

My main concern here is what happens if we have this pattern in a hot loop. If we have a spare register, using the movabs in the preheader is profitable. If we don't, then using bt is probably the right answer. My guess here is that we might be able to get away with preferring bt in practice, but this is the case I'd worry about causing regressions if anything is going to. I think we should try it and find out though.

rotateright commented 6 years ago

cc'ing Philip and Zvi from D37418.

David makes good points about decode bandwidth and code size here. I think we agreed (or at least there was no objection) that D37418 would always make sense with -Os (optsize), so start there, extend it to regular optimization mode, and see if there are any perf wins/losses/complaints/regressions?

davezarzycki commented 6 years ago

Hi Craig – I just looked at the test suite changes as a part of your patch: https://reviews.llvm.org/D37418

Personally, I'd limit the merging MOVABS and OR/AND/XOR into BTS/BTR/BTC if-and-only-if the result of MOVABS is used once. That's the clear win here. If the MOVABS result can be reused, then the BT instructions are net negative due to backend bandwidth limitations and code density (BT are two-byte "0F" opcodes, whereas AND/OR/XOR or single byte opcodes).

davezarzycki commented 6 years ago

PS – LLVM already generates BT to test bits 32 to 62 (zero indexed). Using MOVABS+AND/OR for the set/clear side seems inconsistent.

davezarzycki commented 6 years ago

The BTS/BTR versus AND/OR backend bandwidth is a distraction. The real problem is MOVABS. Unless I'm missing something, here is the "net throughput math":

MOVABS+AND/OR is 13 bytes long. BTS/BTR is 5 bytes long.

The frontend decodes 16 bytes at a time and queues 2 to 5 instructions depending on the CPU.

If one has a long stream of MOVABS+AND/OR operations, then the frontend bandwidth would be the limiting factor, because the "BTS/BTR equivalent" operations would be 1.23 per cycle.

However, if one has a long stream of actual BTS/BTR operations, the backend bandwidth would be the limiting factor. The frontend would be able to deliver up to 3.2 BTSs or BTRs per cycle, but the backend would limit that to 2/cycle. This is better than MOVABS+AND/OR and it leaves frontend bandwidth available for other instructions.

As a point of comparison, on Intel's "little" CPUs (a.k.a. Atom, etc). The math is essentially the same. MOVABS destroys frontend bandwidth, but actual BTS/BTR instructions generate better net total throughput.

Finally, the other advantage of BTS/BTR over MOVABS+AND/OR is less register pressure, and therefore fewer spills.

topperc commented 6 years ago

I had a patch for this a while back https://reviews.llvm.org/D37418 but there was lots of discussion about the reduced throughput of bts/btr relative to or/and