[X86] Use BTS to set upper single bit on fast x64 targets

Raised here: https://reviews.llvm.org/D132520#inline-1278500. We already do this for the upper 32 bits at -Os/-Oz, but on some targets (atom/slm, pre-haswell, etc.) it might be worth doing in more cases.

orl $65536, %edi # imm = 0x10000

BTW, with -Oz at least, we should be using 4-byte bts $16, %edi instead of 6-byte or $65536, %edi (or 5-byte for EAX).
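
The byte counts quoted above can be checked with a minimal sketch that hand-assembles both forms for the legacy 32-bit registers. The helper names (`encode_or_imm`, `encode_bts_imm`) are hypothetical and only cover the register-direct, no-REX case; this is an illustration of the encodings, not anything taken from LLVM.

```python
import struct

# Register numbers for the eight legacy 32-bit GPRs (no REX prefix needed).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm_direct(ext, reg):
    """ModRM byte for register-direct mode (mod=11) with opcode extension /ext."""
    return 0xC0 | (ext << 3) | REG32[reg]


def encode_or_imm(imm, reg):
    """Shortest encoding of `or $imm, %reg`; imm is an unsigned 32-bit value."""
    if imm <= 0x7F or imm >= 0xFFFFFF80:
        # 83 /1 ib: the imm8 is sign-extended to 32 bits.
        return bytes([0x83, modrm_direct(1, reg), imm & 0xFF])
    if reg == "eax":
        # 0D id: short form that only exists for EAX.
        return bytes([0x0D]) + struct.pack("<I", imm)
    # 81 /1 id: general imm32 form.
    return bytes([0x81, modrm_direct(1, reg)]) + struct.pack("<I", imm)


def encode_bts_imm(bit, reg):
    """Encoding of `bts $bit, %reg` (0F BA /5 ib)."""
    return bytes([0x0F, 0xBA, modrm_direct(5, reg), bit])


if __name__ == "__main__":
    for name, code in [("or  $65536, %edi", encode_or_imm(1 << 16, "edi")),
                       ("or  $65536, %eax", encode_or_imm(1 << 16, "eax")),
                       ("bts $16, %edi", encode_bts_imm(16, "edi"))]:
        print(f"{name:18} {len(code)} bytes  {code.hex(' ')}")
```

Note that for bits 0 through 6 the mask fits in a sign-extended imm8, so `or` is already 3 bytes and BTS would be larger; the saving only applies to bits 7 through 31.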

On Intel CPUs, bts $i8, %reg is still only 1 uop, although it can run on fewer execution ports than or (p06 in SKL/ICL, the shift ports; only p1 in Alder Lake P-cores). It is appropriate at least for -Os -mtune=intel (or any specific Sandybridge-family tuning), if we want to be that fine-grained about instruction selection. Perhaps even -O2 -mtune=intel. Although maybe not, since Alder Lake P-cores dropped the throughput to 1 per clock, competing with imul and tzcnt/lzcnt/popcnt for that port. (BTS still has 1 cycle latency, unlike most integer uops that can only run on port 1. Alder Lake E-cores run it as 1 uop with 1 cycle latency to the integer output, 2 cycle latency to the CF output.)

So maybe only for -O2 with a -march before Alder Lake? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small saving, just 2 bytes. OTOH the downside is probably also small, unless it is used in a loop with a port 1 bottleneck.

On AMD CPUs, bts $imm, %reg is 2 uops, so only appropriate for -Oz and maybe -Os.
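
To make the trade-off in this discussion easier to follow, here is a rough sketch of the tentative policy as a decision function. Everything in it is an assumption for illustration: the function names, the CPU family labels, and the three-way "yes"/"maybe"/"no" answer are hypothetical, it only models 32-bit registers, and it is not LLVM's actual selection logic. The uop counts are the ones quoted above.

```python
# uop counts for `bts $imm, %reg` as quoted in the discussion.
# The family labels are hypothetical, not real -mtune names.
BTS_IMM_UOPS = {"intel-pre-adl": 1, "intel-adl": 1, "amd": 2}


def bts_saves_bytes(mask):
    """True if mask is a single bit that `or` can only encode with an imm32.

    Bits 0..6 fit in a sign-extended imm8, so `or` is already shorter there.
    """
    single_bit = mask != 0 and mask & (mask - 1) == 0
    return single_bit and 0x80 <= mask <= 0x80000000


def prefer_bts(mask, opt, family):
    """Return "yes", "maybe", or "no" for using BTS instead of OR."""
    if not bts_saves_bytes(mask):
        return "no"
    if opt == "Oz":
        # Size above all else: take the 2-byte saving everywhere.
        return "yes"
    if opt == "Os":
        # Fine where BTS is 1 uop; only "maybe" where it costs 2 uops.
        return "yes" if BTS_IMM_UOPS[family] == 1 else "maybe"
    if opt == "O2":
        # Only tentatively, and not where BTS is limited to port 1.
        return "maybe" if family == "intel-pre-adl" else "no"
    return "no"


if __name__ == "__main__":
    for opt in ("Oz", "Os", "O2"):
        for family in BTS_IMM_UOPS:
            print(opt, family, prefer_bts(1 << 16, opt, family))
```

The "maybe" answers mark exactly the cases that are left open above (-Os on AMD, -O2 on Intel before Alder Lake).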

llvm/llvm-project #57810