Raised here: https://reviews.llvm.org/D132520#inline-1278500 (we already do this on the upper 32 bits for -Os/-Oz, but on some targets (atom/slm, pre-haswell, etc.) it might be worth doing in more cases).
orl $65536, %edi # imm = 0x10000
BTW, with -Oz at least, we should be using 4-byte bts $16, %edi instead of 6-byte or $65536, %edi (or 5-byte for EAX).
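For reference, here is a minimal C sketch of the kind of source that produces this OR, together with the machine-code bytes behind the sizes quoted above. The function name `set_bit16` and the array names are made up for illustration; the byte sequences are the standard 32-bit operand-size encodings as I understand them (`81 /1 id`, `0D id`, `0F BA /5 ib`), so check them against a disassembler before relying on them.

```c
#include <stdint.h>

/* Hypothetical reproducer: sets bit 16 of a 32-bit value.
 * The OR with 0x10000 is what currently becomes `orl $65536, %reg`. */
uint32_t set_bit16(uint32_t x) {
    return x | (UINT32_C(1) << 16);
}

/* orl $65536, %edi  ->  81 /1 id : opcode + ModRM + imm32 = 6 bytes */
static const uint8_t enc_or_edi[] = {0x81, 0xCF, 0x00, 0x00, 0x01, 0x00};

/* orl $65536, %eax  ->  0D id : short accumulator form = 5 bytes */
static const uint8_t enc_or_eax[] = {0x0D, 0x00, 0x00, 0x01, 0x00};

/* btsl $16, %edi    ->  0F BA /5 ib : 2-byte opcode + ModRM + imm8 = 4 bytes */
static const uint8_t enc_bts_edi[] = {0x0F, 0xBA, 0xEF, 0x10};
```

Note that `bts` also writes CF with the old value of the bit, where `or` sets the flags from the result, so the two are interchangeable only when the flags are dead afterwards.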
On Intel CPUs, bts $i8, %reg is still only 1 uop, although it can run on fewer execution ports than or (p06 on SKL/ICL, the shift ports; only p1 on Alder Lake P-cores). That makes it appropriate at least for -Os -mtune=intel (or any specific Sandybridge-family tuning) if we want to be that fine-grained about instruction selection.
Perhaps even for -O2 -mtune=intel. Although maybe not, since Alder Lake P-cores dropped the throughput to 1 per clock, where it competes with imul and tzcnt/lzcnt/popcnt for that port. (BTS still has 1 cycle latency, unlike most integer uops that can only run on port 1. Alder Lake E-cores run it as 1 uop with 1 cycle latency to the integer output and 2 cycle latency to the CF output.)
So maybe only for -O2 with a -march before Alder Lake? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small savings, just 2 bytes. OTOH it might be a pretty small gain, unless used in a loop with a port 1 bottleneck.
On AMD CPUs, bts $imm, %reg is 2 uops, so only appropriate for -Oz and maybe -Os.