Open andyhhp opened 2 years ago
Maybe disabling loop unroll (-fno-unroll-loops) is better for this case.
On each iteration of the loop (and therefore duplicated 8 times), the same constant is reloaded into %edx despite the register not being clobbered.
Seems like we are missing an optimization after instruction selection. ISel creates that move and it is never hoisted.
# *** IR Dump After Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function check_pat: IsSSA, TracksLiveness
Function Live Ins: $rdi in %2
bb.0.entry:
successors: %bb.9(0x09249249), %bb.10(0x76db6db7); %bb.9(7.14%), %bb.10(92.86%)
liveins: $rdi
%2:gr64_with_sub_8bit = COPY $rdi
%4:gr8 = COPY %2.sub_8bit:gr64_with_sub_8bit
%3:gr32 = MOV32r0 implicit-def dead $eflags
%5:gr8 = SUB8ri %4:gr8(tied-def 0), 7, implicit-def $eflags
JCC_1 %bb.9, 7, implicit $eflags
bb.10.entry:
; predecessors: %bb.0
successors: %bb.1(0x76276276), %bb.9(0x09d89d8a); %bb.1(92.31%), %bb.9(7.69%)
%6:gr32 = MOVZX32rr8 %4:gr8
%7:gr32 = MOV32ri 243 <--------------------------------------
BT32rr killed %7:gr32, killed %6:gr32, implicit-def $eflags
JCC_1 %bb.1, 2, implicit $eflags
JMP_1 %bb.9
Can global isel avoid this situation?
Maybe disabling loop unroll (-fno-unroll-loops) is better for this case.
So yes - if I were micro-optimising, I could, but this was a tiny example taken out of a hypervisor, and throwing -fno-unroll-loops around at the top level would be wholly inappropriate. Clearly there are some issues with the default decisions about unrolling, and small loops like this are not a rare pattern across a kernel, so more appropriate unrolling decisions could have quite a large improvement overall.
Seems like we are missing an optimization after instruction selection.
So one thing I did wonder. bt is part of a group of 4 instructions, along with Bit Set/Complement/Reset. Bit Test is the odd one out, being a read-only non-destructive instruction. I don't know how to read IR, but I'm suspicious of the killed %7:gr32, killed %6:gr32, because both of the registers containing those values are still valid after the instruction completes.
don't know how to read IR, but I'm suspicious of the killed %7:gr32, killed %6:gr32 because both of the registers containing those values are still valid after the instruction completes.
That's a good observation! The kill is added during instruction selection, which works at the basic block level. I think we need some sort of value propagation (available expressions) to fix this. In principle, I'd expect global isel to trivially take care of this particular issue, although the larger issue of not having value propagation in the backend would still remain.
cc @zmodem for the bit test block
cc @zmodem for the bit test block
The only problem I see with the bit test block is the repeated "movl $243, %edx", which I think we can't fix in the switch lowering since that's local to a basic block. I'm surprised there's nothing which cleans that up afterwards (MachineCSE?)
The return value for the function is a 0 from xor %eax in the first instruction, or picked up as a 1 from the .Lswitch.table.check_pat:. I don't even know what to call this transformation, but it would be far better replaced with bt as per earlier iterations, and a single setc %al to drop the memory load and 32 byte(!) table.
The transformation is "switch to lookup table" (https://github.com/llvm/llvm-project/blob/llvmorg-14.0.0/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5837).
This is generally a good transformation, as in most cases it replaces an indirect branch through a jump table with a direct load of the return value from a lookup table. But in your case it's clearly not a win.
The missing piece in LLVM is that it doesn't narrow the 32-bit return values. If it did, it would pack the lookup table into a register and use "bt" for the lookup. You can see it working when changing the return type of your function to bool: https://godbolt.org/z/ojnbfnxd4 This is https://github.com/llvm/llvm-project/issues/29879
The transformation is "switch to lookup table"
Thanks!
I can absolutely see why this is a useful transformation in the general case. These days, kernels are all built with retpoline, which implies -fno-jump-tables, so converting the entire switch statement is absolutely a win. [Edit: the implicit no-jump-tables doesn't inhibit this optimisation, but the explicit flag does]
I suppose what confused me so much about this was the fact that only the final iteration of the loop had this transformation applied. I would have expected 8 instances or none.
Experimenting with code gen options, it turns out that -fno-jump-tables inhibits this optimisation, and causes a bt to be used. With -fno-unroll-loops, the code gen is per -O1, again with no transformation to a lookup table.
For the lookup table itself, playing with the function is interesting. It clearly shows that the transformation is linked to the return type. Is this perhaps because, once the loop has been unrolled, the continue; inside the switch can/does become return 1; and everything becomes exclusively return paths?
It really would be a good idea to optimise the table, rather than fixing the element size at the return type size. Using e.g. movzbl would quarter the size of the table in this example. int is an incredibly common return type and I'd wager that most examples of this transformation don't require 32 bits worth of range. It is possibly even worth spotting where you can shrink the element size at the cost of one extra add $slide, %val (probably a certain improvement for -Os). This would also provide the opportunity to convert to a bit test in cases such as this one.
I suppose what confused me so much about this was the fact that only the final iteration of the loop had this transformation applied. I would have expected 8 instances or none.
It's only in the last iteration that the "val & 0xff" is mapped directly to a return value, which is where the transformation kicks in. In the other iterations there is different control flow, either returning or continuing with the next value.
Experimenting with code gen options, it turns out that -fno-jump-tables inhibits this optimisation, and causes a bt to be used.
That was due to https://reviews.llvm.org/D35579 Looking back, I'm not sure how well motivated that actually was.
It really would be a good idea to optimise the table, rather than fixing the element size at the return type size. Using e.g. movzbl would quarter the size of the table in this example. int is an incredibly common return type and I'd wager that most examples of this transformation don't require 32 bits worth of range.
I agree. It's been on the todo-list a long time, and you pointing this out has bumped it higher on that list I'd say.
It is possibly even worth spotting where you can shrink the element size at the cost of one extra add $slide, %val
Yes, that seems worth doing.
Why does being mapped to a return value matter? There are plenty of switch statements which aren't the sole contents of a function which could potentially benefit.
I do understand the concern with the random memory reference. It does have poor locality of reference, and is likely to miss in the cache unless the switch statement is on a hotpath. A superscalar processor can probably absorb this stall, while simpler designs probably can't. That said, blindly disabling this optimisation is equally bad, because for every architecture/design, there will be a point at which such a stall is still less overhead than executing the if/else chain.
In the simple case, optimising the table size improves the locality of reference simply by making the executable size smaller, and therefore more likely to already be in TLB/cache from other references into .rodata.
While I have reservations about making this suggestion, it is worth mentioning just for completeness. On x86 at least, the locality can be improved further by emitting the table beside the function, because the translation will be present in the L2 TLB and the prefetcher will likely have pulled the lines into some level of the cache. The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource. It can in principle also be speculatively executed, but this isn't semantically different from speculative execution due to a bad earlier prediction landing in the middle of an instruction.
Why does being mapped to a return value matter? There are plenty of switch statements which aren't the sole contents of a function which could potentially benefit.
What I was trying to say was that the "switch to lookup table" transformation only works for switches which are used to select among values, not for switches in general where different inputs generate different control flow.
In the example, the first switches aren't used to select a value, but to select control flow ("return 0" or "continue to next iteration"), so this transformation doesn't apply to them.
On x86 at least, the locality can be improved further by emitting the table beside the function.
Yes, that could be applied to both lookup tables and jump tables. It doesn't look like other compilers do it though: https://godbolt.org/z/PWK4h4sEr Maybe it doesn't play well with split instruction and data caches?
The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.
I wonder if this would still be a problem if the compiler made sure to, for example, always put the table after a return instruction or similar.
What I was trying to say was that the "switch to lookup table" transformation only works for switches which are used to select among values, not for switches in general where different inputs generate different control flow.
Take this variation of the original code, which is logically equivalent. https://godbolt.org/z/1hdPr38fv The control flow isn't semantically different between the penultimate iteration and the final iteration. I can entirely believe that this might be harder to spot, but the transformation is equally applicable to each iteration of the loop.
Yes, that could be applied to both lookup tables and jump tables.
Yes.
It doesn't look like other compilers do it though: https://godbolt.org/z/PWK4h4sEr
This particular example can be optimised as:
    if (x == 3)
        h(8);
    else if (x <= 6)
        g(.Lswitch.table[x]);
as there isn't an obvious transformation between x and g()'s input. If you're going to take the hit of memory reference, this form is preferable, particularly in a post-Spectre world when indirect branches are far more expensive. (Again, I could entirely believe that this case is very hard to effectively spot/optimise.)
Maybe it doesn't play well with split instruction and data caches?
In this case, we're talking about rodata, so it's fine in general. (Having a write hit an in-flight instruction is devastating for performance.) Here, you're trading off potential uarch hits from possibly speculatively decoding/executing the table against uarch hits from poor data locality.
I wonder if this would still be a problem if the compiler made sure to for example always put it after a return instruction or similar.
In short, yes. Several microarchitectures suffer from Straight Line Speculation including past rets. Some microarchitectures have truncated prediction structures. Some microarchitectures have short and long targets in the indirect predictor, where short targets share the upper bits from source and destination.
Branch prediction is not perfect, and there is a nonzero chance of the table getting speculatively decoded and/or executed. What matters is whether this is more or less overhead than the memory reference pointing at rodata.
But honestly, this is by far the most minor option discussed on this thread. Not unrolling the loop is the biggest thing, followed by optimising the table itself.
The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.
This has been a problem getting ARM constant pools to run as execute only (XOM).
The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.
This has been a problem getting ARM constant pools to run as execute only (XOM).
Very good point, and it would be an issue on x86 too when using PKRU/PKS to get voluntary XOM.
Sorry for the generic title, but I can't think of a better categorisation (other than perhaps "code generation so bad I'd like my money back" :smiley:).
Full example: https://godbolt.org/z/MKc7Meh5x
This is a real piece of logic for auditing guest updates to a register (x86's MSR_PAT specifically), which is a slowpath in traditional virtualisation, but a fairly fastpath for nested virtualisation. Given:
the code generation at -O1 looks pretty good. GCC manages slightly better (by dropping another conditional jump in the loop body), but it's a simple loop. (More on this later.)

However, the code generation at -O2 is outrageous, and has a number of issues:

- On each iteration of the loop (and therefore duplicated 8 times), the same constant is reloaded into %edx despite the register not being clobbered.
- The return value for the function is a 0 from xor %eax in the first instruction, or picked up as a 1 from the .Lswitch.table.check_pat:. I don't even know what to call this transformation, but it would be far better replaced with bt as per earlier iterations, and a single setc %al to drop the memory load and 32 byte(!) table.

This loop should not be unrolled at any optimisation level. It's a fixed number of iterations with a simple induction variable, so can be predicted perfectly on even ~10yo hardware. The loop carry dependency is trivial, as it's data shifted out of val one byte at a time, and there's no latency-sensitive work which can be shuffled earlier. Furthermore, decode bandwidth which would be decoding beyond the loop is wasted re-decoding the same uops which could be served from the uop cache.

Genuinely, the -O1 code generation is far preferable to anything that higher optimisation levels spit out, in terms of binary size, runtime speed, and power utilisation. (This example is too small, but longer loops which fit in the uop cache will allow power savings.)