llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

loop-unrolling anti-optimisation at -O2 #54644

Open andyhhp opened 2 years ago

andyhhp commented 2 years ago

Sorry for the generic title, but I can't think of a better categorisation (other than perhaps "code generation so bad I'd like my money back" :smiley:).

Full example: https://godbolt.org/z/MKc7Meh5x

This is a real piece of logic for auditing guest updates to a register (x86's MSR_PAT specifically), which is a slowpath in traditional virtualisation, but a fairly fastpath for nested virtualisation. Given:

#include <stdint.h>

int check_pat(uint64_t val)
{
    unsigned int i;

    for ( i = 0; i < 8; i++, val >>= 8 )
    {
        switch ( val & 0xff )
        {
        case 0:
        case 1:
        case 4:
        case 5:
        case 6:
        case 7:
            continue;

        default:
            return 0;
        }
    }

    return 1;
}

the code generation at -O1 looks pretty good. GCC manages slightly better (by dropping another conditional jump from the loop body), but it's a simple loop. (More on this later.)

check_pat:                              # @check_pat
        movl    $8, %eax
        jmp     .LBB0_1
.LBB0_4:                                #   in Loop: Header=BB0_1 Depth=1
        shrq    $8, %rdi
        decl    %eax
        je      .LBB0_5
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        leal    -4(%rdi), %ecx
        cmpb    $4, %cl
        jb      .LBB0_4
        cmpb    $1, %dil
        jbe     .LBB0_4
        xorl    %eax, %eax
        retq
.LBB0_5:
        movl    $1, %eax
        retq

However, the code generation at -O2 is outrageous:

check_pat:                              # @check_pat
        xorl    %eax, %eax
        cmpb    $7, %dil
        ja      .LBB0_16
        movzbl  %dil, %ecx
        movl    $243, %edx
        btl     %ecx, %edx
        jae     .LBB0_16
        movq    %rdi, %rcx
        shrq    $8, %rcx
        cmpb    $7, %cl
        ja      .LBB0_16
        movzbl  %cl, %ecx
        movl    $243, %edx
        btl     %ecx, %edx
        jae     .LBB0_16

# <snip lots of repeats>

        movq    %rdi, %rcx
        shrq    $59, %rcx
        jne     .LBB0_16
        sarq    $56, %rdi
        movl    .Lswitch.table.check_pat(,%rdi,4), %eax
.LBB0_16:
        retq
.Lswitch.table.check_pat:
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   0                               # 0x0
        .long   0                               # 0x0
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   1                               # 0x1

which has a number of issues:

  1. The loop is unrolled (again, more on this later)
  2. In each iteration of the loop (and therefore duplicated 8 times in the unrolled code), the same constant is reloaded into %edx despite the register not being clobbered
  3. The return value for the function is either the 0 from the xor %eax in the first instruction, or a 1 picked up from .Lswitch.table.check_pat. I don't even know what to call this transformation, but it would be far better replaced with bt as per the earlier iterations, plus a single setc %al, dropping the memory load and the 32-byte(!) table (sketched below).
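
To illustrate point 3, here is a rough C-level sketch of the check the final byte needs (pat_byte_ok is a made-up name, not from the real code): the acceptable types form the bitmask 0xF3, i.e. the 243 already materialised in the -O2 output, so a shift-and-mask lowers naturally to bt plus setc %al instead of a table load.

static inline int pat_byte_ok(unsigned int b)
{
    /* b is the byte under test, already known to be < 8 at this point.
     * Bit b of 0xF3 is set exactly for the acceptable types 0, 1, 4, 5,
     * 6 and 7, so this compiles down to bt + setc rather than a 32-byte
     * table and a load. */
    return (0xF3u >> b) & 1;
}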

This loop should not be unrolled at any optimisation level. It's a fixed number of iterations with a simple induction variable, so it can be predicted perfectly even on ~10-year-old hardware. The loop-carried dependency is trivial, as it's just data shifted out of val one byte at a time, and there's no latency-sensitive work which can be shuffled earlier. Furthermore, decode bandwidth which could be spent decoding beyond the loop is wasted re-decoding the same uops, which could otherwise be served from the uop cache.

Genuinely, the -O1 code generation is far preferable to anything that higher optimisation levels spit out, in terms of binary size, runtime speed, and power utilisation. (This example is too small, but longer loops which fit in the uop cache will allow power savings.)

hiraditya commented 2 years ago

Maybe disabling loop unrolling (-fno-unroll-loops) is better for this case.

In each iteration of the loop (and therefore duplicated 8 times in the unrolled code), the same constant is reloaded into %edx despite the register not being clobbered

Seems like we are missing an optimization after instruction selection. ISel creates that move and it is never hoisted.

# *** IR Dump After Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function check_pat: IsSSA, TracksLiveness
Function Live Ins: $rdi in %2 

bb.0.entry:
  successors: %bb.9(0x09249249), %bb.10(0x76db6db7); %bb.9(7.14%), %bb.10(92.86%)
  liveins: $rdi
  %2:gr64_with_sub_8bit = COPY $rdi
  %4:gr8 = COPY %2.sub_8bit:gr64_with_sub_8bit
  %3:gr32 = MOV32r0 implicit-def dead $eflags
  %5:gr8 = SUB8ri %4:gr8(tied-def 0), 7, implicit-def $eflags
  JCC_1 %bb.9, 7, implicit $eflags

bb.10.entry:
; predecessors: %bb.0 
  successors: %bb.1(0x76276276), %bb.9(0x09d89d8a); %bb.1(92.31%), %bb.9(7.69%)

  %6:gr32 = MOVZX32rr8 %4:gr8
  %7:gr32 = MOV32ri 243   <--------------------------------------
  BT32rr killed %7:gr32, killed %6:gr32, implicit-def $eflags
  JCC_1 %bb.1, 2, implicit $eflags
  JMP_1 %bb.9 
hiraditya commented 2 years ago

Can global isel avoid this situation?

andyhhp commented 2 years ago

Maybe disabling loop unrolling (-fno-unroll-loops) is better for this case.

So yes - if I were micro-optimising, I could, but this was a tiny example taken out of a hypervisor, and throwing -fno-unroll-loops around at the top level would be wholly inappropriate. Clearly there are some issues with the default decisions about unrolling, and small loops like this are not a rare pattern across a kernel, so more appropriate unrolling decisions could yield quite a large improvement overall.
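
For completeness, the unrolling decision can also be scoped to a single loop rather than the whole translation unit; a minimal sketch, assuming Clang's loop pragma is acceptable in the codebase (it doesn't change the point that the default heuristics ought to improve):

#include <stdint.h>

int check_pat(uint64_t val)
{
    unsigned int i;

    /* Ask Clang to keep this one loop rolled, without a global
     * -fno-unroll-loops. */
#pragma clang loop unroll(disable)
    for ( i = 0; i < 8; i++, val >>= 8 )
    {
        switch ( val & 0xff )
        {
        case 0: case 1: case 4: case 5: case 6: case 7:
            continue;

        default:
            return 0;
        }
    }

    return 1;
}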

Seems like we are missing an optimization after instruction selection.

So one thing I did wonder. bt is part of a group of 4 instructions, along with Bit Set/Complement/Reset. Bit Test is the odd one out, being a read-only, non-destructive instruction. I don't know how to read IR, but I'm suspicious of the killed %7:gr32, killed %6:gr32 because both of the registers containing those values are still valid after the instruction completes.

hiraditya commented 2 years ago

I don't know how to read IR, but I'm suspicious of the killed %7:gr32, killed %6:gr32 because both of the registers containing those values are still valid after the instruction completes.

That's a good observation! The kill is added during instruction selection, which works at a basic-block level. I think we need some sort of value propagation (available expressions) to fix this. In principle, I'd expect global isel to trivially take care of this particular issue, although the larger issue of not having value propagation in the backend would still remain.

nickdesaulniers commented 2 years ago

cc @zmodem for the bit test block

zmodem commented 2 years ago

cc @zmodem for the bit test block

The only problem I see with the bit test block is the repeated "movl $243, %edx", which I think we can't fix in the switch lowering since that's local to a basic block. I'm surprised there's nothing which cleans that up afterwards (MachineCSE?).

The return value for the function is either the 0 from the xor %eax in the first instruction, or a 1 picked up from .Lswitch.table.check_pat. I don't even know what to call this transformation, but it would be far better replaced with bt as per the earlier iterations, plus a single setc %al, dropping the memory load and the 32-byte(!) table.

The transformation is "switch to lookup table" (https://github.com/llvm/llvm-project/blob/llvmorg-14.0.0/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L5837).

This is generally a good transformation, as in most cases it replaces an indirect branch through a jump table with a direct load of the return value from a lookup table. But in your case it's clearly not a win.
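
To make the shape of the transformation concrete, a hedged source-level sketch (the pass works on IR, and the names and values here are invented, not taken from your example): a switch that only selects a value becomes a bounds check plus a load from a generated constant table.

/* Before: the switch only picks a value. */
int select_before(unsigned int x)
{
    switch (x) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    default: return 0;
    }
}

/* After: a bounds check plus a load from a constant table, replacing
 * the jump table / compare chain. */
int select_after(unsigned int x)
{
    static const int table[4] = { 10, 20, 30, 40 };

    return x < 4 ? table[x] : 0;
}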

The missing piece in LLVM is that it doesn't narrow the 32-bit return values. If it did, it would pack the lookup table into a register and use "bt" for the lookup. You can see it working when changing the return type of your function to bool (https://godbolt.org/z/ojnbfnxd4). This is tracked as https://github.com/llvm/llvm-project/issues/29879.

andyhhp commented 2 years ago

The transformation is "switch to lookup table"

Thanks!

I can absolutely see why this is a useful transformation in the general case. These days, kernels are all built with retpolines, which implies -fno-jump-tables, so converting the entire switch statement is absolutely a win. [Edit: the implicit no-jump-tables doesn't inhibit this optimisation, but the explicit flag does.]

I suppose what confused me so much about this was the fact that only the final iteration of the loop had this transformation applied. I would have expected 8 instances or none.

Experimenting with code gen options, it turns out that -fno-jump-tables inhibits this optimisation, and causes a bt to be used. With -fno-unroll-loops, the code gen is as per -O1, again with no transformation to a lookup table.

For the lookup table itself, playing with the function is interesting. It clearly shows that the transformation is linked to the return type. Is this perhaps because, once the loop has been unrolled, the continue; inside the final switch can/does become return 1;, leaving that switch with exclusively return paths?

It really would be a good idea to optimise the table, rather than fixing the element size at the return type size. Using e.g. movzbl would quarter the size of the table in this example. int is an incredibly common return type and I'd wager that most examples of this transformation don't require 32 bits' worth of range. It is possibly even worth spotting where you can shrink the element size at the cost of one extra add $slide, %val (probably a certain improvement for -Os). This would also provide the opportunity to convert to a bit test in cases such as this one.
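
A hedged sketch of what I mean in source-level terms (names and values invented for illustration): the entries are stored as bytes and widened on load, and where they don't fit in a byte but share a common base, the slide is added back after the load.

#include <stdint.h>

/* 8 bytes of table instead of 32; a movzbl does the widening back to int. */
static const uint8_t narrow_table[8] = { 1, 1, 0, 0, 1, 1, 1, 1 };

int lookup_narrow(unsigned int idx)
{
    return narrow_table[idx & 7];
}

/* Values 1000, 1010, 1025 and 1200 don't fit in a byte, but their
 * offsets from 1000 do, at the cost of one extra add after the load. */
static const uint8_t slid_table[4] = { 0, 10, 25, 200 };

int lookup_with_slide(unsigned int idx)
{
    return 1000 + slid_table[idx & 3];
}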

zmodem commented 2 years ago

I suppose what confused me so much about this was the fact that only the final iteration of the loop had this transformation applied. I would have expected 8 instances or none.

It's only in the last iteration that the "val & 0xff" is mapped directly to a return value, which is where the transformation kicks in. In the other iterations there is different control flow, either returning or continuing with the next value.
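
In source-level terms, a hedged sketch of the unrolled shape (pat_ok is an invented helper, not real compiler output): the first seven bytes only choose between returning 0 and falling through, while the last byte's switch maps its input straight to the return value, which is the only place the pattern matches.

#include <stdint.h>

/* Invented helper for illustration: is this byte an acceptable type? */
static int pat_ok(unsigned int b)
{
    return b < 8 && ((0xF3u >> b) & 1);
}

int check_pat_sketch(uint64_t val)
{
    if (!pat_ok( val        & 0xff)) return 0;   /* byte 0: control flow only */
    if (!pat_ok((val >>  8) & 0xff)) return 0;   /* byte 1 */
    if (!pat_ok((val >> 16) & 0xff)) return 0;   /* byte 2 */
    if (!pat_ok((val >> 24) & 0xff)) return 0;   /* byte 3 */
    if (!pat_ok((val >> 32) & 0xff)) return 0;   /* byte 4 */
    if (!pat_ok((val >> 40) & 0xff)) return 0;   /* byte 5 */
    if (!pat_ok((val >> 48) & 0xff)) return 0;   /* byte 6 */
    return pat_ok(val >> 56);                    /* byte 7: selects the value */
}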

Experimenting with code gen options, it turns out that -fno-jump-tables inhibits this optimisation, and causes a bt to be used.

That was due to https://reviews.llvm.org/D35579. Looking back, I'm not sure how well motivated that actually was.

It really would be a good idea to optimise the table, rather than fixing the element size at the return type size. Using e.g. movzbl would quarter the size of the table in this example. int is an incredibly common return type and I'd wager that most examples of this transformation don't require 32 bits' worth of range.

I agree. It's been on the todo-list a long time, and you pointing this out has bumped it higher on that list I'd say.

It is possibly even worth spotting where you can shrink the element size at the cost of one extra add $slide, %val

Yes, that seems worth doing.

andyhhp commented 2 years ago

Why does being mapped to a return value matter? There are plenty of switch statements which aren't the sole contents of a function which could potentially benefit.

I do understand the concern with the random memory reference. It has poor locality of reference, and is likely to miss in the cache unless the switch statement is on a hot path. A superscalar processor can probably absorb this stall, while simpler designs probably can't. That said, blindly disabling this optimisation is equally bad, because for every architecture/design there will be a point at which such a stall is still less overhead than executing the if/else chain.

In the simple case, optimising the table size improves the locality of reference simply by making the executable smaller, and therefore making the table more likely to already be in the TLB/cache from other references into .rodata.

While I have reservations about making this suggestion, it is worth mentioning just for completeness. On x86 at least, the locality can be improved further by emitting the table beside the function, because the translation will be present in the L2TLB and the prefetcher will likely have pulled the lines into some level of the cache. The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource. It can in principle also be speculatively executed, but this isn't semantically different from speculative execution due to a bad earlier prediction landing in the middle of an instruction.

zmodem commented 2 years ago

Why does being mapped to a return value matter? There are plenty of switch statements which aren't the sole contents of a function which could potentially benefit.

What I was trying to say was that the "switch to lookup table" transformation only works for switches which are used to select among values, not for switches in general where different inputs generate different control flow.

In the example, the first switches aren't used to select a value, but to select control flow ("return 0" or "continue to next iteration"), so this transformation doesn't apply to them.

On x86 at least, the locality can be improved further by emitting the table beside the function.

Yes, that could be applied to both lookup tables and jump tables. It doesn't look like other compilers do it though: https://godbolt.org/z/PWK4h4sEr Maybe it doesn't play well with split instruction and data caches?

The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.

I wonder if this would still be a problem if the compiler made sure to, for example, always put it after a return instruction or similar.

andyhhp commented 2 years ago

What I was trying to say was that the "switch to lookup table" transformation only works for switches which are used to select among values, not for switches in general where different inputs generate different control flow.

Take this variation of the original code, which is logically equivalent. https://godbolt.org/z/1hdPr38fv The control flow isn't semantically different between the penultimate iteration and the final iteration. I can entirely believe that this might be harder to spot, but the transformation is equally applicable to each iteration of the loop.

Yes, that could be applied to both lookup tables and jump tables.

Yes.

It doesn't look like other compilers do it though: https://godbolt.org/z/PWK4h4sEr

This particular example can be optimised as:

    if (x == 3)
        h(8);
    else if (x <= 6)
        g(.Lswitch.table[x]);

as there isn't an obvious transformation between x and g()'s input. If you're going to take the hit of a memory reference, this form is preferable, particularly in a post-Spectre world where indirect branches are far more expensive. (Again, I could entirely believe that this case is very hard to effectively spot/optimise.)

Maybe it doesn't play well with split instruction and data caches?

In this case, we're talking about rodata, so it's fine in general. (Having a write hit an in-flight instruction is devastating for performance.) Here, you're trading off potential uarch hits from speculatively decoding/executing the table against uarch hits from bad data locality.

I wonder if this would still be a problem if the compiler made sure to, for example, always put it after a return instruction or similar.

In short, yes. Several microarchitectures suffer from Straight Line Speculation, including past rets. Some microarchitectures have truncated prediction structures. Some microarchitectures have short and long targets in the indirect predictor, where short targets share the upper bits between source and destination.

Branch prediction is not perfect, and there is a nonzero chance of the table getting speculatively decoded and/or executed. What matters is whether this is more or less overhead than the memory reference pointing at rodata.

But honestly, this is by far the most minor option discussed on this thread. Not unrolling the loop is the biggest thing, followed by optimising the table itself.

nickdesaulniers commented 2 years ago

The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.

This has been a problem getting ARM constant pools to run as execute only (XOM).

andyhhp commented 2 years ago

The downside of placing the data next to text is that it can in principle be decoded and take up frontend resource.

This has been a problem getting ARM constant pools to run as execute only (XOM).

Very good point, and it would be an issue on x86 too when using PKRU/PKS to get voluntary XOM.