Open llvmbot opened 5 years ago
Volatile load bug filed here llvm/llvm-bugzilla-archive#42922
That definitely looks wrong to me too.
If we make that load volatile, shouldn't it be executed unconditionally before the branch, not only on one leg of the control flow? I think there may be a volatile bug in our cmov expansion.
I think this (along with llvm/llvm-bugzilla-archive#42903 ) is intended to be (part of) a request for some kind of cryptographic side-channel aware optimization mode.
We're never going to make the kind of promises desired in normal compilation because it would be bad for the vast majority of code that doesn't particularly care whether a conditional branch, a csel, or ALU bit-banging is used as long as it goes fast and gets the right result.
Not Nathaniel, but even in normal opt mode, a branch is pretty much always a bad thing in these cases, especially for vectors, and especially if it happens to be within the innermost loop.
...
I think this (along with llvm/llvm-bugzilla-archive#42903 ) is intended to be (part of) a request for some kind of cryptographic side-channel aware optimization mode.
We're never going to make the kind of promises desired in normal compilation because it would be bad for the vast majority of code that doesn't particularly care whether a conditional branch, a csel, or ALU bit-banging is used as long as it goes fast and gets the right result.
I've heard the whole concept mentioned once or twice in the past, but never anyone seriously think about tackling it in LLVM.
Nathaniel: I'd probably suggest filing another bug that's explicit about your goals and only uses these as examples of the kind of thing LLVM currently does against those goals. As you've seen here, keeping a bug centred around a specific transformation on track for crypto is not easy.
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
The cmov expansion pass always expands a cmov that involves a load I think.
I'd say no. IR-canonical and target-optimal IR need not be identical things. This is traditionally solved by a backend undo transform.
IIUC, the problem here is that instcombine turns

```llvm
define void @foo_opt(i64* %x, i1 %c) {
  %ext = zext i1 %c to i64
  %sub = sub i64 %ext, 1
  %val = load i64, i64* %x, align 16
  %and = and i64 %sub, %val
  store i64 %and, i64* %x
  ret void
}
```
into
```llvm
define void @foo_opt(i64* nocapture %x, i1 %c) local_unnamed_addr #0 {
  %val = load i64, i64* %x, align 16
  %and = select i1 %c, i64 0, i64 %val
  store i64 %and, i64* %x, align 4
  ret void
}
```
On the IR level, this is a clear win, but for architectures that do not have good support for select, lowering the select is more expensive than the original sequence.
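To make the two IR forms above concrete, here is a hedged C rendering of them. The function names and the driver values are mine, not from the original report; both functions compute `*x = c ? 0 : *x`, and they agree on all concrete inputs — the difference is only in how they lower.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless original: the zext/sub/and masking sequence.
   When c == 0, sub is all-ones and the value passes through;
   when c == 1, sub is zero and the value is cleared. */
static void foo_masked(uint64_t *x, int c) {
    uint64_t ext = (uint64_t)(c & 1); /* zext i1 %c to i64 */
    uint64_t sub = ext - 1;           /* sub i64 %ext, 1 */
    *x = sub & *x;                    /* and i64 %sub, %val */
}

/* What instcombine canonicalizes it to: a select. */
static void foo_select(uint64_t *x, int c) {
    *x = (c & 1) ? (uint64_t)0 : *x;
}
```

For every concrete `c` and `*x` the two produce identical results, which is why instcombine considers the rewrite a win at the IR level; the complaint in this bug is about what the select becomes after lowering.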
For example, on AArch64, we generate the following for the select version
```asm
        ldr   x8, [x0]
        tst   w1, #0x1
        csel  x8, xzr, x8, ne
        str   x8, [x0]
        ret
```
and the following for the non-select version
```asm
        ldr   x8, [x0]
        // kill: def $w1 killed $w1 def $x1
        and   x9, x1, #0x1
        sub   x9, x9, #1 // =1
        and   x8, x9, x8
        str   x8, [x0]
        ret
```
Maybe we should estimate the cost of selects for the target when doing such transforms?
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
I'm pretty sure I don't want CMOV here at all. This code is trying to be constant-time and constant-cache.
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
So basically, we should be doing the following transform in the backend: https://rise4fun.com/Alive/zPA

```
Name: naive
%r = select i1 %c, i64 0, i64 %val
=>
%ext = zext i1 %c to i64
%sub = sub i64 %ext, 1
%r = and i64 %sub, %val
```
BUT: it's illegal if %val is poison. We do have poison in the backend; that is not a middle-end-only concern, right?
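The poison problem is why the transform is only valid in one direction: `select` does not propagate poison from the non-chosen arm, but `sub` and `and` propagate poison unconditionally. A sketch of the counterexample, in the same style as the Alive transform above (my reconstruction, not from the thread):

```
; take %c = true, %val = poison
%r = select i1 true, i64 0, i64 poison   ; %r = 0 (poison arm not chosen)
=>
%ext = zext i1 true to i64               ; %ext = 1
%sub = sub i64 %ext, 1                   ; %sub = 0
%r   = and i64 %sub, poison              ; %r = poison, NOT 0
```

So the rewritten sequence can produce poison where the select was well-defined, which makes the backend transform invalid unless %val is known not to be poison (for example, after a freeze).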
Extended Description
First, I'm not sure if I've picked the correct component. Please move it elsewhere if needed.
An example of the problem is here:
https://godbolt.org/z/mTzWxX
Notice that the optimized code introduces a conditional operation which isn't great given the fact that the input was crafted to avoid conditional operations.
I ran this through llc in a loop over architectures, and the problem occurs on all of them.