Open llvmbot opened 5 years ago
Volatile load bug filed here llvm/llvm-bugzilla-archive#42922
That definitely looks wrong to me too.
If we make that load volatile, shouldn't it be executed unconditionally before the branch, not only on one leg of the control flow? I think there may be a volatile bug in our cmov expansion.
I think this (along with llvm/llvm-bugzilla-archive#42903 ) is intended to be (part of) a request for some kind of cryptographic side-channel aware optimization mode.
We're never going to make the kind of promises desired in normal compilation because it would be bad for the vast majority of code that doesn't particularly care whether a conditional branch, a csel, or ALU bit-banging is used as long as it goes fast and gets the right result.
Not Nathaniel, but even in normal opt mode, a branch is pretty much always a bad thing in these cases, especially for vectors, and especially if it happens to be within the innermost loop.
...
I think this (along with llvm/llvm-bugzilla-archive#42903 ) is intended to be (part of) a request for some kind of cryptographic side-channel aware optimization mode.
We're never going to make the kind of promises desired in normal compilation because it would be bad for the vast majority of code that doesn't particularly care whether a conditional branch, a csel, or ALU bit-banging is used as long as it goes fast and gets the right result.
I've heard the whole concept mentioned once or twice in the past, but never anyone seriously think about tackling it in LLVM.
Nathaniel: I'd probably suggest filing another bug that's explicit about your goals and only uses these as examples of the kind of thing LLVM currently does against those goals. As you've seen here, keeping a bug centred around a specific transformation on track for crypto is not easy.
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
The cmov expansion pass always expands a cmov that involves a load I think.
I'd say no. IR-canonical and target-optimal IR need not be identical things. This is traditionally solved by a backend undo transform.
IIUC, the problem here is that instcombine turns

```llvm
define void @foo_opt(i64* %x, i1 %c) {
  %ext = zext i1 %c to i64
  %sub = sub i64 %ext, 1
  %val = load i64, i64* %x, align 16
  %and = and i64 %sub, %val
  store i64 %and, i64* %x
  ret void
}
```
into
```llvm
define void @foo_opt(i64* nocapture %x, i1 %c) local_unnamed_addr #0 {
  %val = load i64, i64* %x, align 16
  %and = select i1 %c, i64 0, i64 %val
  store i64 %and, i64* %x, align 4
  ret void
}
```
On the IR level, this is a clear win, but for architectures that do not have good support for select, lowering the select is more expensive than the original sequence.
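To make the two IR forms above concrete, here is a hedged C rendering of them. The function names and the driver values are mine, not from the original report; both functions compute `*x = c ? 0 : *x`, and they agree on all concrete inputs — the difference is only in how they lower.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless original: the zext/sub/and masking sequence.
   When c == 0, sub is all-ones and the value passes through;
   when c == 1, sub is zero and the value is cleared. */
static void foo_masked(uint64_t *x, int c) {
    uint64_t ext = (uint64_t)(c & 1); /* zext i1 %c to i64 */
    uint64_t sub = ext - 1;           /* sub i64 %ext, 1 */
    *x = sub & *x;                    /* and i64 %sub, %val */
}

/* What instcombine canonicalizes it to: a select. */
static void foo_select(uint64_t *x, int c) {
    *x = (c & 1) ? (uint64_t)0 : *x;
}
```

For every concrete `c` and `*x` the two produce identical results, which is why instcombine considers the rewrite a win at the IR level; the complaint in this bug is about what the select becomes after lowering.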
For example, on AArch64, we generate the following for the select version
```asm
        ldr   x8, [x0]
        tst   w1, #0x1
        csel  x8, xzr, x8, ne
        str   x8, [x0]
        ret
```
and the following for the non-select version
```asm
        ldr   x8, [x0]
        // kill: def $w1 killed $w1 def $x1
        and   x9, x1, #0x1
        sub   x9, x9, #1 // =1
        and   x8, x9, x8
        str   x8, [x0]
        ret
```
Maybe we should estimate the cost of selects for the target when doing such transforms?
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
I'm pretty sure I don't want CMOV here at all. This code is trying to be constant-time and constant-cache.
Alternatively, as it can be observed from -ast-dump, we did match this as X86ISD::CMOV first, but then X86CmovConverter expanded it; perhaps said expansion heuristics should be tuned?
So basically, we should be doing the following transform in the backend: https://rise4fun.com/Alive/zPA

```
Name: naive
%r = select i1 %c, i64 0, i64 %val
=>
%ext = zext i1 %c to i64
%sub = sub i64 %ext, 1
%r = and i64 %sub, %val
```
BUT: it's illegal if %val is poison. We do have poison in the backend; that is not a middle-end-only concern, right?
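The poison problem is why the transform is only valid in one direction: `select` does not propagate poison from the non-chosen arm, but `sub` and `and` propagate poison unconditionally. A sketch of the counterexample, in the same style as the Alive transform above (my reconstruction, not from the thread):

```
; take %c = true, %val = poison
%r = select i1 true, i64 0, i64 poison   ; %r = 0 (poison arm not chosen)
=>
%ext = zext i1 true to i64               ; %ext = 1
%sub = sub i64 %ext, 1                   ; %sub = 0
%r   = and i64 %sub, poison              ; %r = poison, NOT 0
```

So the rewritten sequence can produce poison where the select was well-defined, which makes the backend transform invalid unless %val is known not to be poison (for example, after a freeze).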
Extended Description
First, I'm not sure if I've picked the correct component. Please move it elsewhere if needed.
An example of the problem is here:
https://godbolt.org/z/mTzWxX
Notice that the optimized code introduces a conditional operation which isn't great given the fact that the input was crafted to avoid conditional operations.
I ran this through llc in a loop over architectures, and the problem occurs on all of them.