JasonGross opened 4 years ago
Long-term, I think we need language support for this sort of thing. In the meantime, yeah, the compilers seem to have gotten smart enough, so we have a derpy function we can sprinkle in as needed. It's terrible and there is no principled answer for where to use it, but... so it goes. :-(
One more addition to a long list of reasons to eventually try to bypass the C compiler by producing something closer to machine code.
That too. :-) Although projects often need to support random generic targets, so anything close to machine code tends to be in addition to rather than instead of C. I'm hopeful the long-term answer is compiler support.
(Then again, `value_barrier` doesn't work for no-asm builds anyway...)
@chjj said in #840:

> I figure @davidben's solution is probably better. In general, using a barrier doesn't seem to deoptimize code as completely as `volatile` does, but it doesn't account for compilers that don't support GNU-style inline asm (though, I suppose the compiler can be assumed to support GNU extensions for any fiat backend that uses `__int128`). Assuming there is a compiler out there that optimizes as "well" as Clang without supporting GNU-style asm, `volatile` might be a better solution.
Would doing something like

```c
static inline uint64_t value_barrier_u64(uint64_t a) {
#if !defined(OPENSSL_NO_ASM) && (defined(__GNUC__) || defined(__clang__))
  __asm__("" : "+r"(a) : /* no inputs */);
  return a;
#else
  volatile uint64_t b = a;
  return b;
#endif
}
```

work?
@JasonGross, if it doesn't work, I'm almost certain this hideous thing will:

```c
#if defined(__GNUC__) || defined(__clang__)
# define fiat_volatile
static __inline__ uint64_t fiat_barrier(uint64_t a) {
  __asm__("" : "+r"(a) : /* no inputs */);
  return a;
}
#else
# define fiat_volatile volatile
# define fiat_barrier(x) (x)
#endif

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  fiat_volatile fiat_p256_uint1 x1;
  uint64_t x2;
  uint64_t x3;
  x1 = (!(!arg1));
  x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  x3 = ((fiat_barrier(x2) & arg3) | (fiat_barrier(~x2) & arg2));
  *out1 = x3;
}
```
(Aside: `inline` is C99. I've changed it to `__inline__`, which should be acceptable since we already know the compiler supports GNU C.)
Will test and report back.
Changing variable types in specific locations to introduce `volatile` is highly non-trivial. Adding an extra function wrapper in some places is significantly simpler.
@JasonGross, your solution seems to work well.
After playing around with it a lot, I would go with something like:
```c
#if defined(__GNUC__) || defined(__clang__)
static __inline__ uint64_t fiat_p256_barrier(uint64_t x) {
  __asm__ ("" : "+r" (x) ::);
  return x;
}
#else
static uint64_t fiat_p256_barrier(uint64_t x) {
  volatile uint64_t y = x;
  return y;
}
#endif

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  fiat_p256_uint1 x1;
  uint64_t x2;
  uint64_t x3;
  x1 = (!(!arg1));
  x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  x3 = ((fiat_p256_barrier(x2) & arg3) | (fiat_p256_barrier(~x2) & arg2));
  *out1 = x3;
}
```
For reference, here is the code Clang generates with `-O3`:
No barrier:

```asm
0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00: 40 84 f6       test   %sil,%sil
   22b03: 48 0f 45 d1    cmovne %rcx,%rdx
   22b07: 48 89 17       mov    %rdx,(%rdi)
   22b0a: c3             retq
```
Inline asm barrier:

```asm
0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00: 40 f6 de       neg    %sil
   22b03: 48 19 c0       sbb    %rax,%rax
   22b06: 48 89 c6       mov    %rax,%rsi
   22b09: 48 21 ce       and    %rcx,%rsi
   22b0c: 48 f7 d0       not    %rax
   22b0f: 48 21 d0       and    %rdx,%rax
   22b12: 48 09 f0       or     %rsi,%rax
   22b15: 48 89 07       mov    %rax,(%rdi)
   22b18: c3             retq
```
Volatile barrier:

```asm
0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00: 40 f6 de          neg    %sil
   22b03: 48 19 c0          sbb    %rax,%rax
   22b06: 48 89 44 24 f8    mov    %rax,-0x8(%rsp)
   22b0b: 48 23 4c 24 f8    and    -0x8(%rsp),%rcx
   22b10: 48 f7 d0          not    %rax
   22b13: 48 89 44 24 f8    mov    %rax,-0x8(%rsp)
   22b18: 48 23 54 24 f8    and    -0x8(%rsp),%rdx
   22b1d: 48 09 ca          or     %rcx,%rdx
   22b20: 48 89 17          mov    %rdx,(%rdi)
   22b23: c3                retq
```
Volatile modifier on variable declaration:

```asm
0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00: 40 f6 de          neg    %sil
   22b03: 48 19 c0          sbb    %rax,%rax
   22b06: 48 89 44 24 f0    mov    %rax,-0x10(%rsp)
   22b0b: 48 23 4c 24 f0    and    -0x10(%rsp),%rcx
   22b10: 48 8b 44 24 f0    mov    -0x10(%rsp),%rax
   22b15: 48 f7 d0          not    %rax
   22b18: 48 21 d0          and    %rdx,%rax
   22b1b: 48 09 c8          or     %rcx,%rax
   22b1e: 48 89 44 24 f8    mov    %rax,-0x8(%rsp)
   22b23: 48 8b 44 24 f8    mov    -0x8(%rsp),%rax
   22b28: 48 89 07          mov    %rax,(%rdi)
   22b2b: c3                retq
```
The inline asm barrier definitely seems to be the best and the `volatile` keyword on the declaration the worst, with the "volatile barrier" somewhere in the middle.
It might be worth just using the "volatile barrier" to avoid inline assembly completely, unless we really care about eliminating that one extra instruction.
edit: I suppose the inline asm barrier still has an advantage in that it operates entirely on registers, whereas the volatile barrier shuffles values through the stack. I'm not sure how much overhead that adds, or whether the picture stays the same once the function is inlined; I'd have to investigate more deeply to find out.
I've been thinking about it more, and I think another solution is to obfuscate the code a bit.
I believe the real problem is this line:

```c
x1 = (!(!arg1));
```

Clang knows for certain that `x1` is either 1 or 0 due to the boolean cast; its optimization hinges entirely on this assumption. Removing the cast:

```c
x1 = arg1;
```

generates constant-time code.
Of course, we would need to ensure that `arg1` is always either 1 or 0 (which seems to be the case in practice, but perhaps more needs to be done here?).
Looking at the function in isolation won't work. While, in isolation, the compiler may not know what's going on, inlining will give the optimizer visibility into the caller, where it could figure this out. Putting it in a separate file won't work either because of LTO.
We got the empty assembly trick from a Clang developer, as a reasonably optimizer-stable way to apply unprincipled point fixes as problems come up. Nothing here is going to be principled without language support for constant-time code. The best we can do is find things that are reasonably stable across optimizer changes and don't completely destroy performance, so we can apply point fixes as issues come up, until the language support is available.
To that end, the `volatile` thing seems quite undesirable, as it'll spill things to memory when it works and do nothing when it doesn't. (What `volatile` even means for local variables, I have no idea. I think it's mostly for `setjmp` and `longjmp` silliness, as an artifact of history.)
@davidben, yeah, I realize now that some of my thinking has been flawed here. I've been testing this with a ctgrind test suite to avoid some of the issues you point out, but there's probably no general way to account for how these functions will be called in practice.
Case in point: the library I'm using to test this is relatively unique in that it calls fiat functions via function pointers assigned at runtime, which undoubtedly prevents inlining of any kind. While my library does that, some other library will likely call them directly. I imagine someone directly calling `fiat_p256_selectznz` in a situation where the optimizer could determine the bounds of `arg1` (i.e. 0 or 1) could invalidate my last proposed "fix".
> the volatile thing seems quite undesirable as it'll spill things to memory when it works, and do nothing when it doesn't.
I see. I suppose the asm value barrier alone is the best stop-gap fix then.
> until the language support is available
Are there any proposals for this currently? This seems like it would be a tricky thing to standardize.
So is there something that prevents this from getting fixed?
I've been holding off on this because it looked like it was a stopgap measure, but I'll go prepare a PR.
Originally posted by @davidben in https://github.com/mit-plv/fiat-crypto/pull/723#issuecomment-613639650