mit-plv / fiat-crypto

Cryptographic Primitive Code Generation by Fiat
http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf

How can/should we get good constant-time C code for cmov? #748

Open JasonGross opened 4 years ago

JasonGross commented 4 years ago

I should footnote that those numbers are for x86_64. They also include the following patch:

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  fiat_p256_uint1 x1 = (!(!arg1));
  uint64_t x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  // Note this line has been patched from the synthesized code to add value
  // barriers.
  //
  // Clang recognizes this pattern as a select. While it usually transforms it
  // to a cmov, it sometimes further transforms it into a branch, which we do
  // not want.
  uint64_t x3 = ((value_barrier_u64(x2) & arg3) | (value_barrier_u64(~x2) & arg2));
  *out1 = x3;
}

value_barrier_u64 is defined as below. It's a goofball hack to stop the compiler from turning cmovs back into branches, because apparently it's doing that now. We really need language-level support for this stuff. :-(

static inline uint64_t value_barrier_u64(uint64_t a) {
#if !defined(OPENSSL_NO_ASM) && (defined(__GNUC__) || defined(__clang__))
  __asm__("" : "+r"(a) : /* no inputs */);
#endif
  return a;
}

Originally posted by @davidben in https://github.com/mit-plv/fiat-crypto/pull/723#issuecomment-613639650

davidben commented 4 years ago

Long-term, I think we need language support for this sort of thing. In the meantime, yeah, the compilers seem to have gotten smart enough, so we have a derpy function we can sprinkle in as needed. It's terrible and there is no principled answer for where to use it, but... so it goes. :-(

jadephilipoom commented 4 years ago

One more addition to a long list of reasons to eventually try to bypass the C compiler by producing something closer to machine code.

davidben commented 4 years ago

That too. :-) Although projects often need to support random generic targets, so anything close to machine code tends to be in addition to rather than instead of C. I'm hopeful the long-term answer is compiler support.

davidben commented 4 years ago

(Then again, value_barrier doesn't work for no-asm builds anyway...)

JasonGross commented 4 years ago

@chjj said in #840 that

I figure @davidben's solution is probably better. In general, using a barrier doesn't seem to completely deoptimize code like volatile does, but it doesn't account for compilers which don't support GNU-style inline asm (though, I suppose it's assumed the compiler supports GNU extensions for any fiat backend which uses __int128). Assuming there is a compiler out there that optimizes as "well" as Clang without supporting GNU-style asm, volatile might be a better solution.

Would doing something like

static inline uint64_t value_barrier_u64(uint64_t a) {
#if !defined(OPENSSL_NO_ASM) && (defined(__GNUC__) || defined(__clang__))
  __asm__("" : "+r"(a) : /* no inputs */);
  return a;
#else
  volatile uint64_t b = a;
  return b;
#endif
}

work?

chjj commented 4 years ago

@JasonGross, if it doesn't work, I'm almost certain this hideous thing will:

#if defined(__GNUC__) || defined(__clang__)
#  define fiat_volatile
static __inline__ uint64_t fiat_barrier(uint64_t a) {
  __asm__("" : "+r"(a) : /* no inputs */);
  return a;
}
#else
#  define fiat_volatile volatile
#  define fiat_barrier(x) (x)
#endif

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  fiat_volatile fiat_p256_uint1 x1;
  uint64_t x2;
  uint64_t x3;
  x1 = (!(!arg1));
  x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  x3 = ((fiat_barrier(x2) & arg3) | (fiat_barrier(~x2) & arg2));
  *out1 = x3;
}

(Aside: inline is C99. I've changed it to __inline__ which should be acceptable since we already know the compiler supports GNU C.)

Will test and report back.

JasonGross commented 4 years ago

Changing variable types in specific locations to introduce volatile is highly non-trivial. Adding an extra function wrapper in some places is significantly simpler.

chjj commented 4 years ago

@JasonGross, your solution seems to work well.

After playing around with it a lot, I would go with something like:

#if defined(__GNUC__) || defined(__clang__)
static __inline__ uint64_t fiat_p256_barrier(uint64_t x) {
  __asm__ ("" : "+r" (x) ::);
  return x;
}
#else
static uint64_t fiat_p256_barrier(uint64_t x) {
  volatile uint64_t y = x;
  return y;
}
#endif

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  fiat_p256_uint1 x1;
  uint64_t x2;
  uint64_t x3;
  x1 = (!(!arg1));
  x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  x3 = ((fiat_p256_barrier(x2) & arg3) | (fiat_p256_barrier(~x2) & arg2));
  *out1 = x3;
}

For reference, here is the code clang is generating with -O3:

No barrier:

0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00:       40 84 f6                test   %sil,%sil
   22b03:       48 0f 45 d1             cmovne %rcx,%rdx
   22b07:       48 89 17                mov    %rdx,(%rdi)
   22b0a:       c3                      retq

Inline ASM barrier:

0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00:       40 f6 de                neg    %sil
   22b03:       48 19 c0                sbb    %rax,%rax
   22b06:       48 89 c6                mov    %rax,%rsi
   22b09:       48 21 ce                and    %rcx,%rsi
   22b0c:       48 f7 d0                not    %rax
   22b0f:       48 21 d0                and    %rdx,%rax
   22b12:       48 09 f0                or     %rsi,%rax
   22b15:       48 89 07                mov    %rax,(%rdi)
   22b18:       c3                      retq

Volatile barrier:

0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00:       40 f6 de                neg    %sil
   22b03:       48 19 c0                sbb    %rax,%rax
   22b06:       48 89 44 24 f8          mov    %rax,-0x8(%rsp)
   22b0b:       48 23 4c 24 f8          and    -0x8(%rsp),%rcx
   22b10:       48 f7 d0                not    %rax
   22b13:       48 89 44 24 f8          mov    %rax,-0x8(%rsp)
   22b18:       48 23 54 24 f8          and    -0x8(%rsp),%rdx
   22b1d:       48 09 ca                or     %rcx,%rdx
   22b20:       48 89 17                mov    %rdx,(%rdi)
   22b23:       c3                      retq

Volatile modifier on variable declaration:

0000000000022b00 <fiat_p256_cmovznz_u64>:
   22b00:       40 f6 de                neg    %sil
   22b03:       48 19 c0                sbb    %rax,%rax
   22b06:       48 89 44 24 f0          mov    %rax,-0x10(%rsp)
   22b0b:       48 23 4c 24 f0          and    -0x10(%rsp),%rcx
   22b10:       48 8b 44 24 f0          mov    -0x10(%rsp),%rax
   22b15:       48 f7 d0                not    %rax
   22b18:       48 21 d0                and    %rdx,%rax
   22b1b:       48 09 c8                or     %rcx,%rax
   22b1e:       48 89 44 24 f8          mov    %rax,-0x8(%rsp)
   22b23:       48 8b 44 24 f8          mov    -0x8(%rsp),%rax
   22b28:       48 89 07                mov    %rax,(%rdi)
   22b2b:       c3                      retq

The inline ASM barrier definitely seems to be the best and the volatile keyword the worst, with the "volatile barrier" somewhere in the middle.

It might be worth just using the "volatile barrier" to avoid inline assembly completely, unless we really care about eliminating that one extra instruction.

edit: I suppose the inline asm barrier still has an advantage since it operates entirely on registers whereas the volatile barrier is messing with stuff on the stack. Not sure how much overhead that adds. Who knows if it's the same when it's inlined. I'd have to investigate that more deeply to find out.
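
One way to sanity-check the inlined case (a hypothetical harness, not from any of the libraries above; it assumes the fiat_p256 definitions earlier in this thread are in the same file) is to force the select into a caller and read the resulting disassembly for branches keyed on the selector:

#include <stdint.h>

/* Hypothetical harness, for illustration only. Because fiat_p256_cmovznz_u64
   is static, this has to live in the same translation unit as the code above.
   The 0/1 range of the selector is deliberately visible to the optimizer, so
   this exercises the case where inlining gives it the most information. */
uint64_t check_inlined_select(const uint64_t* a, const uint64_t* b,
                              const uint8_t* bits, int n) {
  uint64_t acc = 0;
  for (int i = 0; i < n; i++) {
    uint64_t r;
    fiat_p256_cmovznz_u64(&r, (fiat_p256_uint1)(bits[i] & 1), a[i], b[i]);
    acc += r;
  }
  return acc;
}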

chjj commented 4 years ago

I've been thinking about it more, and I think another solution is to obfuscate the code a bit.

I believe the real problem is this line:

  x1 = (!(!arg1));

Clang knows for certain that x1 is either 1 or 0 due to the boolean cast. Its optimization hinges entirely on this assumption.

Removing the cast:

  x1 = arg1;

Generates constant-time code.

Of course, we would need to ensure that arg1 is always either 1 or 0 (which seems to be the case in practice, but perhaps more needs to be done here?).
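
For concreteness, a minimal sketch of that variant (illustrative only; it leans entirely on the caller always passing 0 or 1, which is exactly the assumption that would need to be enforced):

static void fiat_p256_cmovznz_u64(uint64_t* out1, fiat_p256_uint1 arg1, uint64_t arg2, uint64_t arg3) {
  /* Sketch of the variant without the (!(!arg1)) canonicalization. Inside
     this function, taken in isolation, the optimizer can no longer prove that
     x1 is 0 or 1, so it keeps the mask-and-select form. */
  fiat_p256_uint1 x1 = arg1;
  uint64_t x2 = ((fiat_p256_int1)(0x0 - x1) & UINT64_C(0xffffffffffffffff));
  uint64_t x3 = ((x2 & arg3) | ((~x2) & arg2));
  *out1 = x3;
}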

davidben commented 4 years ago

Looking at the function in isolation won't work. While, in isolation, the compiler may not know what's going on, inlining will give the optimizer visibility into the caller, where it could figure this out. Putting it in a separate file won't work either because of LTO.

We got the empty assembly trick from a Clang developer, as a reasonably optimizer-stable way to apply unprincipled point fixes as problems come up. Nothing here is going to be principled without language support for constant-time code. The best we can do is find things that are reasonably stable across optimizer changes and don't completely destroy performance, so we can apply point fixes as issues come up, until the language support is available.

To that end, the volatile thing seems quite undesirable as it'll spill things to memory when it works, and do nothing when it doesn't. (What volatile even means for local variables, I have no idea. I think it's mostly for setjmp and longjmp silliness, as an artifact of history.)

chjj commented 4 years ago

@davidben, yeah, I realize now that some of my thinking has been flawed here. I've been testing this with a ctgrind test suite to avoid some of the issues you point out, but there's probably no general way to account for how these functions will be called in practice.

Case in point: I actually realize now that the library I'm using to test this is relatively unique in that it calls fiat functions via function pointers assigned at runtime, which undoubtedly prevents inlining of any kind. While my library does that, some other library will likely call them directly. I imagine someone directly calling fiat_p256_selectznz in a situation where the optimizer could determine the bounds of arg1 (i.e. 0 or 1) could invalidate my last proposed "fix".
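
For illustration, a hypothetical direct caller (not from any of the libraries mentioned here): once the optimizer can see how the selector was produced, inlining re-establishes the 0/1 bound even without the boolean cast inside the select.

/* Hypothetical direct caller, for illustration only. After inlining
   fiat_p256_cmovznz_u64, the optimizer sees that (a == b) is 0 or 1 and is
   again free to turn the masked select into a cmov or a branch. */
static void select_if_equal(uint64_t* out, uint64_t a, uint64_t b,
                            uint64_t if_ne, uint64_t if_eq) {
  /* arg2 is returned when the selector is 0, arg3 when it is nonzero. */
  fiat_p256_cmovznz_u64(out, (fiat_p256_uint1)(a == b), if_ne, if_eq);
}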

the volatile thing seems quite undesirable as it'll spill things to memory when it works, and do nothing when it doesn't.

I see. I suppose the asm value barrier alone is the best stop-gap fix then.

until the language support is available

Are there any proposals for this currently? This seems like it would be a tricky thing to standardize.

kroeckx commented 4 years ago

So is there something that prevents this from getting fixed?

JasonGross commented 4 years ago

I've been holding off on this because it looked like a stopgap measure, but I'll go prepare a PR.