RISC-V can generate the carry in constant time using an SLTU instruction. WebAssembly has a similar instruction; I don't know whether it guarantees constant time or not.
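For the software fallback, the carry can be computed without any comparison at all. A minimal branchless sketch in Go, mirroring the approach taken in CL 170758 (a sketch, not necessarily the exact code):

```go
// add64 returns sum = x + y + carry and the carry-out. The carry-out is
// derived purely from the top bits of the operands and the sum, so there
// is no data-dependent branch for the hardware to mispredict.
func add64(x, y, carry uint64) (sum, carryOut uint64) {
	sum = x + y + carry
	carryOut = ((x & y) | ((x | y) &^ sum)) >> 63
	return
}
```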
The change for #31229 results in a negligible (if any) impact on performance. Also keep in mind that the software fallback would rarely be executed. Here is the benchstat output:
name old time/op new time/op delta
Add-4 1.44ns ± 0% 1.63ns ± 0% ~ (p=1.000 n=1+1)
Add32-4 1.08ns ± 0% 1.05ns ± 0% ~ (p=1.000 n=1+1)
Add64-4 1.46ns ± 0% 1.43ns ± 0% ~ (p=1.000 n=1+1)
Add64multiple-4 0.73ns ± 0% 0.99ns ± 0% ~ (p=1.000 n=1+1)
Sub-4 1.42ns ± 0% 1.42ns ± 0% ~ (all equal)
Sub32-4 1.08ns ± 0% 1.47ns ± 0% ~ (p=1.000 n=1+1)
Sub64-4 1.42ns ± 0% 1.47ns ± 0% ~ (p=1.000 n=1+1)
Sub64multiple-4 0.72ns ± 0% 0.72ns ± 0% ~ (p=1.000 n=1+1)
@smasher164 FYI, that benchmark looks to be testing the intrinsic path, not the pure Go fallback. When you test the pure Go version, run with a higher -count so benchstat can detect significant changes.
If carries are unpredictable (adding random numbers), then the constant time (branchless) code will likely be faster than the prior implementation because the carry branches will be poorly predicted.
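For contrast, a sketch of a branching fallback along the lines of the previous implementation (assuming the pre-#31229 shape):

```go
// add64Branchy computes the same add-with-carry, but the overflow checks
// may compile to data-dependent branches in the pure Go fallback, so its
// timing can leak whether the additions overflowed.
func add64Branchy(x, y, carry uint64) (sum, carryOut uint64) {
	yc := y + carry
	sum = x + yc
	if sum < x || yc < y {
		carryOut = 1
	}
	return
}
```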
name old time/op new time/op delta
Add-4 1.16ns ±11% 1.51ns ± 5% +30.52% (p=0.008 n=5+5)
Add32-4 1.08ns ± 0% 1.03ns ± 1% -4.86% (p=0.029 n=4+4)
Add64-4 1.09ns ± 1% 1.95ns ± 3% +79.23% (p=0.008 n=5+5)
Add64multiple-4 4.03ns ± 1% 4.55ns ±11% +13.07% (p=0.008 n=5+5)
Sub-4 1.08ns ± 1% 1.50ns ± 0% +38.17% (p=0.016 n=5+4)
Sub32-4 1.09ns ± 2% 1.53ns ±10% +40.26% (p=0.008 n=5+5)
Sub64-4 1.10ns ± 1% 1.47ns ± 1% +33.39% (p=0.008 n=5+5)
Sub64multiple-4 4.30ns ± 2% 4.08ns ± 4% -5.07% (p=0.032 n=5+5)
Add64 seems to be the most heavily impacted benchmark, while most of the others show a ~30% slowdown.
I prefer making the default secure, and if necessary offering a VariableTimeAdd which communicates the risk. This also has the advantage of letting us cross that bridge when and if we come to it. (That is, if it ever turns out a native implementation would actually be faster if not constant time.)
If Add64 is implemented as ADC on platforms which have such an instruction, there is nothing "variable" about it. I think "constant" communicates better that it is always a constant-time operation, no matter whether ADC or a fallback is used.
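For instance, chained uses of bits.Add64, as in the Add64multiple benchmark, should lower on amd64 to one ADD followed by ADCs, with the carry living in the flags register. A sketch (addVec4 is a hypothetical helper, similar to what math/big needs):

```go
import "math/bits"

// addVec4 adds two 4-word vectors, threading the carry through each
// bits.Add64 call; with the intrinsic there is no branch anywhere.
func addVec4(z, x, y *[4]uint64) (carry uint64) {
	z[0], carry = bits.Add64(x[0], y[0], 0)
	z[1], carry = bits.Add64(x[1], y[1], carry)
	z[2], carry = bits.Add64(x[2], y[2], carry)
	z[3], carry = bits.Add64(x[3], y[3], carry)
	return
}
```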
For someone doing a code review, it's reasonable to expect bits.Add to be constant time, and it requires prior knowledge to know that no, you should use bits.ConstantTimeAdd instead. Making bits.Add the constant-time one solves this problem. No one will be upset to find out that VariableTimeAdd is in fact sometimes constant time; the other way around is not true.
When there are both a safe and an unsafe API, the unsafe one must be the qualified one.
That is only true for crypto; constant time has no use in any other domain. From that perspective, it's reasonable to expect that bits.Add adds two numbers, constant time or not.
The choice of preserving semantics versus timing by default should depend on the chance that the fallback is executed. For example, the FMA intrinsic proposed in #25819 has a decent chance of hitting the fallback on x86, arm, mips, and mips64, and the FMA software fallback is orders of magnitude slower than a simple multiply and add. On the other hand, the chance of the bits.Add fallback being executed is much lower, given the add-with-carry instructions on most architectures, and the slowdown introduced by a constant-time fallback is in the worst case a factor of two.
I find it appropriate that the FMA functions live in math, while Add, Sub and Mul live in math/bits.
That allows us to build the expectation that math/bits is constant time unless the name is screaming otherwise, which is not true of math, or math/big.
I am firmly convinced that even if safety is only relevant to one use case, the default should be safe, and an unsafe version can be added with a clear name. So if anyone thinks it's already time to add VariableTimeAdd, VariableTimeSub and VariableTimeMul, we can make this proposal about that. Otherwise, I think we should close this for now.
@FiloSottile operations in math/bits are not documented as being constant time so I did not have that expectation, though it makes sense now that you say it. Should it be mentioned in the package documentation?
@smasher164 The chance of hitting a fallback might change over time, as hardware advances. I'm thinking about FMA the other way around: the compiler should do with a*b+c whatever it pleases, but when explicitly asked for FMA it should do just one rounding. Meaning Add should do whatever is needed to make it fast, and ConstantTimeAdd would guarantee what it says.
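To make the one-rounding point concrete, here is a sketch using the math.FMA signature proposed in #25819 (the values are chosen so that a*b is not exactly representable):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Exact result: (1+2^-27)^2 - 1 = 2^-26 + 2^-54, which is
	// representable in a float64.
	a, b, c := 1+0x1p-27, 1+0x1p-27, -1.0
	fmt.Println(math.FMA(a, b, c)) // one rounding: 2^-26 + 2^-54
	// Two roundings: a*b rounds to 1+2^-26, dropping the 2^-54 term.
	// (On architectures where the compiler fuses a*b+c, the results agree.)
	fmt.Println(a*b + c)
}
```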
@FiloSottile Obviously we don't agree on this one, but I think you want to close it way too early, given how controversial the issue is (based on the number of up- and down-votes). I would also find it interesting to hear what other core devs think about this.
Regarding whether to call something "constant time" or "variable time": I think the most common use case should take naming precedence over the special case. For example, a "hash function" is usually called just that, as evidenced on Wikipedia, whereas to signify the importance of other properties one usually says "cryptographic hash function". Meaning Add should do whatever the compiler pleases to make it performant, and only a "crypto" Add should guarantee constant-time operation.
So far we have only discussed Add and Mul, but there is also Div. At least on x86, the DIVQ instruction has variable latency, so I would assume Div always needs to call a fallback in order to preserve constant-time behavior.
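To illustrate what such a fallback would cost, a data-independent division has to do something like bit-at-a-time restoring division, running all 64 iterations regardless of the inputs. A rough sketch (ctDiv is a hypothetical helper, not a proposed API; it reuses the borrow trick from the Sub64 fallback):

```go
// ctDiv computes q = x/y and r = x%y in time independent of the data,
// via restoring division with a branchless conditional subtract.
// Behavior is undefined for y == 0 (it returns garbage rather than
// panicking, unlike a real Div).
func ctDiv(x, y uint64) (q, r uint64) {
	for i := 63; i >= 0; i-- {
		hi := r >> 63 // bit shifted out of the 64-bit remainder
		r = r<<1 | (x>>uint(i))&1
		diff := r - y
		// borrow == 1 iff r < y, same formula as the Sub64 fallback.
		borrow := ((^r & y) | (^(r ^ y) & diff)) >> 63
		geq := hi | (1 - borrow) // is the 65-bit remainder >= y?
		mask := -geq             // all ones to subtract, zero to keep
		r = (diff & mask) | (r &^ mask)
		q |= geq << uint(i)
	}
	return
}
```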
One of the goals of #21835, as @robpike stated, is to "make math/bits support its operations more efficiently through the compiler". This would make it less performant, not more. In the same thread, @josharian implemented PCG using math/bits, and there is no need for a constant-time Add there, as the generator is not intended for crypto in the first place.
The chance of hitting a fallback might change over time, as hardware advances.
That chance is only going to go down. Once the intrinsic is implemented, it isn't going to be reverted (unless an instruction is removed from an instruction set, which never happens).
I would also find it interesting to hear what other core devs think about this.
I see this as pretty much a non-issue. Practically speaking, no one is using fallbacks (and if they are, it isn't hard to contribute intrinsics). Processors today need to provide constant-time implementations of these operations for crypto libraries anyway, we might as well use them. And if you look at, say, the assembly code for math/big, all that assembly uses those constant-time implementations - there isn't a faster non-constant-time version that math/big feels like it needs to use.
Sure, theoretically, a new faster non-constant-time multiply might show up in x86 next year, or some new popular processor architecture might show up that only has non-constant-time ops. But both of those seem to me like really unlikely events that we shouldn't design our systems around.
So far we have only discussed Add and Mul, but there is also Div. At least on x86, the DIVQ instruction has variable latency, so I would assume Div always needs to call a fallback in order to preserve constant-time behavior.
I agree that div is a special case, and we probably don't want to (can't?) guarantee constant time.
Or, let the constant-time operations live in crypto. There is already hash/fnv and there is crypto/sha1, math/rand and crypto/rand. And of course crypto/subtle, with ConstantTimeCompare, ConstantTimeCopy, ConstantTimeLessOrEq, ... There could also very well be math/bits and crypto/bits, as they similarly serve different purposes. Or they could share the space in subtle with the other ConstantTime* functions.
"Security" is not a good enough reason to penalize real Go code with an 80% slowdown (as in the Add64 in the CL).
That said, the intention from the start was that you'd get intrinsics in any real Go code. Any time the fallback is too slow the answer is certainly "get intrinsics added on that architecture." Slowing down the fallback by 80% to match the constant-time semantics of the intrinsics is OK. But should we define that the intrinsics are constant-time?
The open question then is whether defining these to be constant time would fundamentally slow down any intrinsic implementation (perhaps one yet to be written). It doesn't seem like it should. Can someone confirm that intrinsics would not be affected at all (even on other systems yet to be implemented) if we change the definition of math/bits to be constant-time for these operations?
Thanks.
I can confirm that all of our architectures have instructions that can be used for constant-time add and sub (with the caveat that it's hard to tell whether wasm ops are really constant time). Those are the same instructions as would be used for cases that didn't need constant time (in math/big, say).
x86 intrinsics do constant time mul. I don't know for sure about other architectures, but my suspicion is they do. Again, these are the same instructions as would be used in non-constant-time situations. (If mul wasn't constant time - what would crypto libraries use?)
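For what it's worth, a multiply fallback can also be branchless: schoolbook 32×32-bit partial products, which is roughly the shape of the current pure Go Mul64:

```go
// mul64 returns the 128-bit product of x and y as (hi, lo), built from
// four 32x32->64 partial products; every step is straight-line arithmetic.
func mul64(x, y uint64) (hi, lo uint64) {
	const mask32 = 1<<32 - 1
	x0, x1 := x&mask32, x>>32
	y0, y1 := y&mask32, y>>32
	w0 := x0 * y0
	t := x1*y0 + w0>>32
	w1, w2 := t&mask32, t>>32
	w1 += x0 * y1
	hi = x1*y1 + w2 + w1>>32
	lo = x * y
	return
}
```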
The slowdown is a lot less than 80%. That number might be old, before some optimizations went in. Disabling intrinsics and comparing CL 170758 before and after, I get much more equivocal results:
name old time/op new time/op delta
Add-8 1.04ns ± 1% 1.30ns ± 1% +25.79% (p=0.000 n=9+10)
Add32-8 1.04ns ± 0% 0.79ns ± 1% -24.44% (p=0.000 n=7+9)
Add64-8 1.04ns ± 1% 1.31ns ± 0% +25.48% (p=0.000 n=10+8)
Add64multiple-8 4.52ns ± 2% 2.88ns ± 1% -36.24% (p=0.000 n=10+8)
Sub-8 1.04ns ± 2% 1.30ns ± 1% +24.23% (p=0.000 n=10+10)
Sub32-8 1.04ns ± 0% 1.04ns ± 0% ~ (all samples are equal)
Sub64-8 1.05ns ± 0% 1.04ns ± 0% -0.95% (p=0.006 n=7+9)
Sub64multiple-8 4.29ns ± 3% 2.85ns ± 0% -33.50% (p=0.000 n=9+8)
Those benchmarks also favor the branching implementations, because in them the branch is perfectly predictable. In real-world uses, the branch is effectively random.
"Security" is not a good enough reason to penalize real Go code with an 80% slowdown (as in the Add64 in the CL).
@rsc Again, I am not arguing not to provide the fast option. My argument is that if we need to have two intrinsics, one constant time and one variable time, the default one should be the safe one, and the fast and unsafe one should be the qualified one, not the other way around.
If you have a "slow" Add in a hot loop, the profiler will tell you, and you can go replace it with a faster unsafe version if appropriate. If you have a variable time Add in a crypto implementation, nothing will tell you.
Anyway, this sounds moot because it looks like there are not in fact two classes of intrinsics.
I think the point here is not just about Add, Mul or Div.
I guess no one is worried about the speed of a constant-time Add or Mul. And Div is not the only function whose constant-time algorithm is non-trivial.
The question is whether math/bits should be guaranteed to be constant-time.
There are some options:

1. All functions in math/bits are constant-time. The downside is that some functions like Div will be very slow. This is the point of the issue.
2. Some functions in math/bits are constant-time and some are not. It will be documented.
3. Nothing in math/bits is guaranteed to be constant-time.
4. There are Div and ConstantTimeDiv. And there is only one Add.
5. There are VariableTimeDiv and Div. And there is only one Add.
6. Every function has two versions in math/bits. There will be Add and ConstantTimeAdd.
7. Every function has two versions in math/bits. There will be VariableTimeAdd and Add.

Obviously, 1 is not OK.
2 and 4 are not safe.
For 5, almost everyone will use bits.VariableTimeDiv, which is kind of weird. And I need to remember that there is a bits.VariableTimeDiv but there isn't a bits.VariableTimeAdd.
I think 6 and 7 are worse than two packages, maybe math/bits and crypto/bits.
So I think 3 is the acceptable option. Maybe a new package crypto/bits will be added later.
If you have a variable time Add in a crypto implementation, nothing will tell you.
There are some discussions about how to test whether the code is constant-time. https://github.com/golang/go/issues/20654#issuecomment-313798127
@rsc I double-checked with Agner Fog's "Instruction tables" and can confirm that there is no constant-time 64-bit divide instruction in any of the recent microarchitectures from Intel or AMD, nor in the ARM Cortex-A57/A72 according to their "Software Optimization Guide", which includes the following:
Integer divides are performed using a iterative algorithm and block any subsequent divide operations until complete. Early termination is possible, depending upon the data values.
One option would be to have a separate constant-time package.
I feel like we are solving a problem that does not exist yet. It looks like all current architectures have fast, constant-time instructions for implementing the intrinsics. That should make the fallbacks nearly irrelevant, so their performance shouldn't matter.
I would be more than happy with a package comment like the following
// All operations except Div are implemented in constant time.
// Future additions might not be constant time.
This avoids locking ourselves into an answer for now, so we can make a decision later with more information if something changes, instead of speculating. And in the meantime no cryptography implementor expects division to be constant time anyway.
One either guarantees that a function or package is constant time, or one does not. "Future additions might" means that backward compatibility could be broken, which of course cannot be allowed. The very issue is about "locking ourselves into an answer", and a much better solution is to let crypto functions live in a crypto package.
Actually, this constant-time algorithm in math/bits was discussed and rejected before:
https://github.com/golang/go/issues/18616#issuecomment-273919274
The reason for the rejection is simple: math/bits is for performance, but generally the constant-time algorithm is not the fast algorithm.
Besides Div, the constant-time algorithm for LeadingZeros is not trivial. I don't know whether it is faster or slower than the current one, but it is also an issue.
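For illustration, one classic branchless formulation (a sketch, not the current fallback) smears the highest set bit downward and then counts ones:

```go
import "math/bits"

// ctLeadingZeros64 is a hypothetical branchless LeadingZeros64. It is
// data-independent as long as the popcount is (the pure Go OnesCount64
// fallback is branchless bit arithmetic).
func ctLeadingZeros64(x uint64) int {
	x |= x >> 1
	x |= x >> 2
	x |= x >> 4
	x |= x >> 8
	x |= x >> 16
	x |= x >> 32
	return 64 - bits.OnesCount64(x)
}
```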
As for constant-time cryptography code, I think it is much more important to have a tool to test whether the code is constant-time.
With such a tool, anyone can use any library to write the code and verify that it is constant-time.
Without such a tool, a constant-time math/bits isn't very helpful, because it is still very difficult to verify that the code for an entire encryption scheme is constant-time.
And this issue is an example of what https://github.com/golang/go/issues/17373 is trying to solve.
the Go 1 compatibility makes designing good abstraction pretty hard, and because the nature of this problem, we can't design the API in a package in a subrepo (e.g. x/simd) like context or http2 and later move it into std when it's mature.
Because user-defined assembly functions can't be intrinsified, the code must be placed in the stdlib.
Then it is controversial because it becomes a compatibility issue.
As far as I know, making a tool to ensure constant-time code is an unsolved problem in cryptography engineering, at least in the general case. The attempts I know about are also dynamic, not static, so not really an option for cross-platform development.
Also, the Go 1 Compatibility Promise applies to functions, not packages, so "Future additions might not be constant time." is not at all a violation, as long as the functions that exist stay constant time.
I agree that it is not a violation. But I don't think it is a good idea to have "Future additions might not be constant time" in the documentation.
And the compatibility promise means the functions must be constant-time forever.
If someone comes up with a faster LeadingZeros algorithm, the problem becomes: is it acceptable to have a new, faster VariableTimeLeadingZeros?
It doesn't sound like we should say "All but these" are constant time. Enumerate the ones we want to be constant-time, document each one as such, and then don't worry about having to say "Future ones may not fit the blanket rule." It doesn't seem like LeadingZeros64 is particularly easy to do in constant time, for example (absent hardware support).
It would be fine to just start with Add, Sub, Mul (certainly not Div). Filippo, are there others that are important?
Sounds good to me. (Including sized variants, of course.)
We definitely use RotateLeft, too, and I can see a use for ReverseBytes. Their fallbacks are already constant time.
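For reference, a branchless rotation in the shape of the current fallback:

```go
// rotateLeft64 rotates x left by k&63 bits. Masking the count keeps
// both shift amounts in range, so no bounds check or data-dependent
// branch is needed (a shift by 64 is defined as 0 in Go).
func rotateLeft64(x uint64, k int) uint64 {
	const n = 64
	s := uint(k) & (n - 1)
	return x<<s | x>>(n-s)
}
```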
OK, then let's start with Add, Sub, Mul, RotateLeft, ReverseBytes for now (including sized variants). Accepted.
I removed the Documentation label because there is code work to do too in the fallbacks. I milestoned it Go 1.14, but if it goes into Go 1.13 that's fine too.
It seems that the ReverseBytes and RotateLeft fallbacks are already constant time. Should the change then modify Add and Sub, while documenting Add, Sub, Mul, RotateLeft, and ReverseBytes to be constant time?
Change https://golang.org/cl/170758 mentions this issue: math/bits: make Add and Sub fallbacks constant time
Ah, right, reopening to document the property. I'll send a CL tomorrow.
Change https://golang.org/cl/178177 mentions this issue: math/bits: document that Add, Sub, Mul, RotateLeft, ReverseBytes are constant time
I'm a heavy non-crypto user of bits.* and I'm afraid changes like #31229 will affect my performance. One way of keeping both the crypto people and users like me happy would be to have separate functions for non-constant-time and constant-time addition, multiplication, and so on.
In the issue above I was reminded that the change affects only the fallbacks. Still, I'm afraid that promising the fallbacks are constant time will later lead to constant-time native implementations, which could very well be slower than their non-constant-time alternatives. For example, RISC-V and WebAssembly are platforms without a carry flag, and it could happen that "a branch works out better" [1].
[1] https://gmplib.org/manual/Assembly-Carry-Propagation.html