RISC-V can generate the carry in constant time using an SLTU instruction. WebAssembly has a similar instruction; I don't know whether it guarantees constant time or not.
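For the software fallback, the carry can be computed without any comparison at all. A minimal branchless sketch in Go, mirroring the approach taken in CL 170758 (a sketch, not necessarily the exact code):

```go
// add64 returns sum = x + y + carry and the carry-out. The carry-out is
// derived purely from the top bits of the operands and the sum, so there
// is no data-dependent branch for the hardware to mispredict.
func add64(x, y, carry uint64) (sum, carryOut uint64) {
	sum = x + y + carry
	carryOut = ((x & y) | ((x | y) &^ sum)) >> 63
	return
}
```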
The change for #31229 results in a negligible (if any) impact on performance. Also keep in mind that the software fallback would rarely be executed. Here is the benchstat output:
name old time/op new time/op delta
Add-4 1.44ns ± 0% 1.63ns ± 0% ~ (p=1.000 n=1+1)
Add32-4 1.08ns ± 0% 1.05ns ± 0% ~ (p=1.000 n=1+1)
Add64-4 1.46ns ± 0% 1.43ns ± 0% ~ (p=1.000 n=1+1)
Add64multiple-4 0.73ns ± 0% 0.99ns ± 0% ~ (p=1.000 n=1+1)
Sub-4 1.42ns ± 0% 1.42ns ± 0% ~ (all equal)
Sub32-4 1.08ns ± 0% 1.47ns ± 0% ~ (p=1.000 n=1+1)
Sub64-4 1.42ns ± 0% 1.47ns ± 0% ~ (p=1.000 n=1+1)
Sub64multiple-4 0.72ns ± 0% 0.72ns ± 0% ~ (p=1.000 n=1+1)
@smasher164 FYI, that benchmark looks to be testing the intrinsic path, not the pure Go fallback. When you test the pure Go version, run with a higher -count so benchstat can detect significant changes.
If carries are unpredictable (adding random numbers), then the constant time (branchless) code will likely be faster than the prior implementation because the carry branches will be poorly predicted.
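For contrast, a sketch of a branching fallback along the lines of the previous implementation (assuming the pre-#31229 shape):

```go
// add64Branchy computes the same add-with-carry, but the overflow checks
// may compile to data-dependent branches in the pure Go fallback, so its
// timing can leak whether the additions overflowed.
func add64Branchy(x, y, carry uint64) (sum, carryOut uint64) {
	yc := y + carry
	sum = x + yc
	if sum < x || yc < y {
		carryOut = 1
	}
	return
}
```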
name old time/op new time/op delta
Add-4 1.16ns ±11% 1.51ns ± 5% +30.52% (p=0.008 n=5+5)
Add32-4 1.08ns ± 0% 1.03ns ± 1% -4.86% (p=0.029 n=4+4)
Add64-4 1.09ns ± 1% 1.95ns ± 3% +79.23% (p=0.008 n=5+5)
Add64multiple-4 4.03ns ± 1% 4.55ns ±11% +13.07% (p=0.008 n=5+5)
Sub-4 1.08ns ± 1% 1.50ns ± 0% +38.17% (p=0.016 n=5+4)
Sub32-4 1.09ns ± 2% 1.53ns ±10% +40.26% (p=0.008 n=5+5)
Sub64-4 1.10ns ± 1% 1.47ns ± 1% +33.39% (p=0.008 n=5+5)
Sub64multiple-4 4.30ns ± 2% 4.08ns ± 4% -5.07% (p=0.032 n=5+5)
Add64 seems to be the most heavily impacted benchmark, while most of the others show a ~30% slowdown.
I prefer making the default secure, and if necessary offering a VariableTimeAdd which communicates the risk. This also has the advantage of letting us cross that bridge when and if we come to it. (That is, if it ever turns out a native implementation would actually be faster if not constant time.)
If Add64 is implemented as ADC on platforms which have such an instruction, there is nothing "variable" about it. I think "constant" communicates better that it is always a constant-time operation, no matter whether ADC or a fallback is used.
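For instance, chained uses of bits.Add64, as in the Add64multiple benchmark, should lower on amd64 to one ADD followed by ADCs, with the carry living in the flags register. A sketch (addVec4 is a hypothetical helper, similar to what math/big needs):

```go
import "math/bits"

// addVec4 adds two 4-word vectors, threading the carry through each
// bits.Add64 call; with the intrinsic there is no branch anywhere.
func addVec4(z, x, y *[4]uint64) (carry uint64) {
	z[0], carry = bits.Add64(x[0], y[0], 0)
	z[1], carry = bits.Add64(x[1], y[1], carry)
	z[2], carry = bits.Add64(x[2], y[2], carry)
	z[3], carry = bits.Add64(x[3], y[3], carry)
	return
}
```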
For someone doing a code review, it's reasonable to expect bits.Add to be constant time, and it requires prior knowledge to know that no, you should use bits.ConstantTimeAdd instead. Making bits.Add the constant-time one solves this problem. No one will be upset to find out that VariableTimeAdd is in fact sometimes constant time; the other way around is not true.
When there are both a safe and an unsafe API, the unsafe one must be the qualified one.
That is only true for crypto; constant time has no use in any other domain. From that perspective, it's reasonable to expect that bits.Add adds two numbers, constant time or not.
The choice of preserving semantics versus timing by default should depend on the chance that the fallback is executed. For example, the FMA intrinsic proposed in #25819 has a decent chance of hitting the fallback on x86, arm, mips, and mips64, and the FMA software fallback is orders of magnitude slower than a simple multiply and add. On the other hand, the chance of the bits.Add fallback being executed is much lower, given the add-with-carry instructions on most architectures, and the slowdown introduced by a constant-time fallback is in the worst case a factor of two.
I find it appropriate that the FMA functions live in math, while Add, Sub and Mul live in math/bits.
That allows us to build the expectation that math/bits is constant time unless the name is screaming otherwise, which is not true of math, or math/big.
I am firmly convinced that even if safety is only relevant to one use case, the default should be safe, and an unsafe version can be added with a clear name. So if anyone thinks it's already time to add VariableTimeAdd, VariableTimeSub and VariableTimeMul, we can make this proposal about that. Otherwise, I think we should close this for now.
@FiloSottile operations in math/bits are not documented as being constant time so I did not have that expectation, though it makes sense now that you say it. Should it be mentioned in the package documentation?
@smasher164 The chance of hitting a fallback might change over time, as hardware advances. I'm thinking about FMA the other way around: the compiler should do with a*b+c whatever it pleases, but when explicitly asked for FMA it should do just one rounding. Meaning Add should do whatever is needed to make it fast, and ConstantTimeAdd would guarantee what it says.
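To make the one-rounding point concrete, here is a sketch using the math.FMA signature proposed in #25819 (the values are chosen so that a*b is not exactly representable):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Exact result: (1+2^-27)^2 - 1 = 2^-26 + 2^-54, which is
	// representable in a float64.
	a, b, c := 1+0x1p-27, 1+0x1p-27, -1.0
	fmt.Println(math.FMA(a, b, c)) // one rounding: 2^-26 + 2^-54
	// Two roundings: a*b rounds to 1+2^-26, dropping the 2^-54 term.
	// (On architectures where the compiler fuses a*b+c, the results agree.)
	fmt.Println(a*b + c)
}
```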
@FiloSottile Obviously we don't agree on this one, but I think you want to close it way too early, given how controversial the issue is (based on the number of up- and down-votes). I would also find it interesting to hear what other core devs think about this.
Regarding whether to call something "constant time" or "variable time": I think the most common use case should take naming precedence over the special case. For example, a "hash function" is usually called just that, as evidenced on Wikipedia, whereas to signify the importance of other properties one usually says "cryptographic hash function". Meaning Add should do whatever the compiler pleases to make it performant, and only a "crypto" Add should guarantee constant-time operation.
So far we have only discussed Add and Mul, but there is also Div. At least on x86, the DIVQ instruction has variable latency, so I would assume Div always needs to call a fallback in order to preserve constant-time behavior.
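To illustrate what such a fallback would cost, a data-independent division has to do something like bit-at-a-time restoring division, running all 64 iterations regardless of the inputs. A rough sketch (ctDiv is a hypothetical helper, not a proposed API; it reuses the borrow trick from the Sub64 fallback):

```go
// ctDiv computes q = x/y and r = x%y in time independent of the data,
// via restoring division with a branchless conditional subtract.
// Behavior is undefined for y == 0 (it returns garbage rather than
// panicking, unlike a real Div).
func ctDiv(x, y uint64) (q, r uint64) {
	for i := 63; i >= 0; i-- {
		hi := r >> 63 // bit shifted out of the 64-bit remainder
		r = r<<1 | (x>>uint(i))&1
		diff := r - y
		// borrow == 1 iff r < y, same formula as the Sub64 fallback.
		borrow := ((^r & y) | (^(r ^ y) & diff)) >> 63
		geq := hi | (1 - borrow) // is the 65-bit remainder >= y?
		mask := -geq             // all ones to subtract, zero to keep
		r = (diff & mask) | (r &^ mask)
		q |= geq << uint(i)
	}
	return
}
```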
One of the goals of #21835, as @robpike stated, is to "make math/bits support its operations more efficiently through the compiler". This would make it less performant, not more. In the same thread, @josharian implemented PCG using math/bits, and there is no need for a constant-time Add there, as the generator is not intended for crypto in the first place.
The chance of hitting a fallback might change over time, as hardware advances.
That chance is only going to go down. Once the intrinsic is implemented, it isn't going to be reverted (unless an instruction is removed from an instruction set, which never happens).
I would also find it interesting to hear what other core devs think about this.
I see this as pretty much a non-issue. Practically speaking, no one is using fallbacks (and if they are, it isn't hard to contribute intrinsics). Processors today need to provide constant-time implementations of these operations for crypto libraries anyway, we might as well use them. And if you look at, say, the assembly code for math/big, all that assembly uses those constant-time implementations - there isn't a faster non-constant-time version that math/big feels like it needs to use.
Sure, theoretically, a new faster non-constant-time multiply might show up in x86 next year, or some new popular processor architecture might show up that only has non-constant-time ops. But both of those seem to me like really unlikely events that we shouldn't design our systems around.
So far we have only discussed Add and Mul, but there is also Div. At least on x86, the DIVQ instruction has variable latency, so I would assume Div always needs to call a fallback in order to preserve constant-time behavior.
I agree that div is a special case, and we probably don't want to (can't?) guarantee constant time.
Or, let the constant-time operations live in crypto. There is already hash/fnv and there is crypto/sha1, math/rand and crypto/rand. And of course crypto/subtle, with ConstantTimeCompare, ConstantTimeCopy, ConstantTimeLessOrEq, ... There could also very well be math/bits and crypto/bits, as they similarly serve different purposes. Or they could share the space in subtle with the other ConstantTime* functions.
"Security" is not a good enough reason to penalize real Go code with an 80% slowdown (as in the Add64 in the CL).
That said, the intention from the start was that you'd get intrinsics in any real Go code. Any time the fallback is too slow the answer is certainly "get intrinsics added on that architecture." Slowing down the fallback by 80% to match the constant-time semantics of the intrinsics is OK. But should we define that the intrinsics are constant-time?
The open question then is whether defining these to be constant time would fundamentally slow down any intrinsic implementation (perhaps one yet to be written). It doesn't seem like it should. Can someone confirm that intrinsics would not be affected at all (even on other systems yet to be implemented) if we change the definition of math/bits to be constant-time for these operations?
Thanks.
I can confirm that all of our architectures have instructions that can be used for constant-time add and sub (with the caveat that it's hard to tell whether wasm ops are really constant time). Those are the same instructions as would be used for cases that didn't need constant time (in math/big, say).
x86 intrinsics do constant time mul. I don't know for sure about other architectures, but my suspicion is they do. Again, these are the same instructions as would be used in non-constant-time situations. (If mul wasn't constant time - what would crypto libraries use?)
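For what it's worth, a multiply fallback can also be branchless: schoolbook 32×32-bit partial products, which is roughly the shape of the current pure Go Mul64:

```go
// mul64 returns the 128-bit product of x and y as (hi, lo), built from
// four 32x32->64 partial products; every step is straight-line arithmetic.
func mul64(x, y uint64) (hi, lo uint64) {
	const mask32 = 1<<32 - 1
	x0, x1 := x&mask32, x>>32
	y0, y1 := y&mask32, y>>32
	w0 := x0 * y0
	t := x1*y0 + w0>>32
	w1, w2 := t&mask32, t>>32
	w1 += x0 * y1
	hi = x1*y1 + w2 + w1>>32
	lo = x * y
	return
}
```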
The slowdown is a lot less than 80%. That number might be old, before some optimizations went in. Disabling intrinsics and comparing CL 170758 before and after, I get much more equivocal results:
name old time/op new time/op delta
Add-8 1.04ns ± 1% 1.30ns ± 1% +25.79% (p=0.000 n=9+10)
Add32-8 1.04ns ± 0% 0.79ns ± 1% -24.44% (p=0.000 n=7+9)
Add64-8 1.04ns ± 1% 1.31ns ± 0% +25.48% (p=0.000 n=10+8)
Add64multiple-8 4.52ns ± 2% 2.88ns ± 1% -36.24% (p=0.000 n=10+8)
Sub-8 1.04ns ± 2% 1.30ns ± 1% +24.23% (p=0.000 n=10+10)
Sub32-8 1.04ns ± 0% 1.04ns ± 0% ~ (all samples are equal)
Sub64-8 1.05ns ± 0% 1.04ns ± 0% -0.95% (p=0.006 n=7+9)
Sub64multiple-8 4.29ns ± 3% 2.85ns ± 0% -33.50% (p=0.000 n=9+8)
Those benchmarks also favor the branching implementations, because in them the branch is perfectly predictable. In real-world uses, the branch is effectively random.
"Security" is not a good enough reason to penalize real Go code with an 80% slowdown (as in the Add64 in the CL).
@rsc Again, I am not arguing not to provide the fast option. My argument is that if we need to have two intrinsics, one constant time and one variable time, the default one should be the safe one, and the fast and unsafe one should be the qualified one, not the other way around.
If you have a "slow" Add in a hot loop, the profiler will tell you, and you can go replace it with a faster unsafe version if appropriate. If you have a variable time Add in a crypto implementation, nothing will tell you.
Anyway, this sounds moot because it looks like there are not in fact two classes of intrinsics.
I think the point here is not just about Add, Mul or Div.
I guess no one is worried about the speed of a constant-time Add or Mul. And Div is not the only function whose constant-time algorithm is non-trivial.
The question is whether math/bits should be guaranteed to be constant-time.
There are some options:

1. All functions in math/bits are constant-time. The downside is that some functions like Div will be very slow. This is the point of the issue.
2. Some functions in math/bits are constant-time and some are not. It will be documented.
3. Nothing in math/bits is guaranteed to be constant-time.
4. There are Div and ConstantTimeDiv. And there is only one Add.
5. There are VariableTimeDiv and Div. And there is only one Add.
6. Every function has two versions in math/bits. There will be Add and ConstantTimeAdd.
7. Every function has two versions in math/bits. There will be VariableTimeAdd and Add.

Obviously, 1 is not OK.
2 and 4 are not safe.
For 5, almost everyone will use bits.VariableTimeDiv, which is kind of weird. And I need to remember that there is a bits.VariableTimeDiv but there isn't a bits.VariableTimeAdd.
I think 6 and 7 are worse than two packages, maybe math/bits and crypto/bits.
So I think 3 is the acceptable option. Maybe a new package crypto/bits will be added later.
If you have a variable time Add in a crypto implementation, nothing will tell you.
There are some discussions about how to test whether the code is constant-time. https://github.com/golang/go/issues/20654#issuecomment-313798127
@rsc I double-checked with Agner Fog's "Instruction tables" and can confirm that there is no constant-time 64-bit divide instruction in any of the recent microarchitectures from Intel or AMD, nor in the ARM Cortex-A57/A72 according to their "Software Optimization Guide", which includes the following:
Integer divides are performed using a iterative algorithm and block any subsequent divide operations until complete. Early termination is possible, depending upon the data values.
One option would be to have a separate constant-time package.
I feel like we are solving a problem that does not exist yet. It looks like all current architectures have fast, constant-time instructions for implementing the intrinsics. That should make the fallbacks nearly irrelevant, so their performance shouldn't matter.
I would be more than happy with a package comment like the following
// All operations except Div are implemented in constant time.
// Future additions might not be constant time.
This avoids locking ourselves into an answer for now, so we can make a decision later with more information if something changes, instead of speculating. And in the meantime no cryptography implementor expects division to be constant time anyway.
One either guarantees that a function or package is constant time, or one does not. "Future additions might" means that backward compatibility could be broken, which of course cannot be allowed. The very issue is about "locking ourselves into an answer", and a much better solution is to let crypto functions live in a crypto package.
Actually, this constant-time algorithm in math/bits was discussed and rejected before:
https://github.com/golang/go/issues/18616#issuecomment-273919274
The reason for the rejection is simple: math/bits is for performance, but generally the constant-time algorithm is not the fast algorithm.
Besides Div, the constant-time algorithm for LeadingZeros is not trivial. I don't know whether it is faster or slower than the current one, but it is also an issue.
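For illustration, one classic branchless formulation (a sketch, not the current fallback) smears the highest set bit downward and then counts ones:

```go
import "math/bits"

// ctLeadingZeros64 is a hypothetical branchless LeadingZeros64. It is
// data-independent as long as the popcount is (the pure Go OnesCount64
// fallback is branchless bit arithmetic).
func ctLeadingZeros64(x uint64) int {
	x |= x >> 1
	x |= x >> 2
	x |= x >> 4
	x |= x >> 8
	x |= x >> 16
	x |= x >> 32
	return 64 - bits.OnesCount64(x)
}
```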
As for constant-time cryptography code, I think it is much more important to have a tool to test whether the code is constant-time.
With such a tool, anyone can use any library to write the code and verify that it is constant-time.
Without such a tool, a constant-time math/bits isn't very helpful, because it is still very difficult to verify that the code for an entire encryption scheme is constant-time.
And this issue is an example of what https://github.com/golang/go/issues/17373 is trying to solve.
the Go 1 compatibility makes designing good abstraction pretty hard, and because the nature of this problem, we can't design the API in a package in a subrepo (e.g. x/simd) like context or http2 and later move it into std when it's mature.
Because user-defined assembly functions can't be intrinsified, the code must be placed in the stdlib.
Then it is controversial because it becomes a compatibility issue.
As far as I know, making a tool to ensure constant-time code is an unsolved problem in cryptography engineering, at least in the general case. The attempts I know about are also dynamic, not static, so not really an option for cross-platform development.
Also, the Go 1 Compatibility Promise applies to functions, not packages, so "Future additions might not be constant time." is not at all a violation, as long as the functions that exist stay constant time.
I agree that it is not a violation. But I don't think it is a good idea to have "Future additions might not be constant time" in the documentation.
And the compatibility promise means the functions must be constant-time forever.
If someone comes up with a faster LeadingZeros algorithm, the problem becomes: is it acceptable to have a new, faster VariableTimeLeadingZeros?
It doesn't sound like we should say "All but these" are constant time. Enumerate the ones we want to be constant-time, document each one as such, and then don't worry about having to say "Future ones may not fit the blanket rule." It doesn't seem like LeadingZeros64 is particularly easy to do in constant time, for example (absent hardware support).
It would be fine to just start with Add, Sub, Mul (certainly not Div). Filippo, are there others that are important?
Sounds good to me. (Including sized variants, of course.)
We definitely use RotateLeft, too, and I can see a use for ReverseBytes. Their fallbacks are already constant time.
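For reference, a branchless rotation in the shape of the current fallback:

```go
// rotateLeft64 rotates x left by k&63 bits. Masking the count keeps
// both shift amounts in range, so no bounds check or data-dependent
// branch is needed (a shift by 64 is defined as 0 in Go).
func rotateLeft64(x uint64, k int) uint64 {
	const n = 64
	s := uint(k) & (n - 1)
	return x<<s | x>>(n-s)
}
```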
OK, then let's start with Add, Sub, Mul, RotateLeft, ReverseBytes for now (including sized variants). Accepted.
I removed the Documentation label because there is code work to do too in the fallbacks. I milestoned it Go 1.14, but if it goes into Go 1.13 that's fine too.
It seems that the ReverseBytes and RotateLeft fallbacks are already constant time. Should the change then modify Add and Sub, while documenting Add, Sub, Mul, RotateLeft, and ReverseBytes to be constant time?
Change https://golang.org/cl/170758 mentions this issue: math/bits: make Add and Sub fallbacks constant time
Ah, right, reopening to document the property. I'll send a CL tomorrow.
Change https://golang.org/cl/178177 mentions this issue: math/bits: document that Add, Sub, Mul, RotateLeft, ReverseBytes are constant time
I'm a heavy non-crypto user of bits.* and I'm afraid changes like #31229 will affect my performance. One way of keeping both the crypto people and users like me happy would be to have separate functions for non-constant-time and constant-time addition, multiplication, and so on.
In the issue above I was reminded that the change affects only the fallbacks. Still, I'm afraid that promising the fallbacks are constant time will later lead to constant-time native implementations, which could very well be slower than their non-constant-time alternatives. For example, RISC-V and WebAssembly are platforms without a carry flag, and it could happen that "a branch works out better" [1].
[1] https://gmplib.org/manual/Assembly-Carry-Propagation.html