mit-plv / fiat-crypto

Cryptographic Primitive Code Generation by Fiat
http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf

p448 32-bit uses int128 #797

Open JasonGross opened 4 years ago

JasonGross commented 4 years ago

There are a lot of `Z.combine_at_bitwidth` operations left over. Not sure why; I'll try to dig into this.

JasonGross commented 4 years ago

Ah, I've figured it out: it's because there are operations like

expr_let x962 := Z.combine_at_bitwidth(32, (uint32_t)x779, (uint32_t)x780) + Z.combine_at_bitwidth(32, (uint32_t)x958, (uint32_t)x961) (* : [0x0 ~> 0x1055c28e3f0a3d758] *) in

where the sum fails to fit within bounds.

JasonGross commented 4 years ago

See also https://github.com/mit-plv/fiat-crypto/pull/798

JasonGross commented 4 years ago

From discussions with @andres-erbsen, the reason is that our loose bounds are 3 times our tight bounds, which are 1.1 times the limb-width. (This is coded for in https://github.com/mit-plv/fiat-crypto/blob/99e2ae04406fe321f4f2a8ad95d12a8f053ad65e/src/UnsaturatedSolinasHeuristics.v#L57-L58 )

Andres said that a relevant paper for p448 is "Ed448-Goldilocks, a new elliptic curve" by Michael Hamburg:

If we adjust the loose bounds multiplier to 2, though, that doesn't work, because subtraction adds balance. In curve25519 x64, this is https://github.com/mit-plv/fiat-crypto/blob/99e2ae04406fe321f4f2a8ad95d12a8f053ad65e/fiat-c/src/curve25519_64.c#L292-L296 In p448 x32, this is

  uint32_t x1 = ((UINT32_C(0x1ffffffe) + (arg1[0])) - (arg2[0]));
  uint32_t x2 = ((UINT32_C(0x1ffffffe) + (arg1[1])) - (arg2[1]));
  uint32_t x3 = ((UINT32_C(0x1ffffffe) + (arg1[2])) - (arg2[2]));
  uint32_t x4 = ((UINT32_C(0x1ffffffe) + (arg1[3])) - (arg2[3]));
  uint32_t x5 = ((UINT32_C(0x1ffffffe) + (arg1[4])) - (arg2[4]));
  uint32_t x6 = ((UINT32_C(0x1ffffffe) + (arg1[5])) - (arg2[5]));
  uint32_t x7 = ((UINT32_C(0x1ffffffe) + (arg1[6])) - (arg2[6]));
  uint32_t x8 = ((UINT32_C(0x1ffffffe) + (arg1[7])) - (arg2[7]));
  uint32_t x9 = ((UINT32_C(0x1ffffffc) + (arg1[8])) - (arg2[8]));
  uint32_t x10 = ((UINT32_C(0x1ffffffe) + (arg1[9])) - (arg2[9]));
  uint32_t x11 = ((UINT32_C(0x1ffffffe) + (arg1[10])) - (arg2[10]));
  uint32_t x12 = ((UINT32_C(0x1ffffffe) + (arg1[11])) - (arg2[11]));
  uint32_t x13 = ((UINT32_C(0x1ffffffe) + (arg1[12])) - (arg2[12]));
  uint32_t x14 = ((UINT32_C(0x1ffffffe) + (arg1[13])) - (arg2[13]));
  uint32_t x15 = ((UINT32_C(0x1ffffffe) + (arg1[14])) - (arg2[14]));
  uint32_t x16 = ((UINT32_C(0x1ffffffe) + (arg1[15])) - (arg2[15]));

The constants here are mostly 2^29 - 2, which we add to something that is bounded by 2^28 · c. This gives 2^29 + 2^28 · c - 2 as our loose bounds, and 2^28 · c as our tight bounds. I guess we could be stricter about computing loose bounds, maybe I'll go try that.

EDIT by andres: corrected paper title I misremembered earlier

andres-erbsen commented 4 years ago

https://sourceforge.net/p/ed448goldilocks/code/ci/master/tree/src/p448/arch_32/f_impl.c might be good reference code

JasonGross commented 4 years ago

> The constants here are mostly 2^29 - 2, which we add to something that is bounded by 2^28 · c. This gives 2^29 + 2^28 · c - 2 as our loose bounds, and 2^28 · c as our tight bounds. I guess we could be stricter about computing loose bounds, maybe I'll go try that.

Wait, this is wrong. We have the following constraints:

  1. From addition, loose bounds on each limb are at least 2 · tight bounds
  2. From subtraction, loose bounds on each limb are at least tight bounds + 2^29 - 2
  3. Tight bounds must be at least 2^28
  4. From the paper cited above about multiplication (which takes in loose limbs), loose bounds over 2^28 must be at most 2^4 / √38 ≈ 2.596

From (2) and (3), we get that loose bounds must be at least 2^29 + 2^28 - 2. But then loose bounds over 2^28 is 2 + 1 - 2^-27 > 2.5. So this constraint system seems impossible to satisfy, at least assuming subtraction is tight -> loose. @andres-erbsen does this analysis make sense to you? And I'm having trouble understanding the code at that link. Does it implement ladderstep? / Can you tell if implementing curve arithmetic goes straight from subtraction into multiplication without doing any carries?

JasonGross commented 4 years ago

Interestingly, @kevaundray (whose emails started me looking into 32-bit p448) seems to be also using int128 at https://github.com/crate-crypto/Ed448-Goldilocks/blob/b8338be937ac1e990be8db2c9f1d83bad295d3e0/src/field/u32/karatsuba.rs#L16

kevaundray commented 4 years ago

Hey @JasonGross , Thanks for the mention.

In the past, I used u64 for the Karatsuba multiplication and reduce: https://github.com/crate-crypto/Ed448-Goldilocks/blob/3023e70227b9a1ab34a0d21b3094c9215e2553e7/src/field/base/karatsuba.rs#L13

This is exactly the same code in Rust as the reference implementation that @andres-erbsen linked.

In Rust, it required explicit `wrapping_*` arithmetic to avoid panics in debug mode, so I changed it to i128 for clarity.


To possibly answer some questions that you asked above:

From what I understand, the curve arithmetic layer has access to more methods from the field so that you can do field arithmetic optimisations on the curve layer:

So curve arithmetic used to look like the following in our Rust Ed448: https://github.com/crate-crypto/Ed448-Goldilocks/blob/098ffcd4c03836890c24e4669027fcb4fff87cb1/src/curve/twedwards/extensible.rs#L97

However, the most updated branch does a weak reduction after every operation on the curve arithmetic layer, because implementing it in the previous way would require a different implementation of the curve arithmetic layer for every representation of a field element you implement. This is because add_nr/sub_nr/bias on the curve arithmetic layer depend on the data type used to implement the limbs and on the amount of headroom available.

I believe that this is what the reference implementation does to squeeze out performance, by autogenerating the curve arithmetic layer with a constant named `GF_HEADROOM`, which denotes the maximum limb headroom available.

Here is a link to the updated Rust master which carries after each operation: https://github.com/crate-crypto/Ed448-Goldilocks/blob/master/src/curve/twedwards/extensible.rs#L75

Let me know if you spot any errors.

JasonGross commented 4 years ago

> In Rust, it required explicit `wrapping_*` arithmetic to avoid panics in debug mode, so I changed it to i128 for clarity.

Doesn't this change the semantics as well? To me this suggests a bug in the reference implementation...

> bias: Adds a multiple of p to the field element. [Included, but I believe for specific multiples of p]

Indeed, we always bias by 2p before subtracting. This is because biasing by p is never enough for unsaturated Solinas, and biasing by 2p is always enough because our tight bounds ensure that we don't exceed approximately 1.1*s (where s is the leading term in p).

> However, the most updated branch does a weak reduction after every operation on the curve arithmetic layer, because implementing it in the previous way would require a different implementation of the curve arithmetic layer for every representation of a field element you implement. This is because add_nr/sub_nr/bias on the curve arithmetic layer depend on the data type used to implement the limbs and on the amount of headroom available.

I'm not convinced by this reasoning. The last time I looked at it, it seemed like things would typecheck with no extra reduction given the operations

  bias_sub_no_reduce : tight -> loose
  add_no_reduce : tight -> loose
  relax : tight -> loose (a no-op, just present for bounds type-checking)
  carry_reduce_mul : loose -> tight

I thought you never needed headroom for more than a single addition/subtraction before multiplication+carrying+reduction, and that you always had headroom for at least one such addition/subtraction. Is this not the case?

> `GF_HEADROOM`

I could not easily find this constant. Do you have a pointer to the code that uses this to autogenerate the curve layer?

kevaundray commented 4 years ago

> Doesn't this change the semantics as well? To me this suggests a bug in the reference implementation...

I'm not 100% sure as I have not checked and do not completely understand the cancellation that happens with u64 accumulators.

> I'm not convinced by this reasoning. The last time I looked at it, it seemed like things would typecheck with no extra reduction given the operations
>
>   bias_sub_no_reduce : tight -> loose
>   add_no_reduce : tight -> loose
>   relax : tight -> loose (a no-op, just present for bounds type-checking)
>   carry_reduce_mul : loose -> tight

I'm not convinced by this reasoning either. Let me try again.

Theoretical Headroom

Subtraction

So the above was a combination of `bias_sub_no_reduce` and a `weak_reduction[carry]`.

Actual Headroom

The actual headroom is not 16 but 2 for 32 bits, because the headroom also takes into account the maximum valid input.

> I thought you never needed headroom for more than a single addition/subtraction before multiplication+carrying+reduction, and that you always had headroom for at least one such addition/subtraction. Is this not the case?

So I think that a single addition/subtraction without reducing would be fine before going into the multiplication procedure, as long as you can guarantee that the headroom of 2 is not exceeded.

> I could not easily find this constant. Do you have a pointer to the code that uses this to autogenerate the curve layer?

I double checked, and the code is not being generated, my bad. It is one file; however, there are if statements that check the `GF_HEADROOM` constant. https://sourceforge.net/p/ed448goldilocks/code/ci/master/tree/src/per_curve/decaf.tmpl.c#l342

Here is an example of it being defined: https://sourceforge.net/p/ed448goldilocks/code/ci/master/tree/src/p448/arch_32/f_impl.h#l5

Let me know if you spot any errors or if it is not convincing.

bitwiseshiftleft commented 4 years ago

@kevaundray summoned me by email.

What's limiting the headroom is the accumulation in the multiplication routine. That routine accumulates sums and differences of up to some number of products of limbs (it would be 448/28 = 16 products, but then reduction increases it). These values are accumulated into unsigned 64-bit integers, for which the starting headroom is 64 - 2·28 = 8 bits.

Overflows and underflows in the intermediate Karatsuba values are fine as long as they cancel out. For libdecaf in particular, various values are added and subtracted in the Karatsuba implementation. The final sum is always positive, but the additions and subtractions are done in an order that might wrap to a negative value on intermediates, but it will wrap back to positive by the end.

Edit: this means that i128 should also work, and won't break the semantics, because by the time you're shifting it the result is actually a 64-bit number. But of course i128 has worse performance.

What's important is that the final accumulated value for each place value doesn't overflow before the >>28, and this depends on the headroom. So the headroom is roughly h such that (h · 2^28)^2 · k < 2^64, meaning that h < sqrt(2^8 / k). It also depends on what exactly epsilon is, and you'll have to check the math exactly before coding it in Coq.

We could propagate carries earlier in the multiplication routine, or use a larger accumulator, both of which would increase the headroom. But increasing the headroom is a performance optimization (it saves weak_reduce calls) and propagating carries earlier also costs performance, so it might not be worth it.

As Kevaundray pointed out, libdecaf also contains a section where a value that's up to (3+epsilon)·p (or rather, up to (3+epsilon)·2^28 in each limb) is multiplied by a value that's up to (2+epsilon)·p, in violation of the headroom bound of 2. This is a bit sloppy, but in the end it's OK because what you really care about is the maximum product of two limbs. This product has to be less than (2^28)^2 · 2^8 / 38 or so, where 2^8 / 38 = 6.7368... So it's safe to multiply a value where the limbs are at most (3+epsilon)·2^28 by one where they are at most (2+epsilon)·2^28, since the product is still only just above 6·2^56.

Edit: multiplication vs italics.