Closed: JasonGross closed this issue 8 years ago.
I don't see an obvious nice way to do this; `carry` doesn't handle overflowing. Of the options you presented, I think (1) sounds most reasonable.
The current `carry` was written to be used in situations where the widths of the limbs are significantly below the widths of the underlying machine integers, for example 51-bit limbs in 64-bit registers. A couple of elementwise additions can be performed safely without carrying, and the engineer implementing something that uses addition would insert carries sparingly, just enough to make sure that accumulating bounds don't lead to overflow.
What you are asking for is something different. When using the full width of the machine words as the limb width, every single-element operation needs to be able to deal with carries internally. This lets you use fewer words and fewer arithmetic operations, but requires more carries and imposes significant data-dependency constraints on the evaluation order of the algorithm being implemented. We do not have a formalization of this strategy right now, but we will need it. In particular, I think cases 1 and 3 probably use the same code, and 2 might or might not. Now what we have to do is to design this.
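As a rough, hypothetical sketch of the difference between the two strategies (`add_unsaturated` and `add64_with_carry` are made-up names, not library code): with 51-bit limbs in 64-bit words a few elementwise additions need no carry at all, whereas with full-width limbs every limb addition must expose its carry.

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Import ListNotations.
Local Open Scope Z_scope.

(* Unsaturated representation: 51-bit limbs in 64-bit words.  A few
   elementwise additions stay well below 2^64, so no carry is needed yet. *)
Definition add_unsaturated (a b : list Z) : list Z :=
  map (fun p => fst p + snd p) (combine a b).

(* Saturated representation: 64-bit limbs.  Every limb addition has to
   produce its carry explicitly (modelled here over Z). *)
Definition add64_with_carry (carry_in a b : Z) : Z * Z :=
  let s := carry_in + a + b in
  (s mod 2^64, s / 2^64).  (* (low 64 bits, carry out) *)
```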
I think it would be best to use a bounded type for the limb width = word width case, because there really isn't a place where reasoning about bounds could be postponed. https://coq.inria.fr/library/Coq.Numbers.Cyclic.Abstract.CyclicAxioms.html#ZnZ.zero seems to have some carry logic (e.g. `spec_add_carry_c`). In our library, possible candidate types are `word 64` and `F (2^64)`. The rep invariant could be stated by mapping the elements of the tuple to Z and then asserting the `BaseSystem` rep invariant, but the correctness of all operations would need to reason about boundedness.
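A hypothetical shape for such a rep invariant; `toZ`, `base_rep`, and `rep` here are placeholders, not the library's `BaseSystem` definitions:

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Local Open Scope Z_scope.

Section BoundedRep.
  Variable limb : Type.                     (* e.g. word 64 or F (2^64)        *)
  Variable toZ : limb -> Z.                 (* projection to Z                 *)
  Variable base_rep : list Z -> Z -> Prop.  (* stand-in for the BaseSystem rep *)

  (* Every limb is a full 64-bit word, and the Z images satisfy the existing
     bignum-level invariant. *)
  Definition rep (xs : list limb) (x : Z) : Prop :=
    Forall (fun l => 0 <= toZ l < 2^64) xs /\ base_rep (map toZ xs) x.
End BoundedRep.
```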
> 4. Montgomery reduction (or almost Montgomery multiplication) with 256-bit words (needed for conditional subtract at the end)
As discussed yesterday, we want a module for creating wider fixed-width integer operations from given fixed-width operations. Then these wider operations would be used to instantiate bignum-based constructions like `ZLikeOps`, and the low-level operations would be mapped to machine instructions (followed by the usual low-level optimizations like changing `add-with-carry` to `add` whenever possible, instruction scheduling, possibly spilling, and register allocation).
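One possible shape for the "given fixed-width operations" such a module would consume; this record and its field names are only an illustrative sketch, not an interface from the library:

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* Hypothetical n-bit machine-word interface; wider (k*n-bit) operations would
   be built generically on top of these and then mapped to instructions. *)
Record word_ops (n : Z) (W : Type) := {
  toZ     : W -> Z;
  adc     : bool -> W -> W -> W * bool;  (* add with carry in and carry out *)
  mul_ext : W -> W -> W * W;             (* n x n -> 2n product as (lo, hi) *)
  shr     : W -> Z -> W;                 (* n x n -> n bit shift            *)
  mask    : W -> Z -> W                  (* bitwise and with a mask         *)
}.
```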
Assuming an underlying `n`-bit architecture, I envision the following constructions.
- To shift a `k*n`-bit word by `a*n` bits, drop the `k-a` words that are no longer relevant (no instructions generated).
- To shift a `k*n`-bit word by `a` bits using a `n x n -> n` bit shift operation, first drop the `a/n` words that are never used and then shift the rest of the words by `a mod n` bits in a single loop.
- To build a `k*n`-bit modulo `2^(a*n)` operation, drop `a` words (no code generated).
- To build a `k*n`-bit modulo `2^a` operation, drop `a/n` words and then use a bitwise and to mask out `a mod n` bits.
- To add `k*n`-bit words using `n`-bit addition that can input and output a carry bit, use the ripple carry construction. Subtraction can be built similarly. Both constructions should themselves both input and output a ripple carry bit. (See the sketch after this list, which also covers the word-dropping constructions above.)
- To build an `a*n x b*n` bit multiplier using a `n x n` bit multiplier and a `n x n` bit adder, use schoolbook multiplication and probably inline the construction of a `2n x 2n -> 2n` bit add-with-carry. The data flow in this computation is delicate due to carry bits. The `a x b` grid of partial products should be computed row-by-row but never stored; instead, `2b` accumulator variables corresponding to the sums of each column of the table should be maintained. Each step would include computing a new partial product, rippling in the carry from the previous partial product in the same row, and adding the new partial product to the column. Section 3 of http://link.springer.com.libproxy.mit.edu/chapter/10.1007%2F978-3-642-31662-3_9#page-1 gives a picture, and `curl https://raw.githubusercontent.com/sipa/Coin25519/master/src/crypto/ed25519/amd64-64-24k/fe25519_mul.s 2>/dev/null | grep qhasm` prints heavily instruction-scheduled code for pseudo-Mersenne multiplication on a 64-bit machine using this pattern, with one partial-product step being:
```
mulrax = *(uint64 *)(yp + 16) # load column-wise coefficient
(uint128) mulrdx mulrax = mulrax * mulx2 # mulx2 is the row-wise coefficient, loaded earlier
carry? mulr4 += mulrax # add low half of partial product to accumulator
mulrdx += 0 + carry # propagate carry to high half
carry? mulr4 += mulc # add high half of the previous partial product to accumulator
mulc = 0
mulc += mulrdx + carry # save high half of this partial product to be accumulated at the next partial product
```
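Over Z, the partial-product step above amounts to roughly the following (a hypothetical restatement; the 64-bit width comes from the snippet, the name `partial_product_step` is mine). `acc` is the current column accumulator (`mulr4`) and `hi_prev` is the high half carried over from the previous partial product in the same row (`mulc`):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* One schoolbook partial-product step: multiply the row coefficient x by the
   column coefficient y, add the low half and the previous high half into the
   column accumulator, and pass the (possibly incremented) high half on. *)
Definition partial_product_step (x y acc hi_prev : Z) : Z * Z :=
  let lo := (x * y) mod 2^64 in
  let hi := (x * y) / 2^64 in
  let s  := acc + lo + hi_prev in
  (s mod 2^64,        (* new accumulator value (mulr4 above)                  *)
   hi + s / 2^64).    (* high half plus carries, saved for the next step (mulc) *)
```

The per-step invariant this is meant to preserve is `new_acc + 2^64 * new_hi = acc + hi_prev + x * y`.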
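And here is a minimal Z-level sketch of the word-dropping and ripple-carry constructions from the list above, assuming limbs are stored least-significant word first; none of these names (`shift_words`, `mod_words`, `adcZ`, `add_ripple`) are library definitions:

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Import ListNotations.
Local Open Scope Z_scope.

Section Constructions.
  Variable n : Z.  (* machine word width in bits *)

  (* Shift right by a whole number of words: drop the a low words. *)
  Definition shift_words (a : nat) (xs : list Z) : list Z := skipn a xs.

  (* Reduce modulo 2^(a*n): keep only the a low words. *)
  Definition mod_words (a : nat) (xs : list Z) : list Z := firstn a xs.

  (* n-bit add-with-carry, modelled over Z. *)
  Definition adcZ (c x y : Z) : Z * Z :=
    ((c + x + y) mod 2^n, (c + x + y) / 2^n).

  (* Ripple-carry addition: a carry in and a carry out are both threaded. *)
  Fixpoint add_ripple (c : Z) (xs ys : list Z) {struct xs} : list Z * Z :=
    match xs, ys with
    | x :: xs', y :: ys' =>
        let '(s, c') := adcZ c x y in
        let '(rest, c'') := add_ripple c' xs' ys' in
        (s :: rest, c'')
    | _, _ => ([], c)  (* assume equal lengths *)
    end.
End Constructions.
```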
> probably inline the construction of a `2n x 2n -> 2n` bit add-with-carry
I think we should actually parameterize over this: one wrinkle here is that it seems like x86_64 has a `64x64 -> 2*64` multiplier and a `64+64 -> 64` adder, while the machine I'm targeting has a `128x128 -> 1*256` multiplier and a `256+256 -> 256` adder. It would be nice to support both of these (having a same-width adder/multiplier vs. having a half-width multiplier) in whatever lemmas / definitions we write. For example, parameterize the multiplier over a function that splits the bits into the size of the underlying multiplier, ask for both a full-width adder and a half-width adder, and require a lemma saying that the half-width adder composed with the splitting function, ripple-carried across the array, is the same as the full-width adder ripple-carried across the original array; or else have automation that takes care of both lemmas.
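Purely as an illustration of the two shapes being compared (these records and names are hypothetical, not proposals for concrete definitions):

```coq
(* x86_64-style: half-width multiplier producing a two-word product, plus an
   add-with-carry of the same width. *)
Record halfwidth_ops (W64 : Type) := {
  mul64 : W64 -> W64 -> W64 * W64;          (* 64 x 64 -> 2*64, as (lo, hi) *)
  adc64 : bool -> W64 -> W64 -> W64 * bool  (* 64 + 64 + carry -> 64, carry *)
}.

(* Wide-target style: the multiplier's output already has the adder's width,
   so no rippling across product halves is needed. *)
Record fullwidth_ops (W128 W256 : Type) := {
  mul128 : W128 -> W128 -> W256;            (* 128 x 128 -> 1*256 *)
  add256 : W256 -> W256 -> W256             (* 256 + 256 -> 256   *)
}.
```

The connecting lemma would then say that `mul64` plus ripple-carried `adc64` over the split words computes the same result as `mul128`/`add256` on the unsplit words.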
You are right that this is not actually quite the same pattern. I don't have insight into which strategy for sharing code would be best; my first guess would be to write a specification for the piece of code that handles a single partial product, make two implementations of that, and share code for the fixpoint and the induction proof that put the multiplication together. But I expect this will become much clearer once you write both definitions; feel free to poke me to look at things any time.
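One possible form such a shared single-partial-product specification could take, over Z and parameterized by the word width `w` (a sketch under the assumptions above, not an actual library definition):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* Both the half-width (64-bit) and the wide (128/256-bit) implementations
   would be proved to satisfy this relation between inputs and outputs. *)
Definition partial_product_spec (w x y acc hi_prev acc' hi' : Z) : Prop :=
  0 <= acc' < 2^w /\
  acc' + 2^w * hi' = acc + hi_prev + x * y.
```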
I'm going to close this; we've taken a different path to utilizing add-with-carry instructions.
I'm trying to get an add-with-carry on `digits` that gives me both the resulting sum, modulo `upper_bound limb_widths`, and a boolean indicating whether or not there's a carry in the most significant bit. I'm looking for something that will be most reusable / best-suited to the underlying assembly. Options I see are:

1. change `carry_gen` to also return the value it carries
2. make variants of `carry_simple` and `carry_simple_sequence` that do the same

This is to instantiate `ZLikeOps` from #49.
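For concreteness, the requested operation could have roughly this shape (a sketch only; `add_with_carry_out`, `add`, and `carry_full` are hypothetical stand-ins for the existing digit addition and for whichever carry function ends up returning its top carry):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

Definition add_with_carry_out
           (add : list Z -> list Z -> list Z)    (* existing digit-wise addition        *)
           (carry_full : list Z -> list Z * Z)   (* carry chain returning the top carry *)
           (a b : list Z) : list Z * bool :=
  let '(s, c) := carry_full (add a b) in
  (s, negb (c =? 0)).  (* sum modulo upper_bound limb_widths, and the carry bit *)
```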