Closed: JasonGross closed this issue 8 years ago.
I don't see an obvious nice way to do this; `carry` doesn't handle overflowing. Of the options you presented, I think (1) sounds most reasonable.
The current `carry` was written to be used in situations where the widths of the limbs are significantly below the widths of the underlying machine integers, for example 51-bit limbs in 64-bit registers. A couple of elementwise additions can be performed safely without carrying, and the engineer implementing something that uses addition would insert carries sparingly, just enough to make sure that accumulating bounds don't lead to overflow.
What you are asking for is something different. When using the full width of the machine words as the limb width, every single-element operation needs to be able to deal with carries internally. This lets you use fewer words and fewer arithmetic operations, but requires more carries and imposes significant data-dependency constraints on the evaluation order of the algorithm being implemented. We do not have a formalization of this strategy right now, but we will need it. In particular, I think cases 1 and 3 probably use the same code, and 2 might or might not. Now what we have to do is to design this.
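As a rough, hypothetical sketch of the difference between the two strategies (`add_unsaturated` and `add64_with_carry` are made-up names, not library code): with 51-bit limbs in 64-bit words a few elementwise additions need no carry at all, whereas with full-width limbs every limb addition must expose its carry.

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Import ListNotations.
Local Open Scope Z_scope.

(* Unsaturated representation: 51-bit limbs in 64-bit words.  A few
   elementwise additions stay well below 2^64, so no carry is needed yet. *)
Definition add_unsaturated (a b : list Z) : list Z :=
  map (fun p => fst p + snd p) (combine a b).

(* Saturated representation: 64-bit limbs.  Every limb addition has to
   produce its carry explicitly (modelled here over Z). *)
Definition add64_with_carry (carry_in a b : Z) : Z * Z :=
  let s := carry_in + a + b in
  (s mod 2^64, s / 2^64).  (* (low 64 bits, carry out) *)
```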
I think it would be best to use a bounded type for the limb width = word width case, because there really isn't a place where reasoning about bounds could be postponed. https://coq.inria.fr/library/Coq.Numbers.Cyclic.Abstract.CyclicAxioms.html#ZnZ.zero seems to have some carry logic (e.g. `spec_add_carry_c`). In our library, possible candidate types are `word 64` and `F (2^64)`. The rep invariant could be stated by mapping the elements of the tuple to Z and then asserting the `BaseSystem` rep invariant, but the correctness of all operations would need to reason about boundedness.
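A hypothetical shape for such a rep invariant; `toZ`, `base_rep`, and `rep` here are placeholders, not the library's `BaseSystem` definitions:

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Local Open Scope Z_scope.

Section BoundedRep.
  Variable limb : Type.                     (* e.g. word 64 or F (2^64)        *)
  Variable toZ : limb -> Z.                 (* projection to Z                 *)
  Variable base_rep : list Z -> Z -> Prop.  (* stand-in for the BaseSystem rep *)

  (* Every limb is a full 64-bit word, and the Z images satisfy the existing
     bignum-level invariant. *)
  Definition rep (xs : list limb) (x : Z) : Prop :=
    Forall (fun l => 0 <= toZ l < 2^64) xs /\ base_rep (map toZ xs) x.
End BoundedRep.
```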
> 4. Montgomery reduction (or almost Montgomery multiplication) with 256-bit words (needed for conditional subtract at the end)
As discussed yesterday, we want a module for creating wider fixed-width integer operations from given fixed-width operations. Then these wider operations would be used to instantiate bignum-based constructions like `ZLikeOps`, and the low-level operations would be mapped to machine instructions (followed by the usual low-level optimizations like changing `add-with-carry` to `add` whenever possible, instruction scheduling, possibly spilling, and register allocation).
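One possible shape for the "given fixed-width operations" such a module would consume; this record and its field names are only an illustrative sketch, not an interface from the library:

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* Hypothetical n-bit machine-word interface; wider (k*n-bit) operations would
   be built generically on top of these and then mapped to instructions. *)
Record word_ops (n : Z) (W : Type) := {
  toZ     : W -> Z;
  adc     : bool -> W -> W -> W * bool;  (* add with carry in and carry out *)
  mul_ext : W -> W -> W * W;             (* n x n -> 2n product as (lo, hi) *)
  shr     : W -> Z -> W;                 (* n x n -> n bit shift            *)
  mask    : W -> Z -> W                  (* bitwise and with a mask         *)
}.
```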
Assuming an underlying `n`-bit architecture, I envision the following constructions.
- To shift a `k*n`-bit word by `a*n` bits, drop the `k-a` words that are no longer relevant (no instructions generated).
- To shift a `k*n`-bit word by `a` bits using a `n x n -> n` bit shift operation, first drop the `a/n` words that are never used and then shift the rest of the words by `a mod n` bits in a single loop.
- To build a `k*n`-bit modulo `2^(a*n)` operation, drop `a` words (no code generated).
- To build a `k*n`-bit modulo `2^a` operation, drop `a/n` words and then use a bitwise and to mask out `a mod n` bits.
- To add `k*n`-bit words using `n`-bit addition that can input and output a carry bit, use the ripple carry construction. Subtraction can be built similarly. Both constructions should themselves both input and output a ripple carry bit. (See the sketch after this list, which also covers the word-dropping constructions above.)
- To build an `a*n x b*n` bit multiplier using a `n x n` bit multiplier and a `n x n` bit adder, use schoolbook multiplication and probably inline the construction of a `2n x 2n -> 2n` bit add-with-carry. The data flow in this computation is delicate due to carry bits. The `a x b` grid of partial products should be computed row-by-row but never stored; instead, `2b` accumulator variables corresponding to the sums of each column of the table should be maintained. Each step would include computing a new partial product, rippling in the carry from the previous partial product in the same row, and adding the new partial product to the column. Section 3 of http://link.springer.com.libproxy.mit.edu/chapter/10.1007%2F978-3-642-31662-3_9#page-1 gives a picture, and `curl https://raw.githubusercontent.com/sipa/Coin25519/master/src/crypto/ed25519/amd64-64-24k/fe25519_mul.s 2>/dev/null | grep qhasm` prints heavily instruction-scheduled code for pseudo-Mersenne multiplication on a 64-bit machine using this pattern, with one partial-product step being:
```
mulrax = *(uint64 *)(yp + 16) # load column-wise coefficient
(uint128) mulrdx mulrax = mulrax * mulx2 # mulx2 is the row-wise coefficient, loaded earlier
carry? mulr4 += mulrax # add low half of partial product to accumulator
mulrdx += 0 + carry # propagate carry to high half
carry? mulr4 += mulc # add high half of the previous partial product to accumulator
mulc = 0
mulc += mulrdx + carry # save high half of this partial product to be accumulated at the next partial product
```
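Over Z, the partial-product step above amounts to roughly the following (a hypothetical restatement; the 64-bit width comes from the snippet, the name `partial_product_step` is mine). `acc` is the current column accumulator (`mulr4`) and `hi_prev` is the high half carried over from the previous partial product in the same row (`mulc`):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* One schoolbook partial-product step: multiply the row coefficient x by the
   column coefficient y, add the low half and the previous high half into the
   column accumulator, and pass the (possibly incremented) high half on. *)
Definition partial_product_step (x y acc hi_prev : Z) : Z * Z :=
  let lo := (x * y) mod 2^64 in
  let hi := (x * y) / 2^64 in
  let s  := acc + lo + hi_prev in
  (s mod 2^64,        (* new accumulator value (mulr4 above)                  *)
   hi + s / 2^64).    (* high half plus carries, saved for the next step (mulc) *)
```

The per-step invariant this is meant to preserve is `new_acc + 2^64 * new_hi = acc + hi_prev + x * y`.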
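And here is a minimal Z-level sketch of the word-dropping and ripple-carry constructions from the list above, assuming limbs are stored least-significant word first; none of these names (`shift_words`, `mod_words`, `adcZ`, `add_ripple`) are library definitions:

```coq
Require Import Coq.ZArith.ZArith Coq.Lists.List.
Import ListNotations.
Local Open Scope Z_scope.

Section Constructions.
  Variable n : Z.  (* machine word width in bits *)

  (* Shift right by a whole number of words: drop the a low words. *)
  Definition shift_words (a : nat) (xs : list Z) : list Z := skipn a xs.

  (* Reduce modulo 2^(a*n): keep only the a low words. *)
  Definition mod_words (a : nat) (xs : list Z) : list Z := firstn a xs.

  (* n-bit add-with-carry, modelled over Z. *)
  Definition adcZ (c x y : Z) : Z * Z :=
    ((c + x + y) mod 2^n, (c + x + y) / 2^n).

  (* Ripple-carry addition: a carry in and a carry out are both threaded. *)
  Fixpoint add_ripple (c : Z) (xs ys : list Z) {struct xs} : list Z * Z :=
    match xs, ys with
    | x :: xs', y :: ys' =>
        let '(s, c') := adcZ c x y in
        let '(rest, c'') := add_ripple c' xs' ys' in
        (s :: rest, c'')
    | _, _ => ([], c)  (* assume equal lengths *)
    end.
End Constructions.
```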
> probably inline the construction of a `2n x 2n -> 2n` bit add-with-carry
I think we should actually parameterize over this: one wrinkle here is that it seems like x86_64 has a `64x64 -> 2*64` multiplier and a `64+64 -> 64` adder, while the machine I'm targeting has a `128x128 -> 1*256` multiplier and a `256+256 -> 256` adder. It would be nice to support both of these (having a same-width adder/multiplier vs. having a half-width multiplier) in whatever lemmas / definitions we write. For example, parameterize the multiplier over a function that splits the bits into the size of the underlying multiplier, ask for both a full-width adder and a half-width adder, and require a lemma saying that the half-width adder composed with the splitting function, ripple-carried across the array, is the same as the full-width adder ripple-carried across the original array; or else have automation that takes care of both lemmas.
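Purely as an illustration of the two shapes being compared (these records and names are hypothetical, not proposals for concrete definitions):

```coq
(* x86_64-style: half-width multiplier producing a two-word product, plus an
   add-with-carry of the same width. *)
Record halfwidth_ops (W64 : Type) := {
  mul64 : W64 -> W64 -> W64 * W64;          (* 64 x 64 -> 2*64, as (lo, hi) *)
  adc64 : bool -> W64 -> W64 -> W64 * bool  (* 64 + 64 + carry -> 64, carry *)
}.

(* Wide-target style: the multiplier's output already has the adder's width,
   so no rippling across product halves is needed. *)
Record fullwidth_ops (W128 W256 : Type) := {
  mul128 : W128 -> W128 -> W256;            (* 128 x 128 -> 1*256 *)
  add256 : W256 -> W256 -> W256             (* 256 + 256 -> 256   *)
}.
```

The connecting lemma would then say that `mul64` plus ripple-carried `adc64` over the split words computes the same result as `mul128`/`add256` on the unsplit words.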
You are right that this is not actually quite the same pattern. I don't have insight into which strategy for sharing code would be best; my first guess would be to write a specification for the piece of code that handles a single partial product, make two implementations of that, and share code for the fixpoint and the induction proof that put the multiplication together. But I expect this will become much clearer once you write both definitions; feel free to poke me to look at things any time.
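One possible form such a shared single-partial-product specification could take, over Z and parameterized by the word width `w` (a sketch under the assumptions above, not an actual library definition):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

(* Both the half-width (64-bit) and the wide (128/256-bit) implementations
   would be proved to satisfy this relation between inputs and outputs. *)
Definition partial_product_spec (w x y acc hi_prev acc' hi' : Z) : Prop :=
  0 <= acc' < 2^w /\
  acc' + 2^w * hi' = acc + hi_prev + x * y.
```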
I'm going to close this; we've taken a different path to utilizing add-with-carry instructions.
I'm trying to get an add-with-carry on `digits` that gives me both the resulting sum, modulo `upper_bound limb_widths`, and a boolean indicating whether or not there's a carry in the most significant bit. I'm looking for something that will be most reusable / best-suited to the underlying assembly. Options I see are:

1. change `carry_gen` to also return the value it carries
2. make variants of `carry_simple` and `carry_simple_sequence` that do the same

This is to instantiate `ZLikeOps` from #49.
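For concreteness, the requested operation could have roughly this shape (a sketch only; `add_with_carry_out`, `add`, and `carry_full` are hypothetical stand-ins for the existing digit addition and for whichever carry function ends up returning its top carry):

```coq
Require Import Coq.ZArith.ZArith.
Local Open Scope Z_scope.

Definition add_with_carry_out
           (add : list Z -> list Z -> list Z)    (* existing digit-wise addition        *)
           (carry_full : list Z -> list Z * Z)   (* carry chain returning the top carry *)
           (a b : list Z) : list Z * bool :=
  let '(s, c) := carry_full (add a b) in
  (s, negb (c =? 0)).  (* sum modulo upper_bound limb_widths, and the carry bit *)
```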