leto / math--primality

Advanced primality-checking algorithms
http://duke.leto.net

Implement optimized AKS #10

Open leto opened 11 years ago

leto commented 11 years ago

From this paper:

http://www.cse.iitk.ac.in/users/manindra/algebra/primality_v6.pdf

Should we skip this and just implement the even more optimized algorithms of DJB?

danaj commented 11 years ago

Doing the v6 implementation should be easy -- it's just a change in the r calculations. The polynomial tests are completely identical -- we just do a lot fewer of them (and the polynomials have smaller degree). It certainly won't hurt to do it. (Note that their speculation in section 6 on the Bhattacharjee conjecture has been shown to be false -- see Popovych 2008 for instance -- but this has no real bearing on the implementation.)
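
Roughly, the r selection becomes something like this (a quick pure-Perl sketch of my own with native integers, not the module's code):

    use strict;
    use warnings;

    # v6 step 2, sketched: find the smallest r such that the multiplicative
    # order of n mod r exceeds (log2 n)^2.
    sub gcd { my ($a, $b) = @_; ($a, $b) = ($b, $a % $b) while $b; return $a }

    sub find_r {
        my $n     = shift;
        my $log2n = log($n) / log(2);
        my $limit = int($log2n * $log2n);       # ord_r(n) must exceed this
        my $r     = 1;
        CANDIDATE: while (1) {
            $r++;
            next if gcd($n, $r) != 1;           # shared factors are handled elsewhere
            my $pow = 1;
            for (1 .. $limit) {
                $pow = ($pow * ($n % $r)) % $r;
                next CANDIDATE if $pow == 1;    # order too small, try the next r
            }
            return $r;                          # ord_r(n) > (log2 n)^2
        }
    }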

The polynomials are where almost all of the time is spent. See Crandall and Papadopoulos (2003):

http://images.apple.com/acg/pdf/aks3.pdf

for some ideas. In particular, the polynomial math can be done by packing the polynomial into a big GMP integer and squaring that as needed, taking advantage of GMP's big-integer multiplication (which keeps improving in newer versions). Both of Bernstein's papers give outlines of this method, so I suspect we'll run into it in any AKS implementation.
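
The packing idea, roughly: evaluate the polynomial at a power of two wide enough that coefficients of the square can't overlap, square once, then split the result back up and reduce. A self-contained sketch of mine using core Math::BigInt (the real thing would hand the one big multiplication to GMP; the slot width is just a safe bound from the coefficient sizes, and n is assumed native-sized here):

    use strict;
    use warnings;
    use Math::BigInt;

    # Square a polynomial (coefficients already reduced mod n, degree < r)
    # modulo (x^r - 1, n) with a single big-integer multiplication.
    sub poly_sqr_via_bigint {
        my ($coeffs, $r, $n) = @_;
        # Each coefficient of the square is < r * n^2, so give every slot
        # enough bits that neighbouring slots cannot overlap.
        my $slot = length(sprintf "%b", $r) + 2 * length(sprintf "%b", $n) + 1;

        # Pack: P = sum coeffs[i] * 2^(slot*i)
        my $packed = Math::BigInt->bzero;
        for my $i (reverse 0 .. $#$coeffs) {
            $packed->blsft($slot)->badd($coeffs->[$i]);
        }

        my $squared = $packed->copy->bmul($packed);   # the one big multiplication

        # Unpack, reducing each slot mod n and folding exponents mod r
        # (i.e. reducing by x^r - 1).
        my @out  = (0) x $r;
        my $mask = Math::BigInt->bone->blsft($slot)->bsub(1);
        my $k    = 0;
        until ($squared->is_zero) {
            my $c = $squared->copy->band($mask);
            $out[$k % $r] = ($out[$k % $r] + $c->bmod($n)) % $n;
            $squared->brsft($slot);
            $k++;
        }
        return \@out;
    }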

There are a number of other improvements to the r calculation given by Bernstein's first AKS paper (http://cr.yp.to/papers/aks.pdf). Some of those look interesting to try out.

Bernstein's second paper (http://cr.yp.to/primetests/quartic-20060914-ams.pdf) is the really interesting one for AKS performance.

I'll also point out that Cohen's "A Course in Computational Algebraic Number Theory" presents the APR test and ECPP. Riesel has a little bit about them, but no real algorithm detail.

http://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866102-2/S0025-5718-1987-0866102-2.pdf

is the primary APR-CL reference. Pari has an implementation of course.

For ECPP, there are various rudimentary implementations floating around. I really need to get busy and finish mine for Math::Prime::Util -- I think I left it about 60% done. It's really basic but should still handle 100+ digit primes. I think ECPP is the fastest and most practical of the primality proving methods these days.

danaj commented 11 years ago

I did a v6 implementation on my fork. There's one decision point that I struggle with in my code, which is whether to change the trial division test right after selecting r. The paper only does the completely trivial test of asking whether n <= r. However, we've just finished doing trial division for all primes <= r, so we can trivially extend that to answer whenever r*r >= n. This makes us answer quickly all the way up to 85990. It doesn't make any difference for really big numbers, but it does mean the poly tests don't get exercised until much later, which makes testing them slower. Hence the decision.

If the sqrt test is done, then we answer quickly up until 85991, which is the first number that makes us run the poly code. But at that point we take ~30 minutes to finish, so testing is a bear.

If the sqrt test is not done, then we run the poly code even for many small numbers like 31. Testing can now be done in a reasonable time, but we're very restricted in what we can run in the actual test suite without making users wonder whether their system hung.
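
The decision point boils down to something like this (names are illustrative, not what's in the module), assuming trial division by every prime <= r has already been done:

    # After selecting r and trial dividing by all primes <= r:
    sub early_exit {
        my ($n, $r, $use_sqrt_test) = @_;
        return 'prime' if $n <= $r;                          # the paper's trivial test
        # Trial division already ruled out every prime factor <= r, so if
        # r*r >= n there can be no factor at all.
        return 'prime' if $use_sqrt_test && $r * $r >= $n;
        return 'unknown';                                    # fall through to the poly tests
    }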

I ended up leaving the code in, after running:

perl -IMath-Primality-0.09/lib -MMath::Primality -MMath::Primality::AKS -E 'do { die "$_" unless !!Math::Primality::is_prime($_) == !!Math::Primality::AKS::is_aks_prime($_) } for 1 .. 100000;'

for a while with it off. It tests up through 200 in about 4 minutes, and 10 minutes for the next 200.

The only changes are to the r and polylimit calculations. The A,N,R test loop is unchanged, as is the poly code.
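
For reference, the v6 bound on a (the polylimit) is floor(sqrt(phi(r)) * log2 n), where phi is Euler's totient; a naive sketch (again mine, not the module's code):

    sub polylimit {
        my ($n, $r) = @_;
        my $phi = 0;                                   # Euler's totient of r, naively
        for my $a (1 .. $r) {
            my ($x, $y) = ($a, $r);
            ($x, $y) = ($y, $x % $y) while $y;         # gcd($a, $r)
            $phi++ if $x == 1;
        }
        return int(sqrt($phi) * log($n) / log(2));
    }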

Times to run AKS on 85991 (all code does the poly test):

     1.5s   Math::Prime::Util XS  [C]
    10.1s   Math::Prime::Util GMP  [C+GMP]
   160.6s   Math::Prime::Util PP  [Perl using Math::BigInt::GMP]
    1941s   Math::Primality (with v6)

Math::Prime::Util does as much work as it can in native integers, and this is a 64-bit machine. On 32-bit machines, inputs > 65536 start making the XS and PP code run slower. By the way, the Rotella code comes up with 1190 for the polylimit instead of 289, which means 4x as many tests run on polynomials with 4x higher degree. It would be much slower.

danaj commented 11 years ago

I just committed a lot of changes to BigPolynomial. It runs faster, but there were also some other issues: isEqual was not comparing the last term, I don't think mpz_poly_mod_power was generating the right result, and -- most important -- we never saw it because we were comparing == instead of != in the AKS loop!
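
To make the comparison direction concrete, here is a stripped-down pure-Perl version of the test loop -- native integers only, so it is only usable for small n, and it is structured quite differently from BigPolynomial. The point is that a single coefficient mismatch has to make us bail out with "composite":

    # Multiply two polynomials (coefficient arrays of length r) mod (x^r - 1, n).
    sub poly_mulmod {
        my ($p, $q, $r, $n) = @_;
        my @res = (0) x $r;
        for my $i (0 .. $r - 1) {
            next unless $p->[$i];
            for my $j (0 .. $r - 1) {
                next unless $q->[$j];
                my $k = ($i + $j) % $r;                # x^r == 1
                $res[$k] = ($res[$k] + $p->[$i] * $q->[$j]) % $n;
            }
        }
        return \@res;
    }

    # Compute (x + a)^n mod (x^r - 1, n) by square-and-multiply.
    sub poly_powmod {
        my ($a, $n, $r) = @_;
        my @base = (0) x $r;  $base[0] = $a % $n;  $base[1] = 1 % $n;
        my @one  = (0) x $r;  $one[0]  = 1 % $n;
        my ($b, $res, $e) = (\@base, \@one, $n);
        while ($e) {
            $res = poly_mulmod($res, $b, $r, $n) if $e & 1;
            $b   = poly_mulmod($b, $b, $r, $n);
            $e >>= 1;
        }
        return $res;
    }

    # The A loop: (x + a)^n must equal x^(n mod r) + a mod (x^r - 1, n) for
    # every a up to the polylimit; any mismatch means n is composite.
    sub aks_congruences_hold {
        my ($n, $r, $polylimit) = @_;
        for my $a (1 .. $polylimit) {
            my $lhs = poly_powmod($a, $n, $r);
            my @rhs = (0) x $r;
            $rhs[0] = $a % $n;
            $rhs[$n % $r] = ($rhs[$n % $r] + 1) % $n;
            for my $i (0 .. $r - 1) {
                return 0 if $lhs->[$i] != $rhs[$i];    # != here, not ==
            }
        }
        return 1;
    }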

I left the mulmod structure the same, so it still follows the Rotella design. I think redoing the design may be faster, but that's for later. A lot more testing is still needed on this.

New time: 271s for 85991 (down from 1941s).

Edit: Just checked in the changed mulmod as well as a few simple bigpoly tests. It could use more.

leto commented 11 years ago

Awesome! Can you send a pull request for your changes?

danaj commented 11 years ago

I just converted Math::Prime::Util::GMP's poly multiplication routine to use binary segmentation, and it's much faster now. AKS as a whole sped up 10-50x, and looks faster than my straight C code for almost all inputs. This is still on github -- I did it right after the 0.09 CPAN release.

The version that creates the big number using shifts+adds should be easily portable to Math::GMPz, and sped things up ~10x. Even with the speedup, callgrind showed the majority of the time being spent fiddling with the poly->p and p->poly conversions instead of the key step p*p. I implemented the poly->p and p->poly steps using import/export as suggested by Bernstein's updated Quartic paper, and that reduced the overhead a lot. With Perl+Math::GMPz there's probably a clever way to build/split the number using string joins if nothing else.
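
Something along these lines might work for the build/split step -- sticking to core Math::BigInt here since I haven't checked the exact Math::GMPz import/export calls. The slot width has to be wide enough that product coefficients can't overlap, yet small enough that hex() can read a slot back natively:

    use strict;
    use warnings;
    use Math::BigInt;

    # Pack coefficients into one big integer by joining fixed-width hex
    # strings (highest coefficient first), and split the result back apart.
    sub pack_poly_hex {
        my ($coeffs, $width) = @_;                    # $width hex digits per slot
        my $hex = join '', map { sprintf "%0${width}x", $_ } reverse @$coeffs;
        return Math::BigInt->from_hex('0x' . $hex);
    }

    sub unpack_poly_hex {
        my ($big, $width) = @_;
        my $hex = $big->as_hex;                       # "0x...", leading zeros dropped
        $hex =~ s/^0x//;
        $hex = ('0' x (-length($hex) % $width)) . $hex;    # pad to a slot boundary
        return [ reverse map { hex $_ } $hex =~ /(.{$width})/g ];
    }

    # Round trip -- the squaring would happen between the two calls:
    #   my $p  = pack_poly_hex([3, 1, 4], 8);
    #   my $p2 = $p->copy->bmul($p);
    #   my $c  = unpack_poly_hex($p2, 8);    # coefficients of the square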

This now completes 85991 in 1.2 seconds, 10^9+7 in 13s, and 10^19+51 in 19 minutes -- these are much faster than before. Brent's 2010 presentation here:

http://maths-people.anu.edu.au/~brent/pd/UMS10t4.pdf

makes it look like Bernstein's alternative could be faster yet.

All of this is still rather sad compared to ECPP or even n-1. Brent's presentation indicates the same thing, and states in the conclusions, "AKS is not a practical algorithm. ECPP is much faster." I also have my n-1 proofs outputting easily verified certificates, and can verify ECPP certificates. Verifying AKS certificates looks essentially equal in time to producing them, meaning it would be really slow.

For actual numbers from Brent's presentation:

Note: the ECPP numbers are from a work-in-progress implementation. Primo (state-of-the-art ECPP and n-1/n+1) takes 0.81 seconds for 2**1023+1155 including certificate generation. My n-1 prover uses BLS75 theorem 5 (I recently changed it to allow theorem 7, but it doesn't help much, so it's compiled out by default). Brent's implementation uses Bernstein's improvements but runs on a much slower computer than mine. Memory use for AKS is also problematic -- ~3GB for the large example (1M-element polynomials with 308 digits per coefficient).

danaj commented 10 years ago

In case someone wants to work on this some more: Folkmar Bornemann's AKS implementation from late 2002 is interesting to see (see also Bernstein's 2003 AKS paper). It includes some improvements from Bernstein and Voloch that speed things up about 200x over the V6 implementation. It requires a bit more work, but I got versions in C and C+GMP working in less than a day each (starting with a V6 implementation and some infrastructure).

As I mentioned in a previous comment, the V6 algorithm is pretty straightforward, and it has the advantage that it should be clear how to implement it correctly. The papers describing the various additional improvements tend to be written with much less regard for implementation (lots of details are left to the reader), so there is more of a question of whether an implementation correctly identifies all composites. There is probably room for an undergrad paper / project to rigorously combine the proofs and implementation.

Repeating the timings from above, using these AKS improvements in my GMP code. The ECPP code is what has been in the CPAN module for a few months. [Edit: one has to be careful of overflowing logs with GMP -- retiming]

  n             N-1        AKS              ECPP
  10^19+51      .000056s   3s               .000011s
  10^49+9       .040s      29m              .0046s
  10^100+267    80.1s      44h (est)        .030s
  2^511+111     6.5m       19 days (est)    0.122s
  2^1023+1155   days?      4.5 yrs (est)    1.56s

The AKS times are better than the previous ones (so are the ECPP times). Memory use is much better also (though higher than APR-CL or ECPP). AKS should be almost trivial to parallelize or distribute. ECPP can be parallelized (e.g. Primo) but isn't quite so easy.

N-1 with some good factoring code works extremely well for numbers under 128 bits or so, but it relies on factoring so doesn't scale well in general. AKS has the advantage over the method from the 1970s as the number grows. N-1 and ECPP have the significant advantage of being able to produce certificates.

APR-CL and ECPP have been the workhorses for quite a few years now. They're still leading handily, but this does give some idea that perhaps an implementation with all of Bernstein's 2003 improvements wouldn't be so bad, and would still have some pedagogical value (as would his 2006 quartic randomized method, which I'm not sure anyone has implemented).