videlec closed this ticket 10 years ago
Branch: u/vdelecroix/16374
Hi,
I implemented a much better version than my proposal in #16308 (linear complexity in the input for two_squares_pyx). It is not yet hooked into arith.py; I will do this when #16308 is merged.
timings:
sage: from sage.rings.arith_pyx import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
5 loops, best of 3: 46 ms per loop
Best, Vincent
New commits:
7cf5fff: trac #16374: new file arith_pyx.pyx
Branch pushed to git repo; I updated commit sha1. New commits:
f1354e6: trac #16374: two small optimization
Hello!
I added a commit in u/ncohen/16374 which contains some tests and comments.
I have two questions though:
- Why ii <= n/2 and not ii <= jj?
- Why a "while" instead of an "if" in the following code?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero)
while j*j > n:
j -= 1
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :-P)
sage: i = 5
sage: ii = i**2
sage: i**2 == ii
True
sage: ii+=2*i+1
sage: i+=1
sage: i**2 == ii
True
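For illustration, here is roughly how that identity would be used inside a search loop; this is a sketch of the idea only (the bound n and the loop body are hypothetical, not code from the branch):

n = 10**4                # hypothetical example bound
i, ii = 0, 0
while ii <= n:
    # ... test the candidate square ii == i**2 here ...
    ii += 2*i + 1        # (i+1)**2 == i**2 + 2*i + 1: next square by addition
    i += 1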
Nathann
Hi Nathann,
Thanks for reading it.
I added a commit in u/ncohen/16374 which contains some tests and comments.
Great. I also have something important to modify in three_squares_pyx and four_squares_pyx: it is much faster to first remove a huge square and then try to decompose the rest into a sum of fewer squares. The result is not the lexicographically smallest one, but the timings are 4x better.
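In pure Python the strategy looks roughly like this (a sketch under hypothetical names, not the code of the branch; math.isqrt stands in for the C-level square root, and each loop steps the largest square down when the greedy choice leaves a non-decomposable remainder, e.g. 23 = 16 + 7 fails since 7 is not a sum of three squares):

import math

def two_squares(n):
    # try the largest summand first; by symmetry we only need a**2 >= n/2
    a = math.isqrt(n)
    while 2*a*a >= n:
        b = math.isqrt(n - a*a)
        if a*a + b*b == n:
            return (b, a)
        a -= 1
    return None

def three_squares(n):
    c = math.isqrt(n)
    while 3*c*c >= n:
        r = two_squares(n - c*c)
        if r is not None:
            return r + (c,)
        c -= 1
    return None

def four_squares(n):
    d = math.isqrt(n)
    while True:          # Lagrange's theorem guarantees termination with d >= 0
        r = three_squares(n - d*d)
        if r is not None:
            return r + (d,)
        d -= 1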
It seems that you like C function declarations of the form int f(int res[3]). It makes the specification clearer, but after compilation the results are strictly equivalent.
I have two questions though:
- Why ii <= n/2 and not ii <= jj?
Right.
- Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j -= 1
I am not sure of the specification (if any) of sqrt, which deals with doubles... I will try to improve it.
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :-P)
I thought about it. I am not sure that ii += 2*i + 1 is cheaper than ii = i*i. I will have a look.
Vincent
Branch pushed to git repo; I updated commit sha1. New commits:
a2b31ca: trac #16374: reviewer comment + further optimization
Hi,
I confirm that there is no timing difference between ii = i*i and ii += 2*i + 1. Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3". The new commit contains the modifications addressing your comments and mine.
New timings:
sage: from sage.rings.arith_pyx import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11.3 ms per loop
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Vincent
Changed dependencies from #16308 to none
Instead of adding a new module arith_pyx, either give the module a more descriptive name or add these functions to fast_arith.pyx instead.

Replying to @videlec:
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Sure, write the most readable code and let the compiler do its job.
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Even better: make these two tickets independent and then make a third ticket to integrate them.
I would recommend to use unsigned long instead of instead int. I could imagine that the sum of 4 squares function is useful for numbers larger than 2^32.
Also a typo: plateform -> platform
Replying to @jdemeyer:
I would recommend to use unsigned long instead of instead int. I could imagine that the sum of 4 squares function is useful for numbers larger than 2^32.
Of course. But for large entries, we had better use the *_squares from #16308. I will do some timings, but if we lose something on small integers I will not change it.
Also a typo: plateform -> platform
Thank you. I will modify it.
What's the point of <unsigned int> sqrt(<double> n + .5)? Are you worried that the libm sqrt function does not round correctly? I believe that IEEE-754 requires this.
Replying to @jdemeyer:
What's the point of <unsigned int> sqrt(<double> n + .5)? Are you worried that the libm sqrt function does not round correctly? I believe that IEEE-754 requires this.
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?
Replying to @videlec:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?
It is certainly not a compiler issue, but a libm/hardware issue. I would expect IEEE-754 and would remove the + 0.5.
Replying to @jdemeyer:
Replying to @videlec:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?

It is certainly not a compiler issue, but a libm/hardware issue. I would expect IEEE-754 and would remove the + 0.5.
Yes. Integers up to 2^53 - 1 can be represented exactly in a double, at least on the plat[e]forms we support.
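A quick check of that boundary in Python (2**53 is the first integer where the double mantissa runs out):

float(2**53 - 1) == 2**53 - 1    # True: every integer below 2**53 is an exact double
float(2**53 + 1) == 2**53 + 1    # False: 2**53 + 1 rounds to 2**53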
I like the type instead int btw. (Google would have asked: "Did you mean instant int?", or probably "insteady int"?)
Regarding that, I'd write versions for / using integers of known, platform-independent width, that is, use uint8_t, uint16_t, uint32_t etc., with a wrapper function that delegates to the appropriate function, or use if-then-elif-else such that the compiler is aware of the exact range of numbers. (Separate functions could more easily be checked for overflow conditions as well.)
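A Cython skeleton of that delegation idea (the names and the stubbed-out bodies are mine, purely illustrative):

from libc.stdint cimport uint32_t, uint64_t

cdef int _four_squares_32(uint32_t n, uint32_t res[4]):
    # ... search using 32-bit arithmetic, fast on every platform ...
    return 0

cdef int _four_squares_64(uint64_t n, uint64_t res[4]):
    # ... the same search using 64-bit arithmetic ...
    return 0

def four_squares_dispatch(unsigned long n):
    cdef uint32_t r32[4]
    cdef uint64_t r64[4]
    if n <= <uint64_t>0xffffffff:    # input fits in 32 bits: take the narrow path
        _four_squares_32(<uint32_t>n, r32)
        return (r32[0], r32[1], r32[2], r32[3])
    _four_squares_64(n, r64)
    return (r64[0], r64[1], r64[2], r64[3])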
Replying to @videlec:
I confirm that there is no timing difference between ii = i*i and ii += 2*i + 1.
There of course is; just look at what the (C) compiler produces from it. (I don't know what Cython itself does; it may optimize it, but may also further obfuscate the code.)
On older machines, presumably any 32-bit processor [we support], multiplication (or squaring) is more expensive than addition, increment, and/or a shift. (And its cost depends on the number of 1-bits in the multiplicands.)
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Nope, you have to be more explicit. It of course does some strength reduction, e.g. replacing multiplication/division by powers of two by shifts, additions or nice addressing modes etc., but you cannot expect it to run a theorem prover... ;-)
Hi,
The file is now sum_of_squares.pyx.

I changed the unsigned int to unsigned long as Jeroen suggested. The loss in terms of timing seems to be around 1% (see below). Nevertheless, the function makes no sense for very large input, as it is much faster to call the Python function from arith (see also below).

For the rounding issue with sqrt, it now uses <unsigned long> sqrt(<double> n), and there is an exhaustive test in the doc of the function two_squares_pyx.
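Such an exhaustive check could look roughly like this (my own sketch of its shape, not the actual doctest; it assumes two_squares_pyx returns a pair of integers):

from sage.rings.sum_of_squares import two_squares_pyx
all(sum(x**2 for x in two_squares_pyx(a**2 + b**2)) == a**2 + b**2
    for a in range(200) for b in range(a + 1))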
unsigned int
sage: from sage.rings.sum_of_squares import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11 ms per loop
unsigned long
sage: from sage.rings.sum_of_squares import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11.3 ms per loop
Very large input
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: %timeit two_squares_pyx(2**51+21)
10 loops, best of 3: 88.1 ms per loop
sage: %timeit two_squares(2**51+21)
1000 loops, best of 3: 1.04 ms per loop
Vincent
I wouldn't call 2^51 + 21 "very large", at least not if the functions now take unsigned longs. (Ok, on most current machines that exceeds an unsigned int.)

While we're at it, if n is unsigned long, you should take (unsigned long)sqrtl((long double)n), otherwise you'd probably have to add 1 to the result in case n is a perfect square or slightly above. (On 32-bit machines, where unsigned longs are only 32-bit as well, that doesn't matter.)
Branch pushed to git repo; I updated commit sha1. New commits:
17883e3: trac #16374: replace sqrt by sqrtl
Replying to @nexttime:
I wouldn't call 2^51 + 21 "very large", at least not if the functions now take unsigned longs. (Ok, on most current machines that exceeds an unsigned int.)

"Very large" meant "the function is no more efficient with such input".

While we're at it, if n is unsigned long, you should take (unsigned long)sqrtl((long double)n), otherwise you'd probably have to add 1 to the result in case n is a perfect square or slightly above. (On 32-bit machines, where unsigned longs are only 32-bit as well, that doesn't matter.)
Done.
Vincent
Well... Theeeeeeeeeeeeeeeeeeeeeeen if there is no other comment, can this ticket be flagged as positive_review? I agreed with what I read, but I now can't follow you on the "hardware/programming standard" ground :-)
Nathann
Just to mention: if there is an overflow, the error is very explicit:
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert
Replying to @videlec:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert

I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned longs you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with, I think, a slight advantage for doing the squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
Replying to @nexttime:
Replying to @videlec:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert

I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned longs you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
There are only two places where overflow may occur:
- the conversion to long double for sqrtl (but an unsigned long fits into a long double?);
- the arithmetic inside the search loops: all integers that appear inside the function are smaller than the input, so no trouble here.
Vincent
Replying to @nexttime:
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
I would prefer to keep the multiplication as it is much more readable.
I think the long double is overkill and would revert to using sqrt().

- It decreases portability (sqrtl() is C99, so at least you need to compile in C99 mode and not all systems have this function).
- You will never want to call two_squares in this range (2^53) anyway.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.
- For three_squares and four_squares, there are many solutions, so it doesn't matter if the square root is off by 1.
Also, these functions should be interruptible: http://www.sagemath.org/doc/developer/coding_in_cython.html#interrupt-and-signal-handling
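The pattern from that guide is, schematically (a Cython sketch, not the ticket's code; in 2014-era Sage the two calls came from include "sage/ext/interrupt.pxi", in current Sage from cysignals):

from cysignals.signals cimport sig_on, sig_off

def four_squares_interruptible(n):
    cdef unsigned long a = 0, b = 0, c = 0, d = 0
    sig_on()     # from here on, Ctrl-C raises KeyboardInterrupt instead of hanging
    # ... the long-running C-level search loop goes here ...
    sig_off()    # every sig_on() must be matched by a sig_off() on normal exit
    return (a, b, c, d)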
Replying to @jdemeyer:
I think the long double is overkill and would revert to using sqrt().

- It decreases portability (sqrtl() is C99, so at least you need to compile in C99 mode and not all systems have this function).

:-) Yes, C99 is only 15 years old. (Note that a lot of code in Sage requires C99, and any recent GCC defaults to -std=gnu99.)
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
By the way, IMHO i and j should be unsigned rather than unsigned long. (And as mentioned, I'd rather use uint32_t and uint64_t, or even better, uint_fast32_t instead of the former.)
Note that in

j = <unsigned> sqrt[l](<[long] double> n)
j += 1 - j%2

j still cannot overflow, since sqrt[l]() returns UINT_MAX (which is odd) for n close to ULONG_MAX.

However, I'd limit the input to values far below ULONG_MAX (more precisely, UINT64_MAX) anyway.
- You will never want to call two_squares in this range (2^53) anyway.
Yes, and I'd actually bail out above some cutoff.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.

? You mean (unsigned long)sqrt((double)N) * (unsigned long)sqrt((double)N) <= N (for all N < 2^53, say), or (unsigned long)sqrt((double)(N*N)) == N (for all N < 2^26, say)?

I'm ok with using sqrt() if we limit the input accordingly (to at most 2^53 - 1).
Replying to @nexttime:
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
Yes indeed, I was referring to the math library part of C99.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.

(unsigned long)sqrt((double)(N*N)) == N (for all N < 2^26, say)?

I meant this, which is true even for much larger values of N (as long as N is representable in a double). But we actually require sqrt((double) n)**2 <= n for all inputs n, which is not true with my formula.
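That failure is easy to exhibit in Python (math.isqrt gives the exact integer square root for comparison):

import math
n = 10**16 - 1      # floor sqrt is 99999999, but n itself is above 2**53
int(math.sqrt(n))   # 100000000: n rounds up to 10**16 when converted to double
math.isqrt(n)       # 99999999: the exact integer square root

Here int(math.sqrt(n))**2 > n, which is exactly the violation of sqrt((double) n)**2 <= n.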
Branch pushed to git repo; I updated commit sha1. New commits:
39c83ca: trac #16374: unsigned long -> uint_fast32_t and signal handling
Branch pushed to git repo; I updated commit sha1. New commits:
8ba94c4: trac #16374: fix a signal handling error
Hi,
The file now uses uint_fast32_t and sqrt. There is a check on the input size in the three Cython functions (since on some platforms a uint_fast32_t is actually 64 bits). But I did not know how to test it properly... so it is # not tested.
Please review! Vincent
Small comment: you could make the input type uint32_t and use uint_fast32_t for the computation. Then your exceptions will be platform-independent.
Replying to @jdemeyer:
Small comment: you could make the input type uint32_t and use uint_fast32_t for the computation. Then your exceptions will be platform-independent.
Great. Thanks for the suggestion. Done in the new version, and the tests are updated. I also added a function is_sum_of_square_pyx, which is much faster than using sum_of_squares_pyx with try/except, especially if the answer is False.
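The difference in style is roughly this (a sketch; the function names follow the comment above, and the exception type is assumed to be ValueError):

from sage.rings.sum_of_squares import two_squares_pyx, is_sum_of_square_pyx

def count_fast(bound):
    # predicate style: a negative answer is just a returned False
    return sum(1 for n in range(bound) if is_sum_of_square_pyx(n))

def count_slow(bound):
    # try/except style: every non-representable n pays for raising and catching
    count = 0
    for n in range(bound):
        try:
            two_squares_pyx(n)
            count += 1
        except ValueError:
            pass
    return count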
Hellooooooooo!

I re-read this thing again and I think that it is good to go. I added a small commit in public/16374 to replace 2^32 (which produced bad output in the HTML doc) with 2^{32}.

If you have no objection to that, you can set the ticket to positive_review :-P
Nathann
Reviewer: Jeroen Demeyer, Leif Leonhardy, Nathann Cohen
Changed branch from u/vdelecroix/16374 to public/16374
We implement a much more efficient version of decomposition into sums of squares for small values.
CC: @nathanncohen @jdemeyer @nexttime
Component: number theory
Author: Vincent Delecroix
Branch/Commit: 51e36d8
Reviewer: Jeroen Demeyer, Leif Leonhardy, Nathann Cohen
Issue created by migration from https://trac.sagemath.org/ticket/16374