videlec closed this ticket 10 years ago
Branch: u/vdelecroix/16374
Hi,
I implemented a much better version than my proposal in #16308 (linear complexity in the input for two_squares_pyx). It is not yet hooked into arith.py; I will do this when #16308 is merged.
timings:
sage: from sage.rings.arith_pyx import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
5 loops, best of 3: 46 ms per loop
Best, Vincent
New commits:
7cf5fff: trac #16374: new file arith_pyx.pyx
Branch pushed to git repo; I updated commit sha1. New commits:
f1354e6: trac #16374: two small optimization
Hello!
I added a commit in u/ncohen/16374 which contains some tests and comments.
I have two questions though:
- Why ii <= n/2 and not ii <= jj?
- Why a "while" instead of an "if" in the following code?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero)
while j*j > n:
j -= 1
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :-P)
sage: i = 5
sage: ii = i**2
sage: i**2 == ii
True
sage: ii+=2*i+1
sage: i+=1
sage: i**2 == ii
True
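For illustration, here is roughly how that identity would be used inside a search loop; this is a sketch of the idea only (the bound n and the loop body are hypothetical, not code from the branch):

n = 10**4                # hypothetical example bound
i, ii = 0, 0
while ii <= n:
    # ... test the candidate square ii == i**2 here ...
    ii += 2*i + 1        # (i+1)**2 == i**2 + 2*i + 1: next square by addition
    i += 1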
Nathann
Hi Nathann,
Thanks for reading it.
I added a commit in u/ncohen/16374 which contains some tests and comments.
Great. I also have something important to modify in three_squares_pyx and four_squares_pyx: it is much faster to first remove a huge square and then try to decompose the rest into a sum of fewer squares. The result is not the lexicographically smallest one, but the timings are 4x better.
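In pure Python the strategy looks roughly like this (a sketch under hypothetical names, not the code of the branch; math.isqrt stands in for the C-level square root, and each loop steps the largest square down when the greedy choice leaves a non-decomposable remainder, e.g. 23 = 16 + 7 fails since 7 is not a sum of three squares):

import math

def two_squares(n):
    # try the largest summand first; by symmetry we only need a**2 >= n/2
    a = math.isqrt(n)
    while 2*a*a >= n:
        b = math.isqrt(n - a*a)
        if a*a + b*b == n:
            return (b, a)
        a -= 1
    return None

def three_squares(n):
    c = math.isqrt(n)
    while 3*c*c >= n:
        r = two_squares(n - c*c)
        if r is not None:
            return r + (c,)
        c -= 1
    return None

def four_squares(n):
    d = math.isqrt(n)
    while True:          # Lagrange's theorem guarantees termination with d >= 0
        r = three_squares(n - d*d)
        if r is not None:
            return r + (d,)
        d -= 1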
It seems that you like C function declarations of the form int f(int res[3]). It makes the specification clearer, but after compilation the results are strictly equivalent.
I have two questions though:
- Why ii <= n/2 and not ii <= jj?
Right.
- Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j -= 1
I am not sure of the specification (if any) of sqrt, which deals with doubles... I will try to improve it.
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :-P)
I thought about it. I am not sure that ii += 2*i + 1 is cheaper than ii = i*i. I will have a look.
Vincent
Branch pushed to git repo; I updated commit sha1. New commits:
a2b31ca: trac #16374: reviewer comment + further optimization
Hi,
I confirm that there is no timing difference between ii = i*i and ii += 2*i + 1. Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3". The new commit contains the modifications addressing your comments and mine.
New timings:
sage: from sage.rings.arith_pyx import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11.3 ms per loop
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Vincent
Changed dependencies from #16308 to none
Instead of adding a new module arith_pyx, either give the module a more descriptive name or add these functions to fast_arith.pyx instead.

Replying to @videlec:
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Sure, write the most readable code and let the compiler do its job.
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Even better: make these two tickets independent and then make a third ticket to integrate them.
I would recommend to use unsigned long instead of instead int. I could imagine that the sum of 4 squares function is useful for numbers larger than 2^32.
Also a typo: plateform -> platform
Replying to @jdemeyer:
I would recommend to use unsigned long instead of instead int. I could imagine that the sum of 4 squares function is useful for numbers larger than 2^32.
Of course. But for large entries, we had better use the *_squares from #16308. I will do some timings, but if we lose something on small integers I will not change it.
Also a typo: plateform -> platform
Thank you. I will modify it.
What's the point of <unsigned int> sqrt(<double> n + .5)? Are you worried that the libm sqrt function does not round correctly? I believe that IEEE-754 requires this.
Replying to @jdemeyer:
What's the point of <unsigned int> sqrt(<double> n + .5)? Are you worried that the libm sqrt function does not round correctly? I believe that IEEE-754 requires this.
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?
Replying to @videlec:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?
It is certainly not a compiler issue, but a libm/hardware issue. I would expect IEEE-754 and would remove the + 0.5.
Replying to @jdemeyer:
Replying to @videlec:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. Do you think it is safe to remove this .5?

It is certainly not a compiler issue, but a libm/hardware issue. I would expect IEEE-754 and would remove the + 0.5.
Yes. Integers up to 2^53 - 1 can be represented exactly in a double, at least on the plat[e]forms we support.
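A quick check of that boundary in Python (2**53 is the first integer where the double mantissa runs out):

float(2**53 - 1) == 2**53 - 1    # True: every integer below 2**53 is an exact double
float(2**53 + 1) == 2**53 + 1    # False: 2**53 + 1 rounds to 2**53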
I like the type instead int btw. (Google would have asked: "Did you mean instant int?", or probably "insteady int"?)
Regarding that, I'd write versions for / using integers of known, platform-independent width, that is, use uint8_t, uint16_t, uint32_t etc., with a wrapper function that delegates to the appropriate function, or use if-then-elif-else such that the compiler is aware of the exact range of numbers. (Separate functions could more easily be checked for overflow conditions as well.)
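A Cython skeleton of that delegation idea (the names and the stubbed-out bodies are mine, purely illustrative):

from libc.stdint cimport uint32_t, uint64_t

cdef int _four_squares_32(uint32_t n, uint32_t res[4]):
    # ... search using 32-bit arithmetic, fast on every platform ...
    return 0

cdef int _four_squares_64(uint64_t n, uint64_t res[4]):
    # ... the same search using 64-bit arithmetic ...
    return 0

def four_squares_dispatch(unsigned long n):
    cdef uint32_t r32[4]
    cdef uint64_t r64[4]
    if n <= <uint64_t>0xffffffff:    # input fits in 32 bits: take the narrow path
        _four_squares_32(<uint32_t>n, r32)
        return (r32[0], r32[1], r32[2], r32[3])
    _four_squares_64(n, r64)
    return (r64[0], r64[1], r64[2], r64[3])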
Replying to @videlec:
I confirm that there is no timing difference between ii = i*i and ii += 2*i + 1.
There of course is; just look at what the (C) compiler produces from it. (I don't know what Cython itself does; it may optimize it, but may also further obfuscate the code.)
On older machines, presumably any 32-bit processor [we support], multiplication (or squaring) is more expensive than addition, increment, and/or a shift. (And its cost depends on the number of 1-bits in the multiplicands.)
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Nope, you have to be more explicit. It of course does some strength reduction, e.g. replacing multiplication/division by powers of two by shifts, additions or nice addressing modes etc., but you cannot expect it to run a theorem prover... ;-)
Hi,
The file is now sum_of_squares.pyx.

I changed the unsigned int to unsigned long as Jeroen suggested. The loss in terms of timing seems to be around 1% (see below). Nevertheless, the function makes no sense for very large input, as it is much faster to call the Python function from arith (see also below).

For the rounding issue with sqrt, it now uses <unsigned long> sqrt(<double> n), and there is an exhaustive test in the doc of the function two_squares_pyx.
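Such an exhaustive check could look roughly like this (my own sketch of its shape, not the actual doctest; it assumes two_squares_pyx returns a pair of integers):

from sage.rings.sum_of_squares import two_squares_pyx
all(sum(x**2 for x in two_squares_pyx(a**2 + b**2)) == a**2 + b**2
    for a in range(200) for b in range(a + 1))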
unsigned int
sage: from sage.rings.sum_of_squares import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11 ms per loop
unsigned long
sage: from sage.rings.sum_of_squares import four_squares_pyx
sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)")
25 loops, best of 3: 11.3 ms per loop
Very large input
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: %timeit two_squares_pyx(2**51+21)
10 loops, best of 3: 88.1 ms per loop
sage: %timeit two_squares(2**51+21)
1000 loops, best of 3: 1.04 ms per loop
Vincent
I wouldn't call 2^51 + 21 "very large", at least not if the functions now take unsigned longs. (Ok, on most current machines that exceeds an unsigned int.)

While we're at it, if n is unsigned long, you should take (unsigned long)sqrtl((long double)n), otherwise you'd probably have to add 1 to the result in case n is a perfect square or slightly above. (On 32-bit machines, where unsigned longs are only 32-bit as well, that doesn't matter.)
Branch pushed to git repo; I updated commit sha1. New commits:
17883e3: trac #16374: replace sqrt by sqrtl
Replying to @nexttime:
I wouldn't call 2^51 + 21 "very large", at least not if the functions now take unsigned longs. (Ok, on most current machines that exceeds an unsigned int.)

"Very large" meant "the function is no more efficient with such input".

While we're at it, if n is unsigned long, you should take (unsigned long)sqrtl((long double)n), otherwise you'd probably have to add 1 to the result in case n is a perfect square or slightly above. (On 32-bit machines, where unsigned longs are only 32-bit as well, that doesn't matter.)
Done.
Vincent
Well... Theeeeeeeeeeeeeeeeeeeeeeen if there is no other comment, can this ticket be flagged as positive_review? I agreed with what I read, but I now can't follow you on the "hardware/programming standard" ground :-)
Nathann
Just to mention: if there is an overflow, the error is very explicit:
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert
Replying to @videlec:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert

I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned longs you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with, I think, a slight advantage for doing the squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
Replying to @nexttime:
Replying to @videlec:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx
sage: two_squares_pyx(2**65+19)
Traceback (most recent call last):
...
OverflowError: long int too large to convert

I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned longs you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
There are only two places where overflow may occur:
- the conversion to long double for sqrtl (but an unsigned long fits into a long double?);
- the arithmetic inside the search loops: all integers that appear inside the function are smaller than the input, so no trouble here.
Vincent
Replying to @nexttime:
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
I would prefer to keep the multiplication as it is much more readable.
I think the long double is overkill and would revert to using sqrt().

- It decreases portability (sqrtl() is C99, so at least you need to compile in C99 mode and not all systems have this function).
- You will never want to call two_squares in this range (2^53) anyway.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.
- For three_squares and four_squares, there are many solutions, so it doesn't matter if the square root is off by 1.
Also, these functions should be interruptible: http://www.sagemath.org/doc/developer/coding_in_cython.html#interrupt-and-signal-handling
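The pattern from that guide is, schematically (a Cython sketch, not the ticket's code; in 2014-era Sage the two calls came from include "sage/ext/interrupt.pxi", in current Sage from cysignals):

from cysignals.signals cimport sig_on, sig_off

def four_squares_interruptible(n):
    cdef unsigned long a = 0, b = 0, c = 0, d = 0
    sig_on()     # from here on, Ctrl-C raises KeyboardInterrupt instead of hanging
    # ... the long-running C-level search loop goes here ...
    sig_off()    # every sig_on() must be matched by a sig_off() on normal exit
    return (a, b, c, d)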
Replying to @jdemeyer:
I think the long double is overkill and would revert to using sqrt().

- It decreases portability (sqrtl() is C99, so at least you need to compile in C99 mode and not all systems have this function).

:-) Yes, C99 is only 15 years old. (Note that a lot of code in Sage requires C99, and any recent GCC defaults to -std=gnu99.)
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
By the way, IMHO i and j should be unsigned rather than unsigned long. (And as mentioned, I'd rather use uint32_t and uint64_t, or even better, uint_fast32_t instead of the former.)
Note that in

j = <unsigned> sqrt[l](<[long] double> n)
j += 1 - j%2

j still cannot overflow, since sqrt[l]() returns UINT_MAX (which is odd) for n close to ULONG_MAX.

However, I'd limit the input to values far below ULONG_MAX (more precisely, UINT64_MAX) anyway.
- You will never want to call two_squares in this range (2^53) anyway.
Yes, and I'd actually bail out above some cutoff.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.

? You mean (unsigned long)sqrt((double)N) * (unsigned long)sqrt((double)N) <= N (for all N < 2^53, say), or (unsigned long)sqrt((double)(N*N)) == N (for all N < 2^26, say)?

I'm ok with using sqrt() if we limit the input accordingly (to at most 2^53 - 1).
Replying to @nexttime:
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
Yes indeed, I was referring to the math library part of C99.
- I have to check the details, but I think that <unsigned long> sqrt(<double> n) is actually sufficiently precise that it computes the exact integer square root.

(unsigned long)sqrt((double)(N*N)) == N (for all N < 2^26, say)?

I meant this, which is true even for much larger values of N (as long as N is representable in a double). But we actually require sqrt((double) n)**2 <= n for all inputs n, which is not true with my formula.
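That failure is easy to exhibit in Python (math.isqrt gives the exact integer square root for comparison):

import math
n = 10**16 - 1      # floor sqrt is 99999999, but n itself is above 2**53
int(math.sqrt(n))   # 100000000: n rounds up to 10**16 when converted to double
math.isqrt(n)       # 99999999: the exact integer square root

Here int(math.sqrt(n))**2 > n, which is exactly the violation of sqrt((double) n)**2 <= n.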
Branch pushed to git repo; I updated commit sha1. New commits:
39c83ca: trac #16374: unsigned long -> uint_fast32_t and signal handling
Branch pushed to git repo; I updated commit sha1. New commits:
8ba94c4: trac #16374: fix a signal handling error
Hi,
The file now uses uint_fast32_t and sqrt. There is a check on the input size in the three Cython functions (since on some platforms a uint_fast32_t is actually 64 bits). But I did not know how to test it properly... so it is # not tested.
Please review! Vincent
Small comment: you could make the input type uint32_t and use uint_fast32_t for the computation. Then your exceptions will be platform-independent.
Replying to @jdemeyer:
Small comment: you could make the input type uint32_t and use uint_fast32_t for the computation. Then your exceptions will be platform-independent.
Great. Thanks for the suggestion. Done in the new version, and the tests are updated. I also added a function is_sum_of_square_pyx, which is much faster than using sum_of_squares_pyx with try/except, especially if the answer is False.
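The difference in style is roughly this (a sketch; the function names follow the comment above, and the exception type is assumed to be ValueError):

from sage.rings.sum_of_squares import two_squares_pyx, is_sum_of_square_pyx

def count_fast(bound):
    # predicate style: a negative answer is just a returned False
    return sum(1 for n in range(bound) if is_sum_of_square_pyx(n))

def count_slow(bound):
    # try/except style: every non-representable n pays for raising and catching
    count = 0
    for n in range(bound):
        try:
            two_squares_pyx(n)
            count += 1
        except ValueError:
            pass
    return count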
Hellooooooooo!

I re-read this thing again and I think that it is good to go. I added a small commit in public/16374 to replace 2^32 (which produced bad output in the HTML doc) with 2^{32}.

If you have no objection to that, you can set the ticket to positive_review :-P
Nathann
Reviewer: Jeroen Demeyer, Leif Leonhardy, Nathann Cohen
Changed branch from u/vdelecroix/16374 to public/16374
We implement a much more efficient version of decomposition into sums of squares for small values.
CC: @nathanncohen @jdemeyer @nexttime
Component: number theory
Author: Vincent Delecroix
Branch/Commit: 51e36d8
Reviewer: Jeroen Demeyer, Leif Leonhardy, Nathann Cohen
Issue created by migration from https://trac.sagemath.org/ticket/16374