python / cpython

The Python programming language
https://www.python.org

Use divide-and-conquer for faster factorials #52938

Closed cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e closed 14 years ago

cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago
BPO 8692
Nosy @rhettinger, @mdickinson, @abalkin
Files
  • factorial4.py
  • factorial-no-recursion.patch
  • factorial-test.patch: Improve unit tests for math.factorial
  • factorial-precompute-partials.patch
  • factorial-speed.sh: Script to test factorial speed for several n
  • factorial.py
  • factorial.patch: Divide-and-conquer factorial algorithm
  • factorial2.patch
  • factorial3.patch
  • factorial4.patch
  • unnamed
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = 'https://github.com/mdickinson'
    closed_at =
    created_at =
    labels = ['extension-modules', 'performance']
    title = 'Use divide-and-conquer for faster factorials'
    updated_at =
    user = 'https://bugs.python.org/stutzbach'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'mark.dickinson'
    assignee = 'mark.dickinson'
    closed = True
    closed_date =
    closer = 'mark.dickinson'
    components = ['Extension Modules']
    creation =
    creator = 'stutzbach'
    dependencies = []
    files = ['17320', '17321', '17326', '17328', '17329', '17343', '17345', '17348', '17351', '17352', '17354']
    hgrepos = []
    issue_num = 8692
    keywords = ['patch', 'needs review']
    message_count = 72.0
    messages = ['105537', '105548', '105551', '105552', '105553', '105557', '105559', '105561', '105562', '105563', '105564', '105565', '105568', '105579', '105590', '105592', '105593', '105594', '105596', '105597', '105600', '105602', '105603', '105604', '105607', '105610', '105653', '105655', '105657', '105660', '105661', '105663', '105664', '105665', '105666', '105668', '105678', '105679', '105680', '105681', '105682', '105683', '105684', '105689', '105691', '105715', '105729', '105731', '105754', '105756', '105757', '105759', '105760', '105770', '105777', '105782', '105783', '105784', '105797', '105801', '105806', '105807', '105808', '105811', '105812', '105813', '105814', '105815', '105816', '105817', '105818', '105851']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'mark.dickinson', 'belopolsky', 'draghuram', 'stutzbach', 'Alexander.Belopolsky']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue8692'
    versions = ['Python 3.2']
    ```

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    (Making an educated guess about who to add to the Nosy list)

    Attached is a patch to improve math.factorial by using a divide-and-conquer algorithm. The old algorithm performs n-1 multiplications, mostly on numbers with a very large number of digits.

    The algorithm in this patch computes the product of the odd numbers up to n by divide-and-conquer, multiplying partial products of roughly equal size, and then shifts the result left by the number of factors of 2 in n!, which is n minus the number of set bits in n.

    There are faster factorial algorithms available, but they're significantly more complicated and rely on a fast prime factorization function. This one is around 125 lines of code in C (with comments). I have a pure-Python version that's around 25 lines of code, if anyone is interested.
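
    For anyone interested, here is a rough pure-Python sketch of the same divide-and-conquer scheme (illustrative only; it is not the attached factorial.py, and the splitting details may differ slightly from the C patch):

    import math

    def count_set_bits(n):
        """Number of '1' bits in the binary expansion of n."""
        return bin(n).count("1")

    def partial_product(start, stop):
        """Product of the odd numbers in range(start, stop, 2), by binary
        splitting; start and stop are odd with start <= stop."""
        numfactors = (stop - start) >> 1
        if numfactors == 0:
            return 1
        if numfactors == 1:
            return start
        mid = (start + numfactors) | 1
        return partial_product(start, mid) * partial_product(mid, stop)

    def factorial(n):
        """n! as (product of odd factors) << (n - popcount(n))."""
        inner = outer = 1
        for i in reversed(range(n.bit_length())):
            inner *= partial_product(((n >> (i + 1)) + 1) | 1,
                                     ((n >> i) + 1) | 1)
            outer *= inner
        return outer << (n - count_set_bits(n))

    assert all(factorial(k) == math.factorial(k) for k in range(200))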

    Here are some timing results for different values of n:

    n       old algorithm   new algorithm
    1       0.14 us         0.10 us
    10      0.63 us         0.12 us
    13      0.81 us         0.76 us
    100     12.6 us         4.92 us
    1k      576 us          118 us
    10k     53.6 ms         8.16 ms
    100k    12.1 s          443 ms
    1M      27 min          23 s
    10M     forget it       20 min

    I tested that both algorithms return the same answer for all values of n up to 10,000.

    abalkin commented 14 years ago

    I've noticed that your patch changes

    >>> math.factorial(2.**63)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python int too large to convert to C long

    to

    >>> math.factorial(2.**63)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: factorial() not defined for negative values

    While the error message is wrong in both cases, I think OverflowError is a better exception in this case and there should not be a difference between math.factorial(2.**63) and math.factorial(2**63) behavior.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Tue, May 11, 2010 at 4:18 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    While the error message is wrong in both cases, I think OverflowError is a better exception in this case and there should not be a difference between math.factorial(2.**63) and math.factorial(2**63) behavior.

    Good catch!  I will fix that tomorrow.

    mdickinson commented 14 years ago

    Yes, I'm interested in seeing the pure Python version. It could go into test_math, and would be a useful form of documentation.

    Are there sufficient tests already in test_math.py to exercise the code thoroughly, or are more needed?

    I'll try to find time to do a thorough code review in the next few days.

    abalkin commented 14 years ago

    I also started to wonder if a tighter upper limit for an acceptable argument can be found.

    In discussion of bpo-2138 I saw the following exchange:

    """

    Should there be some upper limit on the argument math.factorial would take?

    I'd say not. Any such limit would be artificial, and an arbitrary choice. Just let the natural time and space requirements be the limiting factor.
    """ - msg62541 - Mark Dickinson

    Still, the original and proposed implementations bail out if n is larger than the system LONG_MAX. This is not a real limitation, because the result for LONG_MAX! would exceed the number of digits that a Python long integer can hold.

    It seems to me that the value of n for which number of digits will exceed sys.maxsize can be estimated fairly accurately using Stirling formula. Only two values are relevant in practice - one for sys.maxsize = 2**63-1 and the other for sys.maxsize = 2**31-1. These values can be hardcoded and factorial can quickly report the case when n! will exceed maxsize digits instead of hanging until memory is exhausted.

    mdickinson commented 14 years ago

    On Tue, May 11, 2010 at 11:33 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    It seems to me that the value of n for which number of digits will exceed sys.maxsize can be estimated fairly accurately using Stirling formula.  Only two values are relevant in practice - one for sys.maxsize = 2**63-1 and the other for sys.maxsize = 2**31-1.  These values can be hardcoded and factorial can quickly report the case when n! will exceed maxsize digits instead of hanging until memory is exhausted.

    Sure, bailing out for ridiculously large arguments sounds fine to me. On a 64-bit machine, there can be at most 2**61 4-byte digits, each digit containing 30 bits of the long. So the maximum representable long (under the implausible assumption that someone could actually find 2**63 bytes of storage) would be around 2**(30*2**61). The following quick search gives me a value of around 1.18e18 for the first n such that n! exceeds this value:

    from math import log, lgamma
    
    def bisect(f, a, b):
        c = (a + b)/2.0
        while a != c and b != c:
            a, b = (a, c) if f(c) else (c, b)
            c = (a + b)/2.0
        return c
    
    BOUND = 2**62*15*log(2)
    print(bisect(lambda x: lgamma(x) > BOUND, 2.0, 1e30))
    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Tue, May 11, 2010 at 5:33 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    It seems to me that the value of n for which number of digits will exceed sys.maxsize can be estimated fairly accurately using Stirling formula.  Only two values are relevant in practice - one for sys.maxsize = 2**63-1 and the other for sys.maxsize = 2**31-1.  These values can be hardcoded and factorial can quickly report the case when n! will exceed maxsize digits instead of hanging until memory is exhausted.

    Isn't that adding an extra check in every case to speed up a you-can't-seriously-expect-that-to-work corner case?

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 7:42 PM, Daniel Stutzbach <report@bugs.python.org> wrote: ..

    Isn't that adding an extra check in every case to speed up a you-can't-seriously-expect-that-to-work corner case?

    The check is cheap - just a machine integer comparison, so I would not even take that cost into account. In my view math.factorial() is primarily of interest in educational settings where it is quite likely that someone would be curious enough to pass sys.maxsize to it.

    The main value in setting a theoretically justified limit is that overflow exception can carry a meaningful message, e.g. "factorial result would have too many digits", rather than an unhelpful "Python int too large to convert to C long".

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 7:42 PM, Daniel Stutzbach <report@bugs.python.org> wrote: ..

    Isn't that adding an extra check in every case ...

    Speaking of micro-optimizations, did you consider a better-than-naive algorithm for "Count the number of set bits in n" in your patch? HAKMEM 169 comes to mind, and being divide-and-conquer itself, it seems like a good fit. Certainly overkill if used just for math.factorial(), but this is probably a useful function to have around.
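
    For reference, the usual 32-bit SWAR bit count (divide-and-conquer in the same spirit, though not HAKMEM 169's modulo-63 trick) looks like this in Python (an illustrative sketch):

    def popcount32(x):
        """Count set bits in a 32-bit value by summing in parallel:
        2-bit sums, then 4-bit sums, 8-bit sums, and one final byte sum."""
        x = x - ((x >> 1) & 0x55555555)
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
        x = (x + (x >> 4)) & 0x0F0F0F0F
        return ((x * 0x01010101) & 0xFFFFFFFF) >> 24

    assert popcount32(0) == 0
    assert popcount32(0xFFFFFFFF) == 32
    assert popcount32(0b101101) == 4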

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Tue, May 11, 2010 at 7:15 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    The main value in setting a theoretically justified limit is that overflow exception can carry a meaningful message, e.g. "factorial result would have too many digits", rather than an unhelpful "Python int too large to convert to C long".

    I'm pretty sure this is an orthogonal issue to speeding up math.factorial. If you want to improve the error and/or impose a tighter maximum limit on n, would you mind opening it as a separate issue?

    I like extra checks and I like speed, but I can't think about adding extra checks and a speed patch at the same time. ;-)

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Tue, May 11, 2010 at 7:38 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    Speaking of micro-optimizations, did you consider a better than naive algorithm for "Count the number of set bits in n" in your patch? HAKMEM 169 comes to mind and being a divide and conquer too, it seems like a good fit.   Certainly an overkill if used just for math.factorial(), but this is probably a useful function to have around.

    I considered it, but decided to stick with code readability and portability. Counting the number of set bits is only done once per factorial, so it's not on the critical path.

    FWIW, the following page has a pretty extensive summary of performance comparisons of different solutions to the "count the set bits" problem: http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 9:07 PM, Daniel Stutzbach <report@bugs.python.org> wrote:

    > Speaking of micro-optimizations, did you consider a better than naive algorithm for "Count the number of set bits in n" in your patch? ..

    I considered it, but decided to stick with code readability and portability.

    Speaking of readability, with a separate popcount() function, you can simply write

    nminusnumbits_ob = PyLong_FromLong(n - popcount(n))

    and eliminate not only the loop, but also num_bits and tmp variables from math_factorial()

    The popcount function can be defined as __builtin_popcount on GCC and as your loop otherwise.

     Counting the number of set bits is only done once per factorial, so it's not on the critical path.

    I agree, performance considerations are completely irrelevant here.

    Similarly, while unlikely to improve performance, I would prefer a bit-trick implementation of ilog2 (in a separate function, of course) over calling floating point log2. In my head, an assignment of a floating point result to an integer variable always raises a red flag.

    Another readability nit: for me k % 2 == 0 is a more readable check for even number than (k & 1) != 1. Performance-wise the two choices are the same, and either can be improved by combining k = (n + m) / 2 and k & 1 into one ldiv call.

    I have not tried to do it, but my gut feeling is that factorial_part_product() can benefit from passing semi-open rather than closed interval. (Also renaming n and m to start and stop in this case will help understanding.)

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 10:19 PM, Alexander Belopolsky <report@bugs.python.org> wrote: ..

    Similarly, while unlikely to improve performance, I would prefer a bit-trick implementation of ilog2 (in a separate function, of course) over calling floating point log2.  In my head, an assignment of a floating point result to an integer variable always raises a red flag.

    Searching for relevant past issues, I've come across a similar sentiment from Mark:

    """ floor(log(n, 2)) is poor code. This is not supposed to be a dramatic statement, just a statement of fact. Its correctness is dependent on minute details of floating point. It is poor code in exactly the same way that "while x \< 1.0: x += 0.1" is poor code---behaviour in boundary cases is almost entirely unpredictable. """ - msg78066 - Mark Dickinson -

    I also noticed that the reference implementation does not require this calculation because the loop is implemented recursively. Did you find recursive implementation to give worse performance?

    mdickinson commented 14 years ago

    Some quick comments:

    (1) Agree about the extra bound checks: let's treat those as a separate issue and just concentrate on performance here.

    (2) log2 isn't portable: it's not part of C89 (though it is in C99) and I don't think it exists on Windows. Which is a shame, since it probably *does* reliably work well for boundary cases on most platforms. I'm embarrassed to read that snippet that Alexander found, but it's true that an alternative like log(n)/log(2) has problems in boundary cases, thanks to the usual floating-point issues. There's a bit-counting method in the int.bit_length() implementation (in Objects/longobject.c) that could possibly be re-used here. Alternatively, if a simple for-loop to count bits doesn't have any noticeable impact on speed, then we could use that.

    (3) Is the 'count set bits' code a bottleneck? If not, then it looks fine to me as it is. Doesn't it just get called once per factorial computation?

    (4) I wonder whether the recursion in factorial_part_product slows things down; it might be interesting to compare with an iterative version (but still one that clumps together small pieces rather than doing lots of small*big multiplies). It seems possible that the cost of the recursive calls is insignificant compared to the cost of the various Py* calls, though.

    (5) Was there a reason for using long rather than unsigned long for the computations? Using unsigned long would give you an easily computable multiply_cutoff, and save the need for that extra static variable (it could be a macro instead).

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 5:58 PM, Mark Dickinson <report@bugs.python.org> wrote:

    Yes, I'm interested in seeing the pure Python version.

    Here is my translation of the reference implementation.

     It could go into test_math, and would be a useful form of documentation.

    Note that I copied the reference implementation recursive logic rather than that in the proposed patch. It may be better for documentation this way.

    If we end up using something like this in documentation, I would rename nminusnumofbits() to something more readable. Maybe "ntwos" or "count_trailing_zeros", with an explanation of why the number of factors of 2 in factorial(n) is n - popcount(n).
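
    A quick sanity check of that identity, in case it helps the explanation (illustrative):

    import math

    def twos_in_factorial(n):
        """Exponent of 2 in n!, counted directly."""
        count, m = 0, math.factorial(n)
        while m % 2 == 0:
            m //= 2
            count += 1
        return count

    # The number of factors of 2 in n! equals n minus the popcount of n.
    assert all(twos_in_factorial(n) == n - bin(n).count("1")
               for n in range(1, 300))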

    abalkin commented 14 years ago

    On Tue, May 11, 2010 at 10:19 PM, Alexander Belopolsky <report@bugs.python.org> wrote: ..

    Another readability nit:  for me k % 2 == 0 is a more readable check for even number than (k & 1) != 1.  Performance-wise the two choices are the same, and either can be improved by combining k = (n + m) / 2 and k & 1 into one ldiv call.

    Strike this comment. For some reason I thought GCC would optimize division by 2 and inline ldiv, but apparently neither is true.

    Still,

    if ((k & 1) != 1) k = k - 1;

    looks odd to me. Maybe k += (k & 1) - 1?
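
    (For what it's worth, a two-line check that the suggested form matches the original:)

    # k += (k & 1) - 1 leaves odd k alone and turns even k into k - 1.
    for k in range(1, 20):
        assert k + (k & 1) - 1 == (k if k % 2 else k - 1)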

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Thank you both for the valuable feedback. I'm working on a revised patch that incorporates your many suggestions.

    I decided to use an iterative version for two reasons:

    I agree that my use of log2() was an ugly hack. ;-) I'll fix that.

    Would it be worthwhile to create a pybits.h and .c that defines _Py_FindLastSetBit and _Py_CountSetBits? (with appropriate logic in the .h and configure.in to use system/compiler versions if available)

    There are already two implementations of find-last-set-bit in Python: bits_in_digit() in Objects/longobject.c and hi0bits() in Python/dtoa.c. It would be nice to consolidate them.

    (hi0bits counts the leading 0 bits which is a trivial transformation of finding the highest set bit)

    I don't know of anyplace else in Python that needs count-set-bits.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Wed, May 12, 2010 at 10:31 AM, Alexander Belopolsky <report@bugs.python.org> wrote:

    if ((k & 1) != 1)
        k = k - 1;

    looks odd to me. Maybe k += (k & 1) - 1?

    If we change the logic slightly to put the odd entry on the right instead of the left, we can do:

            k = (n + m) / 2;
            k |= 1; /* Ensure that k is odd */
            left = factorial_part_product(n, k-2);
            if (left == NULL) goto done;
            right = factorial_part_product(k, m);
            if (right == NULL) goto done;

    That will split 1*3*5*7*9*11 into (1*3*5) (7*9*11), just like the old code. It will split 1*3*5*7*9 into (1*3) (5*7*9) while the old code did (1*3*5) (7*9), which is fine.

    It's easier to read and fewer operations. :-)
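
    (A tiny check of the split behaviour described above:)

    # k = (n + m)//2 then k |= 1, splitting [n, m] into [n, k-2] and [k, m]:
    def split(n, m):
        k = ((n + m) // 2) | 1
        return (n, k - 2), (k, m)

    assert split(1, 11) == ((1, 5), (7, 11))   # (1*3*5) * (7*9*11)
    assert split(1, 9) == ((1, 3), (5, 9))     # (1*3)   * (5*7*9)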

    mdickinson commented 14 years ago

    Now that I've looked at the patch properly:

    I'm +1 on including some version of this patch. At the time that the original math.factorial went in (see bpo-2138) we were hurrying a bit to get it in before the first 2.6 beta, so ended up with a simple implementation, with the understanding (I think) that it could be improved later.

    It looks like you and Alexander are doing a great job of hammering out the fine detail; I only have a few comments at this stage. I predict that you're not going to like the first one ;-). The others are just technical issues.

    (1) In the interests of simplicity, please could we lose the 'long' optimization in factorial_part_product? That is, get rid of the if (m == n+2) branch, and just let that case recurse normally---which means that we end up calling PyNumber_Multiply in some cases instead of doing a C long by C long multiplication. Then we can get rid of multiply_cutoff entirely. I'm +1 on the improved algorithm, and I realize that the optimization does have an effect (some unscientific tests showed me an 18% speed increase in typical cases) but for me this optimization goes past the simplicity/speed tradeoff. There's always the option of adding something like this back in later, once the new algorithm's gone in.

    (2) You're missing a Py_DECREF(part) in factorial_loop, so factorial(n) leaks references (for n > 12).

    (3) The line "k = (n + m) / 2;" in factorial_part_product invokes undefined behaviour (from signed overflow) if n and m are large. We're not going to get meaningful results in this case anyway, but UB should be avoided if at all possible. Perhaps rewrite this as "k = n + (m - n) / 2;"?

    (4) And please do restore the PyLong_FromDouble line in the main routine, rather than using a C double-to-long cast. The direct conversion again leads to undefined behaviour for large doubles (cf. C99 6.3.1.4,p2).

    mdickinson commented 14 years ago

    (cf. C99 6.3.1.4,p2).

    Oops. C99 6.3.1.4,p1. That'll teach me not to cite chapter and verse.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Wed, May 12, 2010 at 12:59 PM, Mark Dickinson <report@bugs.python.org> wrote:

    (4) And please do restore the PyLong_FromDouble line in the main routine, rather than using a C double-to-long cast.  The direct conversion again leads to undefined behaviour for large doubles (cf. C99 6.3.1.4,p2).

    I was planning to add a "if (dx > (double) LONG_MAX)" check. Would that be sufficient?

    mdickinson commented 14 years ago

    I was planning to add a "if (dx > (double) LONG_MAX)" check. Would that be sufficient?

    Hmm. It's subtle. On an LP64 machine, LONG_MAX will be 2**63-1, which isn't exactly representable as a double. So (double) LONG_MAX would likely be 2.0**63 exactly (depending on rounding mode, but round-half-to-even is probably a safe assumption unless someone's deliberately messing around). Then that check would fail for dx == 2.**63 exactly.

    Turn it into '>=' rather than '>', and I *think* it's okay.
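
    To spell out the boundary case (illustrative, assuming LONG_MAX == 2**63 - 1):

    # LONG_MAX is not exactly representable as a double; under
    # round-half-to-even it rounds up to 2.0**63.
    LONG_MAX = 2**63 - 1
    assert float(LONG_MAX) == 2.0**63

    dx = 2.0**63                        # too big for a C long, but...
    assert not (dx > float(LONG_MAX))   # ...a '>' check lets it through,
    assert dx >= float(LONG_MAX)        # while '>=' catches it.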

    abalkin commented 14 years ago

    On Wed, May 12, 2010 at 4:50 AM, Mark Dickinson wrote: ..

    (4) I wonder whether the recursion in factorial_part_product slows things down;  it might be interesting to compare with an iterative version (but still one that clumps together small pieces rather than doing lots of small*big multiplies).  It seems possible that the cost of the recursive calls is insignificant compared to the cost of the various Py* calls, though.

    I am attaching a little study of three different part_product implementations in Python: the recursive one, straight product, and non-recursive binary division:

    $ ./python.exe -m timeit -s "import factorial3 as fm;
    fm.partial_product = fm.partial_product; f = fm.factorial " "f(10000)"
    10 loops, best of 3: 66.1 msec per loop
    $ ./python.exe -m timeit -s "import factorial3 as fm;
    fm.partial_product = fm.partial_product1; f = fm.factorial "
    "f(10000)"
    10 loops, best of 3: 67.6 msec per loop
    $ ./python.exe -m timeit -s "import factorial3 as fm;
    fm.partial_product = fm.partial_product2; f = fm.factorial "
    "f(10000)"
    10 loops, best of 3: 43.4 msec per loop

    The last one seems to be a clear winner, but I am not certain where the gain comes from: no recursion, or first-by-last instead of ith-by-(i+1)st multiplication. Also, Python recursion overhead is probably different from C's.

    mdickinson commented 14 years ago

    Interesting---thanks for the analysis!

    Realistically though, I don't see an iterative version of factorial_part_product as an option for the C patch, without a significant increase in complexity. Daniel's current patch is remarkably clean and simple, and I'd like to keep it that way.

    I did think about various evil schemes for an iterative version, but didn't come up with anything I'd want to see in the Python codebase. (The worst of these schemes involved using a union of long and PyObject to try to increase the possibilities for doing simple C long multiplication, and using the fact that you can easily tell the difference between an odd long and a (4-byte aligned) PyObject just by looking at the last bit. But I'm fairly sure that comes under the 'evil' heading. :) )

    abalkin commented 14 years ago

    Here is one more datapoint.

    $ ./python.exe -m timeit -s "import factorial4 as fm;
    fm.partial_product = fm.partial_product; f = fm.factorial " "f(10000)"
    10 loops, best of 3: 66.1 msec per loop
    [32794 refs]
    $ ./python.exe -m timeit -s "import factorial4 as fm;
    fm.partial_product = fm.partial_product3; f = fm.factorial "
    "f(10000)"
    10 loops, best of 3: 63 msec per loop
    [32794 refs]
    $ ./python.exe -m timeit -s "import factorial4 as fm;
    fm.partial_product = fm.partial_product2; f = fm.factorial "
    "f(10000)"
    10 loops, best of 3: 43.3 msec per loop

    partial_product3 multiplies adjacent numbers instead of first by last. I am not sure it reproduces the order of multiplication in the recursive version exactly, but it does show that the order of multiplication matters a lot.

    I wonder if one could write an elegant recursive version that would multiply first by last in partial_product.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    On Wed, May 12, 2010 at 3:34 PM, Alexander Belopolsky <report@bugs.python.org> wrote:

    I wonder if one could write an elegant recursive version that would multiply first by last in partial_product.

    That could be done with a three-parameter partial_product, where the third parameter designates how many numbers to multiply. Something like this:

    part_product(9, 17, 5) -> 9*11*13*15*17
    part_product(9, 17, 2) -> 9*17
    part_product(11, 15, 3) -> 11*13*15
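
    A rough sketch matching those example splits (illustrative only, not tested for speed):

    def part_product(first, last, count):
        """Product of `count` odd numbers first, first+2, ..., last."""
        if count == 1:
            return first
        if count == 2:
            return first * last
        # Multiply the outermost pair, then recurse on the interior.
        return first * last * part_product(first + 2, last - 2, count - 2)

    assert part_product(9, 17, 5) == 9 * 11 * 13 * 15 * 17
    assert part_product(11, 15, 3) == 11 * 13 * 15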

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    After experimenting with changing the order of the multiplications and not having much luck, I went back and looked for other differences in Alexander's Python functions that might cause the speed difference. I believe partial_product2 is fast because it performs all of its operations in-place using a single list, whereas partial_product3 creates a new list during each iteration. Here's a version of partial_product3 that operates in-place and is just as fast as partial_product2:

    def partial_product3(j, i):
        a = [l << 1 | 1 for l in range(j, i + 1)]
        n = len(a)
        while 1:
            if n == 1:
                return a[0]
            half = n//2
            for k in range(0,half):
                a[k] = a[k*2] * a[k*2+1]
            if n & 1:
                a[half] = a[n-1]
            n = half
    abalkin commented 14 years ago

    Daniel,

    Your variant does not seem to work:

    >>> def partial_product3(j, i):
    ...     a = [l << 1 | 1 for l in range(j, i + 1)]
    ...     n = len(a)
    ...     while 1:
    ...         if n == 1:
    ...             return a[0]
    ...         half = n//2
    ...         for k in range(0,half):
    ...             a[k] = a[k*2] * a[k*2+1]
    ...         if n & 1:
    ...             a[half] = a[n-1]
    ...         n = half
    
    >>> partial_product3(4,6)
    99
    >>> 9 * 11 * 13
    1287

    but it looks like I posted a buggy version of partial_product2 as well. Strange because I thought I had enough doctests to catch the errors. I'll redo the testing.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Isn't it amazing how fast one can make incorrect code? ;-)

    Here is a fixed version of my partial_product3, but now it is no faster than partial_product.

    def partial_product3(j, i):
        a = [l << 1 | 1 for l in range(j, i + 1)]
        n = len(a)
        while 1:
            if n == 1:
                return a[0]
            half = n//2
            for k in range(0,half):
                a[k] = a[k*2] * a[k*2+1]
            if n & 1:
                a[half] = a[n-1]
                n = half + 1
            else:
                n = half
    mdickinson commented 14 years ago

    Does anyone feel like doing a speed comparison between Daniel's C patch and a version with a direct no-frills iterative version of factorial_part_product (i.e., just a simple 'for (i = n; i <= m; i += 2) { <multiply running product by i> }')? I have a sneaking suspicion that the iterative version will be faster even for quite large values of n, but I'd be happy to be proven wrong.
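
    In Python terms, the no-frills version I have in mind is just this (a sketch, one small-by-big multiply per term):

    def partial_product_simple(n, m):
        """Multiply the odd numbers n, n+2, ..., m one at a time."""
        result = 1
        for i in range(n, m + 1, 2):
            result *= i
        return result

    assert partial_product_simple(3, 9) == 3 * 5 * 7 * 9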

    mdickinson commented 14 years ago

    And why are we trying to speed up the pure Python factorial code here? I don't imagine that those speed differences are going to translate well to C.

    abalkin commented 14 years ago

    On Thu, May 13, 2010 at 5:31 PM, Mark Dickinson <report@bugs.python.org> wrote:

    And why are we trying to speed up the pure Python factorial code here?

    I would expect that for large factorials the performance will be determined by the number of long multiplications and the size of multiplicands.

     I don't imagine that those speed differences are going to translate well to C.

    The differences between recursive and non-recursive versions are not likely to translate well, but the difference (if any) between the order of multiplication most likely will.

    In any case, I am attaching a fixed version of factorial4.

    $ ./python.exe -m timeit -s "from factorial4 import f0 as f" "f(10000)"
    10 loops, best of 3: 65.5 msec per loop
    $ ./python.exe -m timeit -s "from factorial4 import f1 as f" "f(10000)"
    10 loops, best of 3: 66.9 msec per loop
    $ ./python.exe -m timeit -s "from factorial4 import f2 as f" "f(10000)"
    10 loops, best of 3: 56.5 msec per loop
    $ ./python.exe -m timeit -s "from factorial4 import f3 as f" "f(10000)"
    10 loops, best of 3: 63 msec per loop
    mdickinson commented 14 years ago

    I would expect that for large factorials the performance will be determined by the number of long multiplications and the size of multiplicands.

    Okay, but I don't think we should care about the performance of *really* large factorials for Python. People who care about every bit of speed in that situation should be using GMP or something similar. An optimization that only makes a difference for (say) factorial(50000) or higher isn't going to make much difference to most Python users. Optimizations that speed up, say, factorial(n) for n <= 1000 would seem more valuable.

    The differences between recursive and non-recursive versions are not likely to translate well, but the difference (if any) between the order of multiplication most likely will.

    Perhaps. But the differences between the various Python versions here are small enough that they could easily be swamped by other factors involved in the Python-to-C translation.

    We already have a working C patch here (modulo minor issues), and I'd like to move forward with that patch; I think this issue discussion is getting a bit side-tracked.

    grumpily-yours...

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Speaking of getting side-tracked, I didn't see an answer to a question I asked earlier. I'd like to get some feedback before I proceed with revising the patch.

    For the find-last-set-bit (to replace log2) and count-set-bits operations, would it be worthwhile to create a pybits.h and .c that defines _Py_FindLastSetBit and _Py_CountSetBits? (with appropriate logic in the .h and configure.in to use system/compiler versions if available)

    There are already two implementations of find-last-set-bit in Python: bits_in_digit() in Objects/longobject.c and hi0bits() in Python/dtoa.c, which I could consolidate.

    Alternately, I could just add static functions to mathmodule.c with the simplest possible implementation (they're only called once per factorial, so the performance impact is minimal).

    abalkin commented 14 years ago

    On Thu, May 13, 2010 at 6:09 PM, Daniel Stutzbach <report@bugs.python.org> wrote: ..

    Speaking of getting side-tracked, I didn't see an answer to a question I asked earlier.  I'd like to get some feedback before I proceed with revising the patch.

    I did not respond because I don't have an answer. :-) Maybe it is time to revisit Raymond's reasoning in msg63969 where he argued that factorial should be an int method because it would benefit from access to integer implementation details.

    He also argued that

    """
    Compared to numbits() and isqrt(), a factorial() method is more basic in that it is self explanatory and everyone knows what it means by the time they are in middle school.
    """

    Nevertheless, numbits() aka bit_length() has become an int method, but factorial landed in math.

    For the find-last-set-bit (to replace log2) and count-set-bits operations, would it be worthwhile to create a pybits.h and .c that defines _Py_FindLastSetBit and _Py_CountSetBits? (with appropriate logic in the .h and configure.in to use system/compiler versions if available)

    Since it is unlikely that either factorial() or bit_length() will be moved from their current location, I would be +1 on creating pybits. I would give them different names, though. Popcount seems to be the most popular name for CountSetBits and instead of FindLastSetBit, a more common function seems to be nlz, number of leading zeros. On the other hand, since bit_length is already established, maybe _Py_bit_length_long(long) and (if needed) _Py_bit_length_int(int) would make sense.

    There are already two implementations of find-last-set-bit in Python: bits_in_digit() in Objects/longobject.c and hi0bits() in Python/dtoa.c, which I could consolidate.

    How did you find it?! I hope we'll end up with a better name than that.

    Alternately, I could just add static functions to mathmodule.c with the simplest possible implementation (they're only called once per factorial, so the performance impact is minimal).

    In the scope of this issue I would say do that. The pybits proposal seems to deserve its own issue.

    abalkin commented 14 years ago

    On Wed, May 12, 2010 at 3:47 PM, Mark Dickinson <report@bugs.python.org> wrote:

    ... Realistically though, I don't see an iterative version of factorial_part_product as an option for the C patch, without a significant increase in complexity.  Daniel's current patch is remarkably clean and simple, and I'd like to keep it that way.

    I am attaching an iterative version as a C patch. I don't think it represents a dramatic increase in complexity: ~40 lines versus Daniel's 30.

    I did think about various evil schemes for an iterative version, ...

    I would not say my patch is evil, maybe a bit naughty. :-) It can be made less evil by resizing the list instead of filling its tail with NULLs or more evil by using a tuple instead of list.

    The performance appears to be identical to Daniel's with no small integer multiplication optimization. The latter gives about a 2% improvement.

    abalkin commented 14 years ago

    On Thu, May 13, 2010 at 5:58 PM, Mark Dickinson <report@bugs.python.org> wrote:

     Optimizations that speed up, say, factorial(n) for n \<= 1000 would seem more valuable.

    I am attaching a variant of my patch which precomputes the partial products that fit in a 32-bit unsigned int. This results in a speedup over Daniel's code that varies from 1.8x for 20! down to 7% for 100!, with no measurable improvement for 1000!.

    This optimization is orthogonal to the choice of partial_product algorithm and can be easily extended on platforms with long long to precompute 64 bit products.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    That's a clever idea. Do you have a Python script that generates the precomputed values?

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    It's a little too clever though. It gives the wrong answer for 29!.

    I'll have a revised version of my patch done sometime tomorrow.

    abalkin commented 14 years ago

    Oh, my!

    How did that last term get into precomputed list?!

    It should have been

    precomputed[] = {3, 15, 5, 35, 315, 63, 693, 9009, 1287, 19305, 328185, 36465, 692835, 14549535, 1322685, 30421755, 760543875, 58503375, 1579591125ul};

    The next term is 36 bit
    >>> product(i<<1|1 for i in range(7,15))
    45808142625
    >>> _.bit_length()
    36

    I'll replace the patch.

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Attached is a patch to improve the unit tests for the factorial function.

    To compute the check value, it keeps a running total instead of recomputing the factorial from scratch inside the loop. It checks up to range(999) and is quite fast. The previous code checked range(10).
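
    The core of the new check is just a running product (a sketch, not the patch itself):

    import math

    total = 1
    for n in range(1, 999):
        total *= n                        # running total, no recomputation
        assert math.factorial(n) == total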

    It also adds the following checks:

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Attached is a simple bash script to run math.factorial(n) through timeit for several values of n. It makes comparing the speed of different builds MUCH easier.

    abalkin commented 14 years ago

    Mark> Does anyone feel like doing a speed comparison between Daniel's C patch and a version with a direct no-frills iterative version of factorial_part_product (i.e., just a simple 'for (i = n; i <= m; i += 2) { <multiply running product by i> }?

    Not a direct answer to your question, but replacing a bisect with a no-frills algorithm in my precompute patch gives the following timings:

    n        bisect     no-frills
    100      38.9 us    38.5 us
    1000     .904 ms    1.08 ms
    10000    35.4 ms    50.3 ms

    The no-frills product still takes 20 lines of C code though:

        n = last - first + 1;
        if (n <= 0)
            return PyLong_FromLong(1L);
        result = PyLong_FromUnsignedLong(ODD(first));
        if (result == NULL)
            return NULL;
        for  (i = 1; i < n; ++i) {
            x = PyLong_FromUnsignedLong(ODD(first + i));
            if (x == NULL)
                goto error;
            tmp = PyNumber_Multiply(result, x);
            Py_DECREF(x);
            if (tmp == NULL)
                goto error;
            Py_DECREF(result);
            result = tmp;
        }
        return result;
     error:
        Py_DECREF(result);
        return NULL;
    mdickinson commented 14 years ago

    Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:

    Speaking of getting side-tracked, I didn't see an answer to a question I asked earlier.  I'd like to get some feedback before I proceed with revising the patch.

    For the find-last-set-bit (to replace log2) and count-set-bits operations, would it be worthwhile to create a pybits.h and .c that defines _Py_FindLastSetBit and _Py_CountSetBits? (with appropriate logic in the .h and configure.in to use system/compiler versions if available)

    How about putting them in pymath.c and pymath.h? Then there's no need for new files. There's the possible issue that the two bit-counting methods operate on different types, though.

    There are already two implementations of find-last-set-bit in Python: bits_in_digit() in Objects/longobject.c and hi0bits() in Python/dtoa.c, which I could consolidate.

    dtoa.c should be left alone, ideally: it's currently almost completely self-contained, and also very close to the original dtoa.c from David Gay, which makes it easy to incorporate fixes from upstream. So it's just Objects/longobject.c that would share the code.

    Alternately, I could just add static functions to mathmodule.c with the simplest possible implementation (they're only called once per factorial, so the performance impact is minimal).

    That would work, too.

    mdickinson commented 14 years ago

    I am attaching an iterative version in C patch.

    Thanks for doing this comparison.

    The performance appears to be identical to Daniel's with no small integer multiplication optimization.

    Okay, let's stick with the recursive version, then. It has the advantage that it uses less space (no need to store the entire list of odd terms).

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    Attached is an updated patch.

    In addition to the code cleanup and bug fix suggestions, it includes a new base-case for factorial_partial_product. Instead of:

    if (n >= m) return n
    if (n + 2 == m) return n*m
    otherwise divide-and-conquer

    It now does:

    if (the_answer_will_fit_in_unsigned_long)
        compute_product_via_tight_loop()
        return answer
    otherwise divide-and-conquer

    It's around half the code of the previous factorial_partial_product(), if you don't count comments. It's also much faster for small n and somewhat faster for moderate n.
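
    In Python terms the new base case is roughly this (a sketch only; BITS_IN_ULONG and the exact fit test here are stand-ins, not the patch's code):

    BITS_IN_ULONG = 64    # stand-in for the C patch's cutoff

    def partial_product(n, m):
        """Product of the odd numbers n, n+2, ..., m (n and m odd, m < 2**64)."""
        num_operands = (m - n) // 2 + 1
        # num_operands factors of at most m.bit_length() bits each, so the
        # product has at most num_operands * m.bit_length() bits.
        if num_operands * m.bit_length() <= BITS_IN_ULONG:
            result = 1
            for i in range(n, m + 1, 2):
                result *= i
            return result
        # Otherwise divide and conquer as before.
        k = (n + (m - n) // 2) | 1
        return partial_product(n, k - 2) * partial_product(k, m)

    assert partial_product(3, 9) == 3 * 5 * 7 * 9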

    n        original patch   new patch
    13       0.848 usec       0.5 usec
    50       2.4 usec         1.2 usec
    100      4.68 usec        2.28 usec
    1000     121 usec         100 usec
    10000    7.68 msec        7.32 msec
    100000   434 msec         447 msec

    I also experimented with adding Alexander's precompute logic to my factorial_loop. However, it did not result in a significant speedup, since all of the cases that could leverage the precompute table now execute in the new fast base case.

    The new patch includes the new unit tests I uploaded earlier.

    abalkin commented 14 years ago

    I agree, the recursive version of partial_product is much simpler to follow. While allocating all the numbers up front can be avoided in the iterative version, doing so would further complicate the code with little benefit.

    I still believe, however that an iterative version can benefit from redefining partial_product to be product(2*i+1 for i in range(start, stop)).

    Divide and conquer algorithm for that is simply

    def partial_product(start, stop):
        length = stop - start
        .. handle length = 1 and 2 ..
        middle = start + (length >> 1)
        return partial_product(start, middle) * partial_product(middle, stop)

    I would also reconsider the decision of using iterative outer loop. Recursive outer loop matching redefined partial_product() can be written as

    def loop(n):
        p = r = 1
        if n > 2:
            p, r = loop(n >> 1)
            p *= partial_product((n >> 2) + (n >> 1 & 1),
                                 (n >> 1) + (n & 1))
            r *= p
        return p, r

    which I believe is not harder to follow than the iterative alternative and does not require bit_length calculation.
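
    Putting the pieces together as a runnable sketch (with the elided partial_product base cases filled in one plausible way; this is not the attached factorial.py):

    def partial_product(start, stop):
        """product(2*i + 1 for i in range(start, stop)), by halving."""
        length = stop - start
        if length <= 0:
            return 1
        if length == 1:
            return 2 * start + 1
        middle = start + (length >> 1)
        return partial_product(start, middle) * partial_product(middle, stop)

    def loop(n):
        p = r = 1
        if n > 2:
            p, r = loop(n >> 1)
            p *= partial_product((n >> 2) + (n >> 1 & 1),
                                 (n >> 1) + (n & 1))
            r *= p
        return p, r

    def factorial(n):
        # The odd-part product, shifted by the number of factors of 2 in n!.
        return loop(n)[1] << (n - bin(n).count("1"))

    assert [factorial(i) for i in range(8)] == [1, 1, 2, 6, 24, 120, 720, 5040]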

    I am replacing my python implementation, factorial.py, with the one that uses algorithms outlined above.

    If there is any interest, I can convert it to C.

    abalkin commented 14 years ago

    I still believe, however that an iterative version can benefit ..

    s/iterative/recursive/

    cfc9f3e0-e33f-4ecd-9ddd-4123842d6c1e commented 14 years ago

    I made a few minor updates to the patch.

    It redefines partial_product to product(range(n, m, 2)), which saved me a few operations and is more Pythonic than what I had before. :-)

    (Not quite what Alexander was after, but it's a step in the right direction. His proposed defining of partial_product would have complicated my new base case.)

    mdickinson commented 14 years ago

    The patch looks good. Just one issue: I think the overflow check for num_operands * last_bit is bogus. It's possible for the product to overflow and still end up being less than num_operands. How about:

    if (num_operands <= BITS_IN_LONG && num_operands * last_m_bit <= BITS_IN_LONG) ...

    instead? If num_operands <= BITS_IN_LONG then there's no danger of overflow in the multiplication, and if num_operands > BITS_IN_LONG then the second test will certainly fail.
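
    To illustrate the wraparound problem with made-up numbers (a Python sketch; the patch's actual check may differ):

    BITS_IN_LONG = 64
    num_operands, last_m_bit = 2**33, 2**31
    # The true product is 2**64, far too big, but a 64-bit multiply wraps to 0,
    # so a check against the wrapped product alone would happily pass:
    wrapped = (num_operands * last_m_bit) % 2**64
    assert wrapped == 0 and wrapped <= BITS_IN_LONG
    # Guarding num_operands first rules overflow out before multiplying:
    assert not (num_operands <= BITS_IN_LONG
                and num_operands * last_m_bit <= BITS_IN_LONG)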

    Apart from that, and some very minor style issues (e.g., please put the body of an if statement on the following line; I don't think this is codified in PEP-7, but it seems to be the prevailing style in the Python codebase) I think the patch is good to go.

    [Alexander] "... benefit from redefining partial_product to be product(2*i+1 for i in range(start, stop))"

    You mean rewriting to use half-open intervals [start, stop) rather than closed intervals? In which case I agree it would look slightly cleaner that way, but I don't have particularly strong feelings on the issue. I'll leave it up to Daniel whether he wants to change this or not.

    "I would also reconsider the decision of using iterative outer loop."

    I actually prefer the iterative version, perhaps out of a Python-bred aversion to recursion (in languages where the frames end up on the stack, anyway). The bit_length calculation doesn't seem like a terrible price to pay to me. Again, I leave this up to Daniel.

    Marking this as accepted: Daniel, if you like I can apply the overflow check change suggested above and the minor style fixes and apply this; or if you want to go another round and produce a third patch, that's fine too. Let me know.

    (N.B. If you do produce another patch, please name it something different and don't remove the earlier version---removing the earlier patch versions makes life harder for anyone trying to follow this issue.)