python / cpython

The Python programming language
https://www.python.org/

Quadratic time internal base conversions #90716

Open tim-one opened 2 years ago

tim-one commented 2 years ago
BPO 46558
Nosy @tim-one, @cfbolz, @sweeneyde
Files
  • todecstr.py
  • todecstr.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['interpreter-core', 'performance']
    title = 'Quadratic time internal base conversions'
    updated_at =
    user = 'https://github.com/tim-one'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'tim.peters'
    components = ['Interpreter Core']
    creation =
    creator = 'tim.peters'
    dependencies = []
    files = ['50593', '50595']
    hgrepos = []
    issue_num = 46558
    keywords = []
    message_count = 9.0
    messages = ['411962', '411966', '411969', '411971', '412120', '412122', '412172', '412191', '412192']
    nosy_count = 3.0
    nosy_names = ['tim.peters', 'Carl.Friedrich.Bolz', 'Dennis Sweeney']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46558'
    versions = []
    ```

    tim-one commented 2 years ago

    Our internal base conversion algorithms between power-of-2 and non-power-of-2 bases are quadratic time, and that's been annoying forever ;-) This applies to int <-> str and int <-> decimal.Decimal conversions. Sometimes the conversion is implicit, like when comparing an int to a Decimal.

    For example:

    >>> a = 1 << 1000000000  # yup! a billion and one bits
    >>> s = str(a)

    I gave up after waiting for over 8 hours, and the computation apparently can't be interrupted.

    In contrast, using the function in the attached todecstr.py gets the result in under a minute:

    >>> a = 1 << 1000000000
    >>> s = todecstr(a)
    >>> len(s)
    301029996

    That builds an equal decimal.Decimal in a "clever" recursive way, and then just applies str to _that_.

    That's actually a best case for the function, which gets major benefit from the mountains of trailing 0 bits. A worst case is all 1-bits, but that still finishes in under 5 minutes:

    >>> a = 1 << 1000000000
    >>> s2 = todecstr(a - 1)
    >>> len(s2)
    301029996
    >>> s[-10:], s2[-10:]
    ('1787109376', '1787109375')

    A similar kind of function could certainly be written to convert from Decimal to int much faster, but it would probably be less effective. These things avoid explicit division entirely, but fat multiplies are key, and Decimal implements a fancier * algorithm than Karatsuba.

    Not for the faint of heart ;-)
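
    For readers who want the flavor of todecstr without opening the attachment, here's a stripped-down sketch of the same idea (an illustration only, not the attached file, which precomputes the O(log log n) powers of 2 it needs up front and is considerably more careful):

    import decimal
    from functools import lru_cache

    # Exact unbounded-precision arithmetic, per the decimal docs' recipe.
    decimal.setcontext(decimal.Context(prec=decimal.MAX_PREC,
                                       Emax=decimal.MAX_EMAX,
                                       Emin=decimal.MIN_EMIN))

    @lru_cache(maxsize=None)
    def _pow2(e):
        # 2**e as an exact Decimal; caching stands in for todecstr's precomputation.
        return decimal.Decimal(2) ** e

    def int_to_decstr(n):
        # n must be a nonnegative int.
        def inner(n, w):
            # Treat n as a w-bit value; split it in half and recombine in Decimal.
            if w <= 1024:
                return decimal.Decimal(n)
            half = w >> 1
            hi, lo = n >> half, n & ((1 << half) - 1)
            return inner(hi, w - half) * _pow2(half) + inner(lo, half)
        return str(inner(n, n.bit_length()))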

    sweeneyde commented 2 years ago

    Is this similar to https://bugs.python.org/issue3451 ?

    tim-one commented 2 years ago

    Dennis, partly, although that was more aimed at speeding division, while the approach here doesn't use division at all.

    However, thinking about it, the implementation I attached doesn't actually work for many cases (it doesn't build as much of the power tree in advance as may be needed). Which I missed because all the test cases I tried had mountains of trailing 0 or 1 bits, not mixtures.

    So I'm closing this anyway, at least until I can dream up an approach that always works. Thanks!

    tim-one commented 2 years ago

    Changed the code so that inner() only references one of the O(log log n) powers of 2 we actually precomputed (it could get lost before if lo was non-zero but the part of n being converted had at least one leading zero bit - now we pass the conceptual width instead of computing it on the fly).

    tim-one commented 2 years ago

    The test case here is a = (1 << 100000000) - 1, a solid string of 100 million 1 bits. The goal is to convert to a decimal string.

    Methods:

    native: str(a)

    numeral: the Python numeral() function from bpo-3451's div.py after adapting to use the Python divmod_fast() from the same report's fast_div.py.

    todecstr: from the Python file attached to this report.

    gmp: str() applied to gmpy2.mpz(a).

    Timings:

    native: don't know; gave up after waiting over 2 1/2 hours.
    numeral: about 5 1/2 minutes.
    todecstr: under 30 seconds. (*)
    gmp: under 6 seconds.

    So there's room for improvement ;-)

    But here's the thing: I've lost count of how many times someone has whipped up a pure-Python implementation of a bigint algorithm that leaves CPython in the dust. And they're generally pretty easy in Python.

    But then they die there, because converting to C is soul-crushing, losing the beauty and elegance and compactness to mountains of low-level details of memory-management, refcounting, and checking for errors after every tiny operation.

    So a new question in this endless dilemma: _why_ do we need to convert to C? Why not leave the extreme cases to far-easier to write and maintain Python code? When we're cutting runtime from hours down to minutes, we're focusing on entirely the wrong end to not settle for 2 minutes because it may be theoretically possible to cut that to 1 minute by resorting to C.

    (*) I hope this algorithm tickles you by defying expectations ;-) It essentially stands numeral() on its head by splitting the input by a power of 2 instead of by a power of 10. As a result no divisions are used. But instead of shifting decimal digits into place, it has to multiply the high-end pieces by powers of 2. That seems insane on the face of it, but hard to argue with the clock ;-) The "tricks" here are that the O(log log n) powers of 2 needed can be computed efficiently in advance of any splitting, and that all the heavy arithmetic is done in the decimal module, which implements fancier-than-Karatsuba multiplication and whose values can be converted to decimal strings very quickly.

    tim-one commented 2 years ago

    Addendum: the "native" time (for built in str(a)) in the msg above turned out to be over 3 hours and 50 minutes.

    9719010a-7a6d-45ea-a4c0-9c1ede24d0ea commented 2 years ago

    Somebody pointed me to V8's implementation of str(bigint) today:

    https://github.com/v8/v8/blob/main/src/bigint/tostring.cc

    They say that they can compute str(factorial(1_000_000)) (which is 5.5 million decimal digits) in 1.5s:

    https://twitter.com/JakobKummerow/status/1487872478076620800

    As far as I understand the code (I suck at C++) they recursively split the bigint into halves using % 10^n at each recursion step, but pre-compute and cache the divisors' inverses.

    tim-one commented 2 years ago

    The factorial of a million is much smaller than the case I was looking at. Here are rough timings on my box, for computing the decimal string from the bigint (and, yes, they all return the same string):

    native: 475 seconds (about 8 minutes)
    numeral: 22.3 seconds
    todecstr: 4.10 seconds
    gmp: 0.74 seconds

    "They recursively split the bigint into halves using % 10^n at each recursion step". That's the standard trick for "output" conversions. Beyond that, there are different ways to try to use "fat" multiplications instead of division. The recursive splitting all on its own can help, but dramatic speedups need dramatically faster multiplication.

    todecstr treats it as an "input" conversion instead, using the decimal module to work mostly in base 10. That, by itself, reduces the role of division (to none at all in the Python code), and decimal has a more advanced multiplication algorithm than CPython's bigints have.

    tim-one commented 2 years ago

    todecstr treats it as an "input" conversion instead, ...

    Worth pointing this out since it doesn't seem widely known: "input" base conversions are _generally_ faster than "output" ones. Working in the destination base (or a power of it) is generally simpler.

    In the math.factorial(1000000) example, it takes CPython more than 3x longer for str() to convert it to base 10 than for int() to reconstruct the bigint from that string. Not an O() thing (they're both quadratic time in CPython today).
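
    A quick way to reproduce that comparison locally (my snippet; absolute times will vary by machine and Python version):

    import math, time

    n = math.factorial(1_000_000)

    t0 = time.perf_counter()
    s = str(n)                     # "output" conversion: int -> decimal string
    t_out = time.perf_counter() - t0

    t0 = time.perf_counter()
    m = int(s)                     # "input" conversion: decimal string -> int
    t_in = time.perf_counter() - t0

    assert m == n
    print(f"str(n): {t_out:.1f}s   int(s): {t_in:.1f}s   ratio: {t_out / t_in:.1f}x")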

    adamant-pwn commented 1 year ago

    Hi, why is this issue closed? What needs to be done to make it through?

    tim-one commented 1 year ago

    It's closed because nobody appears to be both willing and able to pursue it.

    But it's on the edge regardless. In general, any number of huge-int algorithms could be greatly speeded, but that's a massive undertaking. It's why GMP exists, to push such things as far as possible, regardless of implementation complexity, effort, or bulk. Very capable GMP bindings for Python are already available (gmpy2).

    As the timings here suggest, GMP will generally be faster than anything we may do anyway, because GMP never reaches a point where its authors say "good enough already".

    That said, I expect the single most valuable bigint speedup CPython could implement would be to bigint division, along the lines of gh-47701. That could indirectly give major speed boosts to bigint->str and bigint modular pow() too.

    bjorn-martinsson commented 1 year ago

    Why do you need fast division? Why not just implement str to int conversion using this

    12345678 = 1234 * 10^4 + 5678 = (12 * 10^2 + 34) * 10^4 + (56 * 10^2 + 78) = ...

    style of d&c?

    This would be simple to implement and would only require multiplication.

    If $M(n)$ is the time it takes to multiply two $n$ bit numbers, then the cost of string to int conversion using this d&c is $T(n) = 2 T(n/2) + M(n)$.

    If multiplication is done using Karatsuba ( $M(n) = O(n^{1.58})$ ) then $T(n) \approx 3 M(n)$. If multiplication is done in $M(n) = O(n \log n)$ time, then $T(n) = O(n \log^2 n)$.

    Since Python currently uses Karatsuba for its big int multiplication, this simple str to int conversion algorithm would run in $O(n^{1.58})$.
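
    For the record, unrolling that recurrence (assuming $M(n) = c \cdot n^{1.58}$ and ignoring the base-case cutoff) pins down the constant:

    $$T(n) = \sum_{k \ge 0} 2^k \, M(n / 2^k) = M(n) \sum_{k \ge 0} 2^{-0.58 k} = \frac{M(n)}{1 - 2^{-0.58}} \approx 3\, M(n)$$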

    adamant-pwn commented 1 year ago

    Yep, even simplest divide and conquer with native multiplication would already provide significant speed-up.

    bjorn-martinsson commented 1 year ago

    Also if you want something slightly smarter, since 10 is $2 * 5$ you can use bitshifts.

    12345678 = 1234 * 10^4 + 5678 = ((1234 * 5^4) << 4) + 5678
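
    (A quick REPL check of that identity:)

    >>> 1234 * 10**4 + 5678 == ((1234 * 5**4) << 4) + 5678
    True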

    bjorn-martinsson commented 1 year ago

    I made an implementation of the basic d&q algorithm to test it out

    # pow5[k] holds 5**(2**k); indices up to 22 cover inputs of roughly 8 million digits.
    pow5 = [5]
    while len(pow5) <= 22:
        pow5.append(pow5[-1] * pow5[-1])

    def str_to_int(s):
        def _str_to_int(l, r):
            # Convert the digit substring s[l:r] to an int.
            if r - l <= 3000:
                return int(s[l:r])
            # Split off the largest power-of-2 count of low digits.
            lg_split = (r - l - 1).bit_length() - 1
            split = 1 << lg_split
            # high * 10**split + low, with 10**split computed as 5**split << split.
            return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
        return _str_to_int(0, len(s))

    Running this locally, str_to_int is as fast as int at about 3000 digits. For 40000 digits str_to_int takes 0.00722 s and int takes 0.0125 s. For 400000 digits str_to_int takes 0.272 s and int takes 1.27 s. For 4000000 digits str_to_int takes 10.3 s and int takes 127 s.

    Clearly str_to_int is subquadratic, with the same time complexity as Karatsuba, while int is quadratic. Also worth noting is that with a faster big int mult, str_to_int would definitely run a lot faster.
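
    For anyone who wants to reproduce those numbers, a small harness along these lines works (my snippet; absolute times will of course differ by machine):

    import time
    from random import choices

    for ndigits in (40_000, 400_000, 4_000_000):
        s = ''.join(choices('123456789', k=ndigits))
        for f in (str_to_int, int):
            t0 = time.perf_counter()
            f(s)
            print(f'{f.__name__}, {ndigits} digits: {time.perf_counter() - t0:.3f} s')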

    bjorn-martinsson commented 1 year ago

    I also tried out str_to_int with GMP (gmpy2.mpz) integers instead of Python's big integers to see what faster big int multiplication could lead to.

    For 40000 digits str_to_int with GMP takes 0.000432 s. For 400000 digits str_to_int with GMP takes 0.00817 s. For 4000000 digits str_to_int with GMP takes 0.132 s.

    So what Python actually needs is faster big int multiplication. With that you'd get a really fast string to int converter for free.

    tim-one commented 1 year ago

    All the earlier timing examples here were of int->str, not of str->int. For the latter case, which you're looking at, splitting a decimal string by a power of 10 is indeed trivial (conceptually - but by the time you finish coding it all in C, with all the memory-management, refcounting, and error-checking cruft debugged, not so much :wink:).

    As already noted, the report's todecstr() does the harder int->str direction without division too, relying instead on faster-than-Karatsuba multiplication, by doing most of the arithmetic in a power-of-10 base to begin with (via the decimal module, which does implement one faster-than-Karatsuba multiplication method). Python's bigints are represented internally in a power-of-2 base, and there is no linear-time way to split one by a power of 10. So todecstr() doesn't even try to; instead it splits by various powers of 2.

    But faster native bigint division could nevertheless buy a major speedup for int->str, as already demonstrated near the start here by showing timings for the numeral() function. Much faster than str(), although still much slower than todecstr().

    And it could also buy major speedups for modular pow(), which is increasingly visible as people playing with crypto schemes move to fatter keys.

    And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps.

    gpshead commented 1 year ago

    And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps.

    It would be useful to have practical examples of actual Python applications that would benefit from any of:

    a) high performance huge int MUL or DIV
    b) high performance huge int to decimal or other non-binary-power base conversion.
    c) high performance huge int from decimal string conversion.

    Including if they've adopted a third party library for this math and would no longer have any need for that if improved.

    Where "high performance" means more than constant-time faster than what CPython int offers today. I'll leave the definition of "huge" up to you, but if the need is less than 5 of CPython's current implementation detail 30-bit value digits I'm skeptical. :)

    Bigints really feel like a neat computer science toy most of the time. We've got them, but what actually uses integers larger than 64 or 128 bits?

    Similarly, once numbers get huge, what is the point of converting them to and from decimal? Decimal is for humans and humans don't readily comprehend precise numbers that large let alone type them correctly.

    tim-one commented 1 year ago

    "Practical applications" isn't really the name of this game :wink:.

    Most complaints come from more-or-less newbies, who, e.g., play around in an interactive shell, and are surprised to find that sometimes the shell seems to freeze (but is really just waiting for a quadratic-time, or worse, bigint operation to finish).

    People who know what they're doing find many uses for big bigints, but more of a theoretical, research, educational, or recreational bent. Like, algorithm design, combinatorics, number theory, or contributing to the wonderful On-Line Encyclopedia of Integer Sequences. They also arise as incidental side effects of using power tools, like in a symbolic math package computing symbolic power series or (for related reasons) symbolic high derivatives (integer coefficients and/or denominators often grow at an exponential rate across terms). Or they can be almost the whole banana, such as scaling an ill-conditioned float matrix to use exact bigint arithmetic instead to compute an exact matrix inverse or determinant (where, sticking to floats, the result could be pure noise due to compounding rounding errors).

    Not all that long ago, a number of people worked on adding "gonzo" optimizations to math.comb(). I played along, but was vocally disinclined to do ever more. Of course GMP still runs circles around CPython for many comb() bigint arguments. Such efforts would, IMO, have been better spent on speeding bigint // (which, as a matter of course, would also give another speed boost to comb() - and to any number of other bigint algorithms in users' own Python code - // is a fundamental building block, not "an application").

    bjorn-martinsson commented 1 year ago

    One good example of using big integers is math people using Python for some quick calculations. When you use functions like factorial, it is easy to get pretty big integers. I also think that math people generally prefer decimal numbers over hex.

    Another example would be working with fractions. Just adding fractional numbers makes their denominators and numerators grow really quickly. I also don't think fractions support hex. Both reading and printing require decimal as far as I can tell. For example you can do fractions.Fraction('1/2') but not fractions.Fraction('0x1/0x2').
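
    As a small illustration of how fast Fraction denominators grow (my snippet):

    from fractions import Fraction

    h = sum(Fraction(1, n) for n in range(1, 100))
    print(h.denominator.bit_length())   # well over 100 bits after just 99 terms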

    The examples I've given above are the type of scripts that won't appear in things like Google's codebase, even if they are relatively common use cases.

    Another important thing to note is that programmers trust the built in functions in Python to just work. That is the fundamental reason why int <=> str became an issue in the first place. This is also a great argument for why it is important to fix the time complexity issues with big ints.

    nascheme commented 1 year ago

    I think Tim's suggestion of switching to a Python implementation of a more efficient algorithm for large inputs would be a good approach. It is reasonable to not want to maintain a complicated high-performance algorithm written in C. This would be an interesting project for a student or someone wanting a challenge. Calling into Python from C is not too hard to do. Use a C-API to import a Python module and call a function inside of it.

    gpshead commented 1 year ago

    I'm reopening this as we seem to have agreement that doing something nicer here now that we've got a default limit is a good thing.

    pochmann commented 1 year ago

    Here's a decimal-based str_to_int (posting here as Tim suggested) that's ~4.8 times faster than int for 10^6 digits and ~32 times faster for 10^7 digits (measured with Python 3.8.12). It converts the string to Decimal, divides that into 1024-bit Decimal "digits", converts each to 128-byte bytes (via int), joins those and converts back to int.

    from decimal import *
    from time import time
    from random import choices
    
    setcontext(Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN))
    
    def str_to_int_using_decimal(s, bits=1024):
        d = Decimal(s)
        div = Decimal(2) ** bits
        divs = []
        while div <= d:
            divs.append(div)
            div *= div
        digits = [d]
        for div in reversed(divs):
            digits = [x
                      for digit in digits
                      for x in divmod(digit, div)]
            if not digits[0]:
                del digits[0]
        b = b''.join(int(digit).to_bytes(bits//8, 'big')
                     for digit in digits)
        return int.from_bytes(b, 'big')
    
    # Benchmark against int() on random 1-million digits string
    funcs = [int, str_to_int_using_decimal]
    s = ''.join(choices('123456789', k=1_000_000))
    expect = None
    for f in funcs * 3:
        t = time()
        result = f(s)
        print(time() - t, f.__name__)
        if expect is None:
            expect = result
        assert result == expect

    tim-one commented 1 year ago

    @pochmann, that's encouraging - but please post code people can run as-is. For example, this code uses Decimal but doesn't define it. I can easily guess what you intended there, but the first "big" example I tried (str(10**1000000)) died with decimal.Overflow when doing div *= div. If you need to fiddle the decimal context too (which I expect you do), explicitly show the code you believe is needed.

    pochmann commented 1 year ago

    Sorry. Ok, updated it to be a complete script.

    pochmann commented 1 year ago

    I btw measured on replit.com, as I'm only on a phone. Would like to see times with 3.10 on a normal PC.

    tim-one commented 1 year ago

    From a very quick run on Windows 10, 64-bit, quad core desktop box, under Python 3.10.8, using precisely (no changes) the code you posted:

    4.04166316986084 int
    1.9819040298461914 str_to_int_using_decimal
    4.033673286437988 int
    1.952543020248413 str_to_int_using_decimal
    4.062668800354004 int
    1.9575085639953613 str_to_int_using_decimal

    But that's all I can make time for now. Boosting it to 10 million digits, I'm still waiting for the first line of output :wink:.

    ...

    OK, first two lines of output at 10 million digits tell enough:

    408.42323637008667 int
    29.8494553565979 str_to_int_using_decimal

    So int is clearly quadratic-time in the number of digits, while your code is asymptotically much better than that.

    pochmann commented 1 year ago

    Thanks. Smaller ratio than I got (for 1e7 digits I had 575 vs 18 seconds), bit disappointing.

    tim-one commented 1 year ago

    Oh, I don't know - x-platform optimization is a PITA. So many things can differ. May, e.g., look quite different under gcc with various optimization gimmicks.

    More interesting was tossing Neil's PR's str_to_int() into the mix. Only one block of output, and no effort to keep the machine otherwise quiet - just looking for the "high order bits":

    at a million digits
    4.044649600982666 int
    1.955308437347412 str_to_int_using_decimal
    0.6150074005126953 str_to_int
    
    at 10 million digits
    405.7727048397064 int
    30.155847549438477 str_to_int_using_decimal
    24.715866088867188 str_to_int

    So your code gets off to a worse start, but appears to have much better asymptotics (but not yet better enough to overtake str_to_int() at 10 million digits).

    tim-one commented 1 year ago

    But the new code wins convincingly at 100 million digits (and skipping int because it would take over 10 hours):

    at 100 million digits
    445.37026476860046 str_to_int_using_decimal
    948.4797194004059 str_to_int

    tim-one commented 1 year ago

    News to me: I knew decimal had a fancier * scheme than just Karatsuba, but I didn't know decimal division also benefited. Looking at the code, for "big enough" decimals, it uses Newton (multiply-by-reciprocal) division. The final multiplication, and the reciprocal computation, both benefit from decimal's asymptotically superior very-fat-multiplicand multiplication.

    pochmann commented 1 year ago

    Yes, before I implemented my idea I checked decimal's speed for // and divmod. For large n, "n digits * n digits" looked about four times faster than "2n digits // n digits".

    I'm not familiar with Newton's reciprocal way. Does it actually compute decimal reciprocals? My solution still works if I use Emin=0 instead of Emin=MIN_EMIN (which I blindly copied from the docs). Which I btw think would be more appropriate anyway, for your todecstr as well?

    tim-one commented 1 year ago

    gh-47701 starts with a discussion of (and Python code for) "Newton division".

    "Newton's method" for computing a reciprocal of x starts with a guess r, and then iterates r = r*(2 - r*x). For example, start with x=3 and r=0.2. Note that no division is needed:

    >>> x = 3
    >>> r = 0.2
    >>> r = r * (2 - r * x); r
    0.27999999999999997
    >>> r = r * (2 - r * x); r
    0.32480000000000003
    >>> r = r * (2 - r * x); r
    0.33311488
    >>> r = r * (2 - r * x); r
    0.3333331901677568
    >>> r = r * (2 - r * x); r
    0.3333333333332718

    and so on. The number of correct digits approximately doubles on each iteration. y/x is then computed as y * (1/x).

    Code isn't that simple, of course, because no floats are actually used in this context, but rather scaled integers. Getting the last retained bit exactly right is delicate.
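
    For the curious, here's a rough scaled-integer sketch of the shape such code takes (my illustration, not the gh-47701 code; it leans on a final correction loop instead of getting the last retained bit exactly right, and it only pays off when the dividend isn't vastly wider than the divisor):

    def _approx_inv(x, n):
        # Approximate 2**(2*n) // x for an n-bit x (so 2**(n-1) <= x < 2**n),
        # roughly doubling the number of good bits at each level of recursion.
        if n <= 32:
            return (1 << (2 * n)) // x
        h = (n + 1) // 2
        r = _approx_inv(x >> (n - h), h) << (n - h)   # half-precision estimate, rescaled
        return 2 * r - ((x * r * r) >> (2 * n))       # one Newton step: r*(2 - r*x)

    def newton_div(y, x):
        # Floor division y // x (x > 0, y >= 0) built from multiplications only.
        xb = x.bit_length()
        n = max(xb, (y.bit_length() + 1) // 2)
        r = _approx_inv(x << (n - xb), n)             # ~ 2**(n + xb) // x
        q = (y * r) >> (n + xb)
        # r is only approximate, so nudge q into place.
        while (q + 1) * x <= y:
            q += 1
        while q * x > y:
            q -= 1
        return q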

    I also doubt setting Emin is needed here, but it's what the decimal docs recommend when trying to do "exact arbitrary precision arithmetic". It's dirt cheap to set, and so I just think it's maximally clear to do exactly what the docs spell out. There's likewise no compelling reason to enable the Inexact trap, but it acts as a dirt-cheap way to assert that the computations we're doing really are delivering exact results.

    bjorn-martinsson commented 1 year ago

    I made a very bare-bones implementation of Schönhage-Strassen's algorithm in Python as a proof of concept. It probably shouldn't even be called Schönhage-Strassen, but I don't know what else to call it. Feel free to try it out! It should have the same time complexity as Schönhage-Strassen.

    strassen.py

    ```py
    def shift(x, m, b):
        # Calculate (x << b) % (2**m + 1)
        b %= 2 * m
        if b >= m:
            x = -x
            b -= m
        upper = x >> (m - b)
        x -= upper << (m - b)
        return (x << b) - upper

    # NTT using Cooley-Tukey, a divide and conquer algorithm
    # Note that Cooley-Tukey requires n to be a power of two
    def NTT(A, m, root = 1):
        n = len(A)
        if n == 1:
            return A
        root2 = root * 2 % (2 * m)
        even = NTT(A[::2], m, root2)
        odd = NTT(A[1::2], m, root2)
        # Apply the twiddle factor
        odd = [shift(odd[k], m, root * k) for k in range(n//2)]
        return [even[k] + odd[k] for k in range(n//2)] +\
               [even[k] - odd[k] for k in range(n//2)]

    def INTT(A, m, root = 1):
        A2 = NTT(A, m, -root)
        inverse_shift = -m.bit_length()
        return [shift(a, m, inverse_shift) for a in A2]

    def int_to_limbs(x, limb, n):
        # Convert big integer x to base 'limb' (a list of length n)
        if n == 1:
            return [x]
        n2 = n//2
        upper = x >> n2 * limb
        lower = x - (upper << n2 * limb)
        return int_to_limbs(lower, limb, n2) + int_to_limbs(upper, limb, n - n2)

    def limbs_to_int(A, limb):
        # Convert a number given in base 'limb' into a big integer
        n = len(A)
        if n == 1:
            return A[0]
        n2 = n//2
        return limbs_to_int(A[:n2], limb) + (limbs_to_int(A[n2:], limb) << (n2 * limb))

    def ssa(x1, x2):
        """ Calculate x1 * x2 using Schonhagen-Strassen's algorithm """
        n1 = x1.bit_length()
        n2 = x2.bit_length()

        # Find smallest mod 2**m + 1 that will work
        def check(m):
            # If m is too small, return 0
            # Otherwise return smallest working digit size
            dig = 1
            while True:
                n_dig1 = (n1 + dig - 1) // dig
                n_dig2 = (n2 + dig - 1) // dig
                n = n_dig1 + n_dig2 - 1
                if n > 2 * m:
                    # dig too small
                    dig += 1
                    continue
                if max(n_dig1, n_dig2) * 2**(2*dig) <= 2**m:
                    return dig
                else:
                    return 0  # impossible m

        m = 1
        while not check(m):
            m *= 2  # Only powers of 2 will work because we use Cooley-Tukey for the NTT
        limb = check(m)

        print('Multiplying a', n1, 'bit number with a', n2, 'bit number')
        print('Using mod 2**m + 1, with m =', m)
        print('Using limb of', limb, 'bits')
        print('Reduction to', 2*m, 'multiplications of numbers with', m, 'bits')

        import time
        L = time.perf_counter()
        A = int_to_limbs(x1, limb, 2 * m)
        B = int_to_limbs(x2, limb, 2 * m)
        A = NTT(A, m)
        B = NTT(B, m)
        # The big x1 * x2 calculation is split into 2*m, m bit multiplications
        C = [shift(a * b, m, 0) for a,b in zip(A,B)]
        C = INTT(C, m)
        C = [shift(c, m, 0) for c in C]
        y = limbs_to_int(C, limb)
        R = time.perf_counter()
        print('Time taken by Schonhagen-Strassen', R - L)

        L = time.perf_counter()
        x1 * x2
        R = time.perf_counter()
        print('Built in time taken', R - L)
        assert y == x1 * x2
        return y

    ssa(141**1000000, 99**1000000)
    ```

    Here are some numbers to get a feel for it

    So it breaks even at around 1e6 bits, and becomes significantly faster at 1e7 bits. Since it speeds up multiplication, it would also speed up algorithms dependent on multiplication (like str -> int conversion) by roughly the same amount.

    I have some ideas for making the implementation faster. For example, I think I might be able to use a trick called the "square root of 2 trick" to speed it up by roughly a factor of 2. But I have not yet tried implementing that. If anyone has any ideas of how to improve this implementation in general, then I'm all ears.

    tim-one commented 1 year ago

    Cool! I never implemented a * algorithm fancier than Karatsuba, so I'm incompetent to review it unless you pointed to, and I could make time to digest, a paper explaining the method first.

    There are many other algorithms that could benefit from this. Most critically to my eyes is that Mark Dickinson's "fast division" code in Neil's PR also essentially inherits its asymptotic behavior from *.

    pochmann commented 1 year ago

    @bjorn-martinsson I ran your code as-is and got:

    Time taken by Schonhagen-Strassen 1.3782309700036421

    The check part seems to be part of the algorithm, as it uses n1 and n2 from the inputs and you use its results m and limb in the "actual"(?) algorithm code. So I included the check loop in the timing (started the time before m = 1) and got:

    Time taken by Schonhagen-Strassen 18.702068729995517

    Is that check part a part of the algorithm or not?

    I was trying to use ssa in some other benchmarking and now I don't know how to handle it. I removed the tail part with the builtin multiplication, but I don't know what to do with the check part.

    bjorn-martinsson commented 1 year ago

    @bjorn-martinsson I ran your code as-is and got:

    Is that check part a part of the algorithm or not?

    I was trying to use ssa in some other benchmarking and now I don't know how to handle it. I removed the tail part with the builtin multiplication, but I don't know what to do with the check part.

    As I said this is just a proof of concept. I have not optimized the check() part at all. But it could easily be optimized. Only the number reported by print('Time taken by Schonhagen-Strassen', R - L) matters.

    I have been working on an actual proper implementation of Schönhage-Strassen based on these notes. I will post it later today. It seems a lot faster, and it doesn't have the cheeky check() function.

    bjorn-martinsson commented 1 year ago

    Here comes the updated version. This is a proper implementation of Schönhage-Strassen following these notes. If you are new to FFT algorithms, and want to learn more about how FFT algorithms work, then I recommend reading this pdf.

    Schönhage-Strassen implementation

    ```py
    def shift(x, m, b):
        # Calculate (x << b) % (2**m + 1)
        b %= 2 * m
        if b >= m:
            x = -x
            b -= m
        upper = x >> (m - b)
        x -= upper << (m - b)
        return (x << b) - upper

    # FFT using Cooley-Tukey, a divide and conquer algorithm
    # NOTE that Cooley-Tukey requires n to be a power of two
    def NTT(A, m, root = 1):
        n = len(A)
        if n == 1:
            return A
        root2 = root * 2 % (2 * m)
        even = NTT(A[::2], m, root2)
        odd = NTT(A[1::2], m, root2)
        # Multiply by roots of unity
        odd = [shift(odd[k], m, root * k) for k in range(n//2)]
        return [even[k] + odd[k] for k in range(n//2)] +\
               [even[k] - odd[k] for k in range(n//2)]

    def INTT(A, m, root = 1):
        A2 = NTT(A, m, -root)
        inverse_shift = -(len(A).bit_length() - 1)
        return [shift(a, m, inverse_shift) for a in A2]

    def int_to_limbs(x, limb, n):
        # Convert big integer x to base 'limb' (a list of length n)
        if n == 1:
            return [x]
        n2 = n//2
        upper = x >> n2 * limb
        lower = x - (upper << n2 * limb)
        return int_to_limbs(lower, limb, n2) + int_to_limbs(upper, limb, n - n2)

    def limbs_to_int(A, limb):
        # Convert a number given in base 'limb' into a big integer
        n = len(A)
        if n == 1:
            return A[0]
        n2 = n//2
        return limbs_to_int(A[:n2], limb) + (limbs_to_int(A[n2:], limb) << (n2 * limb))

    def negacyclic_convolution(A, B, m):
        n = len(A)
        assert len(A) == len(B) <= m
        assert m % n == 0
        root = m // n
        # Weight A and B
        A = [shift(a, m, root * i) for i,a in enumerate(A)]
        B = [shift(b, m, root * i) for i,b in enumerate(B)]
        A = NTT(A, m, 2 * root)
        B = NTT(B, m, 2 * root)
        A = [shift(a, m, 0) for a in A]
        B = [shift(b, m, 0) for b in B]
        if m >= 1e5:
            # Use ssa if num of bits in a and b >= 1e5
            C = [SSA(a, b, m) for a,b in zip(A,B)]
        else:
            C = [shift(a * b, m, 0) for a,b in zip(A,B)]
        C = INTT(C, m, 2 * root)
        C = [shift(c, m, -root * i) for i,c in enumerate(C)]
        C = [shift(c, m, 0) for c in C]
        return C

    def SSA(x1, x2, M = None):
        """
        Calculates x1 * x2 if M is None
        Otherwise calculates x1 * x2 % (2**M + 1)
        """
        n = 256  # Split factor, configurable, needs to be a power of 2
        if M is None:
            n1 = x1.bit_length()
            n2 = x2.bit_length()
            limb = (n1 + n2 + n - 1) // n
            M = limb * n
        else:
            assert M % n == 0
            limb = M // n
        lower_lim_m = 2 * limb + n.bit_length() + 1
        m = ((lower_lim_m + n - 1)//n) * n
        A = int_to_limbs(x1, limb, n)
        B = int_to_limbs(x2, limb, n)
        C = negacyclic_convolution(A, B, m)
        # Detect underflow in output of the negacyclic convolution
        C = [c - ((1 << m) + 1) if c >> (m - 1) else c for c in C]
        return limbs_to_int(C, limb)

    #a = 12345**20000; b = 13245**20000
    a = 12345**200000; b = 13245**200000
    #a = 12345**2000000; b = 13245**2000000
    #a = 12345**4000000; b = 13245**4000000

    print('Multiplying', a.bit_length(), 'bit number with', b.bit_length(), 'bit number', flush=True)

    import time
    L = time.perf_counter()
    y1 = SSA(a, b)
    R = time.perf_counter()
    print('SSA took', R - L, flush=True)
    L = time.perf_counter()
    y2 = a * b
    R = time.perf_counter()
    print('Built in took', R - L)
    assert y1 == y2
    ```

    Here are some benchmarks

    Multiplying 271833 bit number with 273864 bit number
    SSA took 0.02662439999403432
    Built in took 0.023815600012312643
    
    Multiplying 27183279 bit number with 27386321 bit number
    SSA took 6.168935099994997
    Built in took 37.30421709999791
    
    Multiplying 54366557 bit number with 54772641 bit number
    testing mult
    SSA took 12.37155929999426
    Built in took 113.47091920000094

    Looking at these numbers, it seems like SSA matches the built-in at around 3e5 bits. SSA seems to take almost linear time; for example, doubling the input size from 2.7e7 bits to 5.4e7 bits increased the runtime by a factor of 2.005.

    pochmann commented 1 year ago

    @bjorn-martinsson Great, thanks! I tried benchmarking it against Decimal multiplication, but it looks odd, like sublinear complexity. I guess either the machine wasn't stable or SSA hasn't reached full speed yet at those sizes (like when you have T(n) = n²+1000n, it looks linear for a while until n is large enough that the n² part dominates the 1000n part).

    Results

    ```
    Multiplying 271,833 bit number with 273,864 bit number
    SSA took 0.0191372789995512
    Decimal took 0.008566489996155724
    SSA took 2.2339696898191903 times as long

    Multiplying 2,718,328 bit number with 2,738,633 bit number
    SSA took 0.29287373799888883
    Decimal took 0.07542935800302075
    SSA took 3.882755279279455 times as long

    Multiplying 27,183,279 bit number with 27,386,321 bit number
    SSA took 5.697764136995829
    Decimal took 0.8656999059967347
    SSA took 6.581685059137941 times as long

    Multiplying 54,366,557 bit number with 54,772,641 bit number
    SSA took 10.283152007003082
    Decimal took 1.7681520909973187
    SSA took 5.815762150417114 times as long

    Multiplying 108,733,114 bit number with 109,545,282 bit number
    SSA took 17.25244608199864
    Decimal took 3.922430416001589
    SSA took 4.398407174189027 times as long
    ```

    Code

    Same as yours, except the different benchmark at the end (the SSA code itself is repeated verbatim from the previous comment, so only the changed tail is shown here):

    ```
    import time
    from decimal import *

    setcontext(Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN))

    for e in 20_000, 200_000, 2_000_000, 4_000_000, 8_000_000:
        a, b = 12345**e, 13245**e
        print(f'\nMultiplying {a.bit_length():,} bit number with {b.bit_length():,} bit number', flush=True)
        L = time.perf_counter()
        y1 = SSA(a,b)
        R = time.perf_counter()
        print('SSA took ', t1 := R - L, flush=True)
        del y1
        a, b = Decimal(12345)**e, Decimal(13245)**e
        L = time.perf_counter()
        y2 = a * b
        R = time.perf_counter()
        print('Decimal took', t2 := R - L, flush=True)
        del y2
        print('SSA took', t1/t2, 'times as long', flush=True)
    ```

    tim-one commented 1 year ago

    Very interesting! Here's a replacement for the tail end, which adds GMP to the mix (gmpy2 needs to be installed). The real purpose to adding GMP is to make it possible to check the SSA and Decimal results for equality in reasonable time (gmpy2.mpz(str(a_Decimal)) is the fastest way I know of to convert a Decimal to an internal binary format, exploiting GMP's base-conversion speed).

    Sample output (not on a quiet box, and note that I added another exponent at the end):

    Multiplying 271,833 bit number with 273,864 bit number
    GMP took 0.0010034999995696126
    SSA took 0.015251999999236432
    dec took 0.009539000000586384
      dec/gmp   9.51
      ssa/gmp  15.20
      ssa/dec   1.60
    
    Multiplying 2,718,328 bit number with 2,738,633 bit number
    GMP took 0.012331699999776902
    SSA took 0.2186782000007952
    dec took 0.0884312999987742
      dec/gmp   7.17
      ssa/gmp  17.73
      ssa/dec   2.47
    
    Multiplying 27,183,279 bit number with 27,386,321 bit number
    GMP took 0.19443679999858432
    SSA took 3.5190698000005796
    dec took 0.9444862000000285
      dec/gmp   4.86
      ssa/gmp  18.10
      ssa/dec   3.73
    
    Multiplying 54,366,557 bit number with 54,772,641 bit number
    GMP took 0.3977763999992021
    SSA took 6.972084000000905
    dec took 1.9923395999994682
      dec/gmp   5.01
      ssa/gmp  17.53
      ssa/dec   3.50
    
    Multiplying 108,733,114 bit number with 109,545,282 bit number
    GMP took 0.8567504000002373
    SSA took 13.989722300000722
    dec took 4.0560991000002105
      dec/gmp   4.73
      ssa/gmp  16.33
      ssa/dec   3.45
    
    Multiplying 217,466,228 bit number with 219,090,564 bit number
    GMP took 1.8607211999988067
    SSA took 32.826477100001284
    dec took 8.73305900000014
      dec/gmp   4.69
      ssa/gmp  17.64
      ssa/dec   3.76

    and the code:

    from gmpy2 import mpz
    from time import perf_counter as now
    
    B1, B2 = 12345, 13245
    for e in 20_000, 200_000, 2_000_000, 4_000_000, 8_000_000, 16_000_000:
        da, db = Decimal(B1)**e, Decimal(B2)**e
        ga, gb = mpz(B1)**e, mpz(B2)**e
        a, b = map(int, (ga, gb))
        print(f'\nMultiplying {a.bit_length():,} bit number with {b.bit_length():,} bit number',
              flush=True)
    
        L = now()
        y0 = ga * gb
        R = now()
        print('GMP took', t0 := R - L, flush=True)
    
        L = now()
        y1 = SSA(a, b)
        R = now()
        print('SSA took', t1 := R - L, flush=True)
        assert y0 == y1
    
        L = now()
        y2 = da * db
        R = now()
        print('dec took', t2 := R - L, flush=True)
        assert y0 == mpz(str(y2))
    
        print(f"  dec/gmp {t2/t0:6.2f}")
        print(f"  ssa/gmp {t1/t0:6.2f}")
        print(f"  ssa/dec {t1/t2:6.2f}")

    matthiasgoergens commented 1 year ago

    Perhaps a silly question: couldn't we 'just' make GMP a dependency and use their algorithms?

    (I appreciate that GMP probably has an incompatible license. But I don't think it's the only game in town?)

    tim-one commented 1 year ago

    couldn't we 'just' make GMP a dependency and use their algorithms?

    I'm too old to spend another second of my life hassling with license issues, so someone else will have to wrestle that bear. We definitely can't ship GMP with Python.

    A missing bit of the puzzle is how many Python programmers will actually care. No way to guess. But, e.g., SageMath does ship with GMP, and SymPy and mpmath will use GMP too if they detect that gmpy2 is installed (else they'll use CPython's ints). And those cover the most obvious audiences for whom "really big ints" are truly important.

    CPython could presumably also check for the presence of gmpy2 and use different base-conversion algorithms internally, depending. But nothing is as simple as you'd hope. For example, if you pass a string to int() representing a decimal value so large GMP gets a malloc() failure, GMP will just print a message to stderr and abort the whole process.

    So, by all means, pursue this if you like - just don't expect it to be easy :wink:.

    gpshead commented 1 year ago

    couldn't we 'just' make GMP a dependency and use their algorithms?

    Not a silly question. IIRC @vstinner did an experiment of using GMP as the PyLong implementation in the past and unsurprisingly found it to slow things down, as its focus is primarily large numbers and Python programs are normally focused on tiny, 1-3 internal-digit numbers? If you wanted to use it, you'd end up aiming for a multi-implementation PyLong type that switches at math-operation time based on the size of the inputs (including conversion costs). That's a lot of complexity.

    The other big can of worms that Tim alludes to is licensing. GMP is LGPL v3 / GPL v2. We do have other L?GPL-linking batteries in the standard library (readline & gdbm at least), but they're optional and not in the core itself. LGPL/GPL is unusable in many CPython environments. So it'd always need to be an optional dependency.

    IMNSHO it is easier to let big-math libraries that are focused on big-number performance do the "dynamically choose to use GMP" dance, as several already appear to do. I don't think we should focus on GMP for this issue.

    tim-one commented 1 year ago

    Neil has an open PR (gh-96673) to add some asymptotically better int->str, str->int, and int division implementations, but we're in agreement that "ambition is the fatal enemy of major improvements" in this area. Any number of proposals to achieve major improvements, with relatively simple code, have died over the decades, due to the peanut gallery :wink: throwing tomatoes because the proposals don't cover all possible improvements. So they bloat, and bloat, until the original proposer just gives up.

    So Neil (with my enthusiastic support) is aiming to limit the scope of his PR, and even to leave the code in Python. Check it in, it's big wins. Other people can build on it later, if they're willing to devote their time to it.

    One of the things we discussed off-line was whether to try to auto-detect gmpy2 too. I repeated the "ambition is the fatal enemy .." mantra, and it hasn't come up again. That doesn't preclude someone pursuing it later.

    I've been annoyed at times by CPython's int<->str sloth for 30 years now. I don't care here about DoS vulnerabilities, or competing with GMP, I just want to see some major asymptotic improvements checked in before I die 😉

    BTW, this isn't a "not invented here" thing. I'd love to use GMP, but licensing hassles kill that idea in general.

    If licensing wasn't a fatal hassle, it would probably be best to re-introduce Python 2's distinction between int and long, and use GMP for the latter. 64-bit addition is typically a 1-cycle HW operation now, and just checking to see how many internal digits a Python long has would swamp the cost of the useful work. But then this is Python, and "an int is an int is an int" is the most Pythonic mental model. Experts can get that 1-cycle add by, e.g., using a numpy fixed-width integer type.

    matthiasgoergens commented 1 year ago

    @tim-one @gpshead Thanks for indulging me and explaining the difficulties!

    vstinner commented 1 year ago

    My notes on GMP (copy of my https://lwn.net/Articles/907572/ comments).

    We are trying to keep the Python implementation as simple as possible to keep the maintenance burden low enough. More and more tradeoffs are made to make Python faster, but for "bigint", the policy is to guide users to more specialized libraries like GMP. Some proposed optimizations for operations on bigint are rejected to keep the Python code simple enough.

    Python performance is fine with "small bigints", bad for "large bigints" :-) I don't know the threshold, it may depend on the machine. I would say that up to 1000 decimal digits, the Python bigint algorithms are fast enough for most use cases on a desktop CPU. For numbers larger than 10,000 decimal digits, you might want even faster algorithms which are more complicated, fewer people understand them, and they're harder to maintain. GMP is just a perfect place for such highly optimized algorithms ;-) For example, Python avoids assembly code, whereas GMP loves it :-)

    There are two reasons: license and performance. Using GMP requires allocating a Python object on the heap (1st memory block) and the GMP data on the heap (2nd memory block). The GMP license doesn't fit with the Python license. I wrote an implementation of Python ints using GMP in 2007; it worked, but was slower than the existing Python code:

    Python is fast at allocating Python integer objects, and the algorithms on Python ints are fast enough for numbers up to 2^64 (20 decimal digits), which fits most use cases.

    I tried that in the early days of Python 3, since the Python 2 "small int" type (single digit) was removed. Python 2 had two integer types ("small" and "long"), Python 3 has a single integer type (only "long").

    vstinner commented 1 year ago

    While we cannot use gmpy2 in Python, we might make its usage more transparent, but I'm not sure how. It would be helpful to have a command line option to say "create any Python integer with gmpy2".

    Seriously, I would like a similar flag to "create any Python float with decimal.Decimal". It would avoid rounding issues when I compute some prices in the Python prompt (REPL).

    vstinner commented 1 year ago

    One blocker issue for "pluggable integer types" (using numpy.int64 instead of Python int?) is the isinstance(var, int) test, which is false if the custom integer type is not a subclass of the Python int type.

    >>> import numpy
    >>> x=numpy.int64(3)
    >>> isinstance(x, int)
    False

    In Python 2, there was a base class basestring. In Python 3, a Python int is an instance of the PyLongObject structure at the C level, and this structure contains the actual digits (allocated on the heap). It would be interesting to find a way to have a base int class and/or a way to "register" classes compatible with Python int.
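
    For what it's worth, the numbers ABCs in the stdlib already provide a "register" mechanism of roughly this kind, although it doesn't help code that tests against int directly. A small illustration (my snippet):

    import numbers

    class MyInt:
        """Stand-in for a third-party integer type (illustration only)."""

    numbers.Integral.register(MyInt)              # ABC registration, no subclassing needed

    print(isinstance(MyInt(), numbers.Integral))  # True
    print(isinstance(MyInt(), int))               # False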

    tim-one commented 1 year ago

    @vstinner , you have important experience here and useful things to say - but I'm sad to see it recorded in an issue about base conversions. It will just get lost here, and especially after this issue is closed (which I expect to do if/when Neil commits his decidedly non-ambitious PR, focused solely on asymptotic improvements for int<->str, and integer divmod(), on "really big" ints).

    How about opening a different issue (or two ...) for much broader ideas? I'd suggest Python-Ideas, except that's where everything goes to die because nobody wants to hear anything new :wink:.

    tylerhou commented 1 year ago

    What is preventing a pure C implementation from being checked in? Or -- what has blocked previous implementations in the past?

    I think it would not be difficult to write a standalone C implementation to convert between decimal and binary. But it would require recursion and memory allocation. I don't know enough about CPython development to know whether that would be a problem.