Use 30-bit digits instead of 15-bit digits for Python integers.

mdickinson commented 15 years ago

BPO	4258
Nosy	@tim-one, @loewis, @gpshead, @mdickinson, @pitrou, @vstinner, @tiran
Dependencies	bpo-5260: longobject.c: minor fixes, cleanups and optimizations
Files	pybench_results.txt: pybench results optimize.patch: Improve mark's patch long_stat.patch 30bit_longdigit13.patch 30bit_longdigit13+optimizations.patch pidigits.py 30bit_longdigit14.patch pidigits_noprint.py pidigits_bestof.py: pidigits benchmark w/warmup and 5 runs 30bit_longdigit13+optimizations1.patch: Improve mark's patch 30bit_longdigit17.patch 30bit_longdigit17+asm.patch 30bit_longdigit20.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/mdickinson' closed_at = created_at = labels = ['interpreter-core', 'performance'] title = 'Use 30-bit digits instead of 15-bit digits for Python integers.' updated_at = user = 'https://github.com/mdickinson' ``` bugs.python.org fields: ```python activity = actor = 'mark.dickinson' assignee = 'mark.dickinson' closed = True closed_date = closer = 'mark.dickinson' components = ['Interpreter Core'] creation = creator = 'mark.dickinson' dependencies = ['5260'] files = ['11950', '11951', '11952', '13108', '13118', '13119', '13122', '13127', '13132', '13137', '13142', '13149', '13164'] hgrepos = [] issue_num = 4258 keywords = ['patch'] message_count = 78.0 messages = ['75491', '75492', '75494', '75512', '75513', '75518', '75520', '75522', '75534', '75536', '75539', '75540', '75551', '75553', '75559', '75560', '75561', '75562', '75695', '75718', '75720', '75746', '77769', '77770', '82116', '82250', '82251', '82318', '82319', '82320', '82321', '82325', '82326', '82342', '82344', '82345', '82346', '82347', '82348', '82350', '82362', '82364', '82368', '82374', '82375', '82385', '82386', '82404', '82407', '82424', '82425', '82426', '82428', '82429', '82430', '82431', '82432', '82433', '82434', '82444', '82445', '82448', '82458', '82466', '82468', '82469', '82533', '82538', '82563', '82599', '82669', '82792', '83389', '83467', '83772', '83773', '83774', '83865'] nosy_count = 11.0 nosy_names = ['tim.peters', 'loewis', 'collinwinter', 'gregory.p.smith', 'mark.dickinson', 'pitrou', 'pernici', 'vstinner', 'christian.heimes', 'jyasskin', 'schuppenies'] pr_nums = [] priority = 'low' resolution = 'accepted' stage = 'patch review' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue4258' versions = ['Python 2.7'] ```

mdickinson commented 15 years ago

Here's an experimental patch, against the py3k branch, that makes Python represent its long integers internally in base 2**30 instead of base 2**15, on platforms that have 32-bit and 64-bit integers available.

On platforms for which autoconf is unable to find both 32-bit and 64-bit integers, the representation falls back to the usual one.

See also bpo-1814 (GMP for longs), and the discussion at

http://mail.python.org/pipermail/python-dev/2008-November/083315.html

(note particularly Tim Peter's message at:

http://mail.python.org/pipermail/python-dev/2008-November/083355.html )

mdickinson commented 15 years ago

Note that to avoid "bad marshal data" errors, you'll probably need to do a 'make distclean' before rebuilding with this patch.

mdickinson commented 15 years ago

Here's an updated patch, with the following changes from the original:

make the size of a digit (both the conceptual size in bits and actual size in bytes) available to Python via a new structseq sys.int_info. This information is useful for the sys.getsizeof tests.
fix a missing cast in long_hash
better fast path for 1-by-1 multiplication that doesn't go via PyLong_FromLongLong.

vstinner commented 15 years ago

Note that to avoid "bad marshal data" errors, you'll probably need to do a 'make distclean' before rebuilding with this patch.

I saw that you choosed to use the base 2^30 for marshal. For a better portability (be able to use .pyc generated without your patch), you may keep the base 2^15. I implemented that in my GMP patch (manual conversion from/to base 2^15).

If we change the marshal format of long, the magic number should be different (we might use a tag like the "full unicode" tag used in Python3 magic number) and/or the bytecode (actual bytecode is 'l'). The base should be independent of the implementation, like Python does with text: UTF-8 for files and UCS-4 in memory. We may use the base 2^8 (256) or another power or 2^8 (2^16, 2^32, 2^64?). The base 256 sounds interresting because any CPU is able to process 8 bits digits.

Cons: Use a different bases makes Python slower for loading/writing from/to .pyc.

gpshead commented 15 years ago

oh yay, thanks. it looks like you did approximately what i had started working on testing a while back but have gone much further and added autoconf magic to try and determine when which size should be used. good.

(i haven't reviewed your autoconf stuff yet)

As for marhsalled data and pyc compatibility, yes that is important to consider.

We should probably base the decision on which digit size to use internally on benchmarks, not just if the platform can support 64bit ints. Many archs support 64bit numbers as a native C type but require multiple instructions or are very slow when doing it.

(embedded arm, mips or ppc come to mind as obvious things to test that on)

vstinner commented 15 years ago

As for marhsalled data and pyc compatibility, yes that is important to consider.

The problem is also that with the 30-bit digit patch, some Python will use 15 bits whereas some other will use 30 bits. The base in marshal should be the same in both cases.

And why 30 bits and not 31 bits, or 63 bits, or 120 bits? We may use other bases in the future. That's why I prefer to use a common base like base 256. And GMP has functions (mpz_import) to load data in base 256, but it's more complicate to load data in base 2^15.

mdickinson commented 15 years ago

[Victor Stinner]

I saw that you choosed to use the base 2^30 for marshal. For a better portability (be able to use .pyc generated without your patch), you may keep the base 2^15. I implemented that in my GMP patch (manual conversion from/to base 2^15).

Agreed. I'll fix this so that the .pyc format is unchanged. Thanks!

And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

Mostly laziness: the change from 15 to 30 bits turned out to be extraordinarily easy. Note that the longobject.c part of the patch does almost nothing except adding a bunch of casts here and there.

31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5. It would gain very little over 30 bits, and if Pernici Mario's optimizations are considered (bpo-3944) multiplication would likely be slower with 31-bit digits than with 30-bit digits.

63 (or 62, or 60) bits is simply too large right now: you'd need access to a hardware 64 x 64 -> 128 bit multiply to make this worth it, and I'd guess there are many fewer platforms that make this easy, though I don't really know. I know it's possible on gcc/x8664 by making use of the (undocumented) \_uint128_t type. But even where this type is available, base 2**63 might still be slower than base 2**30. I've done some experiments with multiprecision *decimal arithmetic in C that showed that even on a 64-bit machine, using base 10**9 (i.e. approx 30 bits) was significantly faster than using base 10*\18.

120 bits? Does GMP even go this far? I guess part of the attraction is that it's a multiple of 8...

The other obvious choices to consider would be 32 bits (or 16 bits, or 64 bits). This is possible, and may even be worth it, but it would be a *much* bigger effort; most of the algorithms would need to be rewritten. One problem is again the mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler, but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety of compilers.

I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and hence low risk of accidental breakage. So if there are indeed significant efficiency benefits (still to be determined) then it looks like a clear win to me.

[Gregory Smith]

(i haven't reviewed your autoconf stuff yet)

The changes to configure and pyconfig.h are just from rerunning autoconf and autoheader; it's only configure.in that should need looking at.

vstinner commented 15 years ago

> And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

Mostly laziness (...)

It was an argument for changing the base used by the mashal :-)

31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5

Powering is an simple algorithm. If it was the division, it would be much harder :-) Python stores the sign of the number in the first digit. Because of that, we are limited to 15/30 bits. Storing the sign in the size (which size? no idea yet) would allows to use a bigger base (31 bits? 63 bits?).

One problem is again the mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler, but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety of compilers.

I wrote an example to detect overflows in C on the mailing list. Copy of my email: ------------------------------- 8\< ---------------------- About 31, 32, 63 or 64 bits: I guess that you want to avoid integer overflow. Intel has an "overflow" flag, changed for all instructions. For other CPUs, you can use emulate this flag. Example with the type int that I used in my GMP patch:

Add:
  int a, b, sum;
  sum = a + b;
  exact = ((a < 0) ^ (b < 0)) || ((a < 0) == (sum < 0));

Substract:
  int a, b, diff;
  diff = a + b;
  exact = ((a < 0) == (b < 0)) || ((a < 0) == (diff < 0));

Multiply:
  int a, b, product;
  if (a == 0 || b == 0) {
     product = 0;  /* exact */
  } else if (a != INT_MIN || (b == 1)) {
     product = a * b;
     exact = (product / b) == a);
  } else {
     /* INT_MIN * -1 = -INT_MIN: overflow */
  }

Divide:
  int a, b, q;
  if (a != INT_MIN || b != -1) {
     q = a / b;   /* exact */
  } else {
     /* INT_MIN / -1 = -INT_MIN: overflow */
  }

Checking overflow may costs more than using a smaller base. Only a benchmark can answer ;-) ------------------------------- 8\< ----------------------

I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and hence low risk of accidental breakage.

Python has an amazing regression test suite! I used it to fix my GMP patch. We can experiment new bases using this suite.

Anyway, i love the idea of using 30 bits instead of 15! Most computer are now 32 or 64 bits! But it's safe to keep the 15 bits version to support older computers or buggy compilers.

I started to work with GIT. You may be interrested to work together on GIT. It's much easier to exchanges changeset and play with branches. I will try to publish my GIT tree somewhere.

mdickinson commented 15 years ago

Following Victor's suggestion, here's an updated patch; same as before, except that marshal now uses base 2**15 for reading and writing, independently of whether PyLong_SHIFT is 15 or 30.

vstinner commented 15 years ago

Mark: would it be possible to keep the "2 digits" hack in PyLong_FromLong, especially with base 2^15? Eg. "#if PyLong_SHIFT == 15". The base 2^15 slow, so don't make it slower :-)

/ 2 digits \/
if (!(ival >> 2*PyLong_SHIFT)) {
v = _PyLong_New(2);
if (v) {
Py_SIZE(v) = 2*sign;
v->ob_digit[0] = (digit)ival & PyLong_MASK;
v->ob_digit[1] = ival >> PyLong_SHIFT;
}
return (PyObject*)v;
}

vstinner commented 15 years ago

marshal now uses base 2**15 for reading and writing

Yes, it uses base 2**15 but it's not the correct conversion to base 2**15. You convert each PyLong digit to base 2**15 but not the whole number. As a result, the format is different than the current mashal version.

vstinner commented 15 years ago

PyLong_FromLong() doesn't go into the 1 digit special case for 
negative number. You should use:
    /* Fast path for single-digits ints */
    if (!(abs_ival>>PyLong_SHIFT)) {
        v = _PyLong_New(1);
        if (v) {
            Py_SIZE(v) = sign;
            v->ob_digit[0] = abs_ival;
        }
        return (PyObject*)v;
    }

mdickinson commented 15 years ago

Yes, it uses base 2**15 but it's not the correct conversion to base 2**15. You convert each PyLong digit to base 2**15 but not the whole number.

I don't understand: yes, each base 2**30 digit is converted to a pair of base 2**15 digits, and if necessary (i.e., if the top 15 bits of the most significant base 2**30 digit are zero) the size is adjusted. How is this not converting the whole number?

As a result, the format is different than the current mashal version.

Can you give an example of an integer n such that marshal.dumps(n) gives you different results with and without the patch? As far as I can tell, I'm getting the same marshal results both with the unpatched version and with the patch applied.

mdickinson commented 15 years ago

Other responses...

It was an argument for changing the base used by the mashal :-)

Ah. I think I'm with you now. You're saying that ideally, marshal shouldn't have to care about how Python stores its longs: it should just ask some function in longobject.c to provide an already-converted- to-base-256 representation. Is that right?

I also find the idea of making the marshal representation base 256 quite attractive. There are already functions in longobject.c that could be used for this: _PyLong_{From,As}ByteArray. And then you wouldn't need to touch marshal.c when swapping the GMP version of longobject.c in and out.

Python stores the sign of the number in the first digit. Because of that, we are limited to 15/30 bits.

No: the sign is stored in the size: if v is a PyLongObject then ABS(Py_SIZE(v)) gives the number of digits in the absolute value of the integer represented by v, and the sign of Py_SIZE(v) gives the sign of the integer.

would it be possible to keep the "2 digits" hack in PyLong_FromLong, especially with base 2^15? Eg. "#if PyLong_SHIFT == 15". The base 2^15 slow, so don't make it slower :-)

Why don't we leave this kind of micro-optimization out until we've got some benchmarks. (I'm also tempted to get rid of the long_mul fast path for now.)

PyLong_FromLong() doesn't go into the 1 digit special case for negative number.

Well spotted! Yes, this should be fixed. I have a nasty feeling that I was responsible for introducing this bug some time ago...

mdickinson commented 15 years ago

Here's a pybench comparison, on OS X 10.5/Core 2 Duo/gcc 4.0.1 (32-bit non-debug build of the py3k branch). I got this by doing:

[create clean build of py3k branch] dickinsm$ ./python.exe Tools/pybench/pybench.py -f bench_unpatched [apply 30bit patch and rebuild] dickinsm$ ./python.exe Tools/pybench/pybench.py -c bench_unpatched

Highlights: SimpleLongArithmetic: around 10% faster. SimpleComplexArithmetic: around 16% slower! CompareFloatsIntegers: around 20% slower.

I'll investigate the slowdowns.

vstinner commented 15 years ago

I'll investigate the slowdowns

The problem may comes from int64_t on 32 bits CPU. 32x32 -> 64 may be emulated on your CPU and so it's slower. I improved your patch to make it faster, but I lost all my work because of a misuse of GIT... As I remember:

I fixed PyLong_FromLong() for small negative integer
I unrolled the loop in PyLong_FromLong(): the loop is at least called twice (the number has 2 digits or more)
I added special code for operations on two numbers of 1 digit (each) for long_add(), long_mul(), long_div(), long_bitwise(), etc.
and I don't remember the other changes...

Oh, I have an old patch. I will attach it to this message. About speed, it was:

unpatched: 20600..20900 pystones
your patch: 19900..20100 pystones
- my changes: 20200..20400 pytones

vstinner commented 15 years ago

I wrote a patch to compute stat about PyLong function calls.

make (use setup.py):

PyLong_FromLong: 168572 calls, min=( 0, ), avg=(1.4, ), max=( 3, ) long_bool: 48682 calls, min=( 0, ), avg=(0.2, ), max=( 2, ) long_add: 39527 calls, min=( 0, 0), avg=(0.9, 1.0), max=( 2, 3) long_compare: 39145 calls, min=( 0, 0), avg=(1.2, 1.1), max=( 3, 3) PyLong_AsLong: 33689 calls, min=( 0, ), avg=(0.9, ), max=( 45, ) long_sub: 13091 calls, min=( 0, 0), avg=(0.9, 0.8), max=( 1, 1) long_bitwise: 4636 calls, min=( 0, 0), avg=(0.8, 0.6), max=( 2, 2) long_hash: 1097 calls, min=( 0, ), avg=(0.9, ), max=( 3, ) long_mul: 221 calls, min=( 0, 0), avg=(0.8, 1.1), max=( 2, 2) long_invert: 204 calls, min=( 0, ), avg=(1.0, ), max=( 1, ) long_neg: 35 calls, min=( 1, ), avg=(1.0, ), max=( 1, ) long_format: 3 calls, min=( 0, ), avg=(0.7, ), max=( 1, ) long_mod: 3 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1) long_pow: 1 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)

pystone:

PyLong_FromLong:1587652 calls, min=( 0, ), avg=(1.0, ), max=( 3, ) long_add: 902487 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 2, 2) long_compare: 651165 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 3, 3) PyLong_AsLong: 252476 calls, min=( 0, ), avg=(1.0, ), max=( 2, ) long_sub: 250032 calls, min=( 1, 0), avg=(1.0, 1.0), max=( 1, 1) long_bool: 102655 calls, min=( 0, ), avg=(0.5, ), max=( 1, ) long_mul: 100015 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2) long_div: 50000 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1) long_hash: 382 calls, min=( 0, ), avg=(1.1, ), max=( 2, ) long_bitwise: 117 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2) long_format: 1 calls, min=( 2, ), avg=(2.0, ), max=( 2, )

min/avg/max are the integer digit count (minimum, average, maximum).

What can we learn from this numbers?

PyLong_FromLong(), long_add() and long_compare() are the 3 most common operations on integers.

Except PyLong_FromLong(), long_compare() and long_format(), arguments of the functions are mostly in range [-2^15; 2^15].

Biggest number is a number of 45 digits: maybe just one call to long_add(). Except this number/call, the biggest numbers have between 2 and 3 digits.

long_bool() is never called with number bigger than 2 digits.

long_sub() is never called with number bigger than 1 digit!

vstinner commented 15 years ago

And now the stat of Python patched with 30bit_longdigit3.patch.

min/avg/max are now the number of bits which gives better informations. "bigger" is the number of arguments which are bigger than 1 digit (not in range [-2^30; 2^30]).

make \====

_FromLong: 169734 calls, min=( 0, ), avg=(11.6, ), max=( 32, ) \--> bigger=31086 long_bool: 48772 calls, min=( 0, ), avg=( 0.3, ), max=( 24, ) long_add: 39685 calls, min=( 0, 0), avg=( 6.5, 3.5), max=( 19, 32) \--> bigger=1 long_compare: 39445 calls, min=( 0, 0), avg=( 9.3, 8.4), max=( 31, 33) \--> bigger=10438 _AsLong: 33726 calls, min=( 0, ), avg=( 4.9, ), max=(1321, ) \--> bigger=10 long_sub: 13285 calls, min=( 0, 0), avg=( 7.6, 5.6), max=( 13, 13) long_bitwise: 4690 calls, min=( 0, 0), avg=( 1.7, 1.9), max=( 16, 16) long_hash: 1097 calls, min=( 0, ), avg=( 8.1, ), max=( 33, ) \--> bigger=4 long_mul: 236 calls, min=( 0, 0), avg=( 1.3, 5.4), max=( 17, 17) long_invert: 204 calls, min=( 0, ), avg=( 2.4, ), max=( 3, ) long_neg: 35 calls, min=( 1, ), avg=( 4.3, ), max=( 7, ) long_format: 3 calls, min=( 0, ), avg=( 2.0, ), max=( 4, ) long_mod: 3 calls, min=( 1, 2), avg=( 1.7, 2.0), max=( 2, 2) long_pow: 1 calls, min=( 2, 6), avg=( 2.0, 6.0), max=( 2, 6)

Notes about make:

PyLong_FromLong(), long_compare(), PyLong_AsLong() and long_hash() gets integers not in [-2^30; 2^30] which means that all other functions are only called with arguments of 1 digit!
PyLong_FromLong() gets \~30.000 (18%) integers of 32 bits
global average integer size is between 0.3 and 11.6 (~6.0 bits?)
There are 41.500 (12%) big integers on \~350.000 integers

pystone \=======

_FromLong: 1504983 calls, min=( 0, ), avg=( 5.1, ), max=( 31, ) \--> bigger=14 long_add: 902487 calls, min=( 0, 0), avg=( 3.9, 2.4), max=( 17, 17) long_compare: 651165 calls, min=( 0, 0), avg=( 1.7, 1.4), max=( 31, 31) \--> bigger=27 _AsLong: 252477 calls, min=( 0, ), avg=( 4.6, ), max=( 16, ) long_sub: 250032 calls, min=( 1, 0), avg=( 4.0, 1.6), max=( 7, 7) long_bool: 102655 calls, min=( 0, ), avg=( 0.5, ), max=( 7, ) long_mul: 100015 calls, min=( 0, 0), avg=( 2.5, 2.0), max=( 4, 16) long_truediv: 50000 calls, min=( 4, 2), avg=( 4.0, 2.0), max=( 4, 2) long_hash: 382 calls, min=( 0, ), avg=( 8.1, ), max=( 28, ) long_bitwise: 117 calls, min=( 0, 0), avg=( 6.7, 6.6), max=( 15, 16) long_format: 1 calls, min=(16, ), avg=(16.0, ), max=( 16, )

Notes about pystone:

very very few numbers are bigger than one digit: 41 / \~4.000.000
global average integer size is between 0.5 and 6.7 (~3.0 bits?)
the biggest number has only 31 bits (see long_compare)

Short summary:

pystone doesn't use big integer (1 big integer for 100.000 integers) => don't use pystone!
the average integer size is around 3 or 6 bits, which means that most integers can be stored in 8 bits (-255..255) => we need to focus on the very small numbers => base 2^30 doesn't help for common Python code, it only helps programs using really big numbers (128 bits or more?)

mdickinson commented 15 years ago

Here's a minor update to the patch, that does some extra cleanup:

don't include longintrepr.h in Objects/abstract.c or Objects/boolobject.c --- it's not needed.
fix several places in longobject.c where int should have been size_t or Py_ssize_t
remove some unnecessary forward declarations in longobject.c.
fix PyLong_FromLong for small negative integers

At some point I'll try to separate the pure bugfixes (missing casts, int vs Py_ssize_t, etc.) from the 15-bit to 30-bit conversion.

vstinner commented 15 years ago

Using 30bit_longdigit4.patch, I get this error: "Objects/longobject.c:700: erreur: "SIZE_T_MAX" undeclared (first use in this function)". You might use the type Py_ssize_t with PY_SSIZE_T_MAX. I used INT_MAX to compile the code.

vstinner commented 15 years ago

I like the idea of sys.int_info, but I would prefer a "base" attribute than "bits_per_digit". A base different than 2^n might be used (eg. a base like 10^n for fast conversion from/to string).

mdickinson commented 15 years ago

Here's a version of the 15-bit to 30-bit patch that adds in a souped-up version of Mario Pernici's faster multiplication.

I did some testing of 100x100 digit and 1000x1000 digit multiplies. On 32-bit x86: 100 x 100 digits : around 2.5 times faster 1000 x 1000 digits : around 3 times faster.

On x86_64, I'm getting more spectacular results: 100 x 100 digits : around 5 times faster 1000 x 1000 digits: around 7 times faster!

The idea of the faster multiplication is quite simple: with 30-bit digits, one can fit a sum of 16 30-bit by 30-bit products in a uint64_t. This means that the inner loop for the basecase grade-school multiplication contains one fewer addition and no mask and shift.

[Victor, please don't delete the old longdigit4.patch!]

pitrou commented 15 years ago

Just tested the patch, here are some benchmarks:

./python -m timeit -s "a=100000000;b=777777" "a//b" -> 2.6: 0.253 usec per loop -> 3.1: 0.61 usec per loop -> 3.1 + patch: 0.331 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a*b" -> 2.6: 0.431 usec per loop -> 3.1: 0.302 usec per loop -> 3.1 + patch: 0.225 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a+b" -> 2.6: 0.173 usec per loop -> 3.1: 0.229 usec per loop -> 3.1 + patch: 0.217 usec per loop

But it seems there are some outliers:

./python -m timeit -s "a=100000000**5+1;b=777777**3" "a//b" -> 2.6: 1.13 usec per loop -> 3.1: 1.12 usec per loop -> 3.1 + patch: 1.2 usec per loop

vstinner commented 15 years ago

I wrote a small benchmark tool dedicated to integer operations (+ -

/ etc.): bench_int.py attached to bpo-4294. See also Message75715 and Message75719 for my benchmark results. Short sum up: 2^30 base helps a lot on 64 bits CPU (+26%) whereas the speedup is small (4%) on 32 bits CPU. But don't trust benchmarks :-p

mdickinson commented 15 years ago

The most recent patch is out of date and no longer applies cleanly. I'm working on an update.

mdickinson commented 15 years ago

Updated patch against py3k. I'm interested in getting this into the trunk as well, but py3k is more important (because *all* integers are long integers). It's also a little more complicated to do this for py3k (mostly because of all the small integer caching), so backporting to 2.7 is easier than trying to forward port a patch from 2.7 to 3.1.

Notes:

I've added a configure option --enable-big-digits (there are probably better names), enabled by default. So you can use --disable-big-digits to get the old 15-bit behaviour.
I *think* this patch should work on Windows; confirmation would be appreciated.
I've removed the fast multiplication code, in the interests of keeping the patch simple. If this patch goes in, we can concentrate on speeding up multiplication afterwards. For now, note that 30-bit digits give the *potential* for significant speedups in multiplication and division (see next item).
There's a nasty 'feature' in x_divmod: the multiplication in the innermost loop is digit-by-twodigits -> twodigits, when it should be digit-by-digit -> twodigits; this probably causes significant slowdown with 30-bit digits. This may explain Antoine's outlier. Again, if this patch goes in I'll work on fixing x_divmod.
Re: Victor's comment about a 'base' attribute: I tried this, but quickly discovered that we still need the 'bits_per_digit' for tests. I think that binaryness is so ingrained that it's not really worth worrying about the possibility of the base changing from a power of 2 to a power of 10. So in the end I left base out.
It did occur to me that NSMALLPOSINTS and NSMALLNEGINTS might usefully be exposed in sys.int_info, mostly for the purposes of testing. Thoughts?

mdickinson commented 15 years ago

Forgot to mention: you'll need to rerun autoconf and autoheader after applying the patch and before doing ./configure

mdickinson commented 15 years ago

Antoine, were your posted results on a 64-bit or a 32-bit system?

pitrou commented 15 years ago

Mark, I think it was 32-bit at the time.

pitrou commented 15 years ago

Now with the latest patch, and under a 64-bit system (the same one actually, but with a 64-bit distro):

pybench is roughly 2% slower
timeit -s "a=100000000;b=777777" "a//b"
before: 0.563 usec per loop
after: 0.226 usec per loop
timeit -s "a=100000000;b=777777" "a*b"
before: 0.179 usec per loop
after: 0.131 usec per loop
timeit -s "a=100000000;b=777777" "a+b"
before: 0.174 usec per loop
after: 0.134 usec per loop

pitrou commented 15 years ago

Actually, I think my previous results were in 64-bit mode already.

By the way, I don't think unconditionally using uint64_t is a good thing on 32-bit CPUs. uint64_t might be an emulated type, and operations will then be very slow. It would be better to switch based on sizeof(long) (for LP64 systems) or sizeof(void *) (for LLP64 systems).

pitrou commented 15 years ago

Actually, I still get a speedup on a 32-bit build. :)

mdickinson commented 15 years ago

Here's a version of the patch that includes optimizations to basecase multiplication, and a streamlined x_divrem for faster division. With Victor's benchmark, I'm getting 43% speed increase on 64-bit Linux/Core 2 Duo.

Note: the base patch is stable and ready for review; in contrast, the optimizations are still in a state of flux, so the +optimizations patch is just there as an example of what might be possible.

About using uint64_t:

the 64-bit type isn't really used very much: its main role is as the result type of a 32-bit by 32-bit multiplication. So it might not matter too much if it's an emulated type; what's important is that the 32-bit by 32-bit multiply with 64-bit results is done in a single CPU instruction. I don't know how to test for this. Do you know of a mainstream system where this isn't true?

I'll test this tonight on 32-bit PPC and 32=bit Intel, and report back.

I don't care very much about trying to *automatically* do the right thing for small or embedded systems: they can use the --disable-big-digits configure option to turn 30-bit digits off.

Antoine, do you think we should be using 30-bit digits by default *only* on 64-bit machines? I guess I could go with that, if it can be manually overridden by the configure option.

pitrou commented 15 years ago

As I said, I actually see a speedup as well on a 32-bit build on a 64-bit CPU. So the current patch (30bit_longdigit13.patch) is fine.

pitrou commented 15 years ago

Some more benchmarks results (with 30bit_longdigit13.patch):

Victor's bench_int.py:
32-bit without patch: 1370.1 ms
32-bit with patch: 1197.8 ms (23% speedup)
64-bit without patch: 1357.6 ms
64-bit with patch: 981.6 ms (28% speedup)
calculating 2000 digits of pi (*):
32-bit without patch: 2.87 s.
32-bit with patch: 2.87 s. (0% speedup: ???)
64-bit without patch: 3.35 s.
64-bit with patch: 1.68 s. (50% speedup)

(*) using the following script adapted for py3k: http://shootout.alioth.debian.org/u64q/benchmark.php?test=pidigits&lang=python&id=1

pitrou commented 15 years ago

Here's the py3k version of pidigits.py.

mdickinson commented 15 years ago

Thanks, Antoine. I've reworked the configure stuff anyway: the decision about what size digits to use should take place in pyport.h rather than Include/longintrepr.h. Updated patches will arrive shortly!

mdickinson commented 15 years ago

Updated non-optimized patch. The only real change is that I've moved some of the configuration stuff around (so not worth re-benchmarking this one); I hope that I've now got the division of labour correct:

configure script simply parses the --enable-big-digits option and sets PYLONG_DIGIT_SIZE in pyconfig.h to 15 or 30, or leaves it undefined if that option wasn't given to configure
PC/pyconfig.h doesn't define PYLONG_DIGIT_SIZE
pyport.h chooses a suitable value for PYLONG_DIGIT_SIZE if it's not already defined
Include/longintrepr.h just follows orders: it expects PYLONG_DIGIT_SIZE to be defined already, and complains if PYLONG_DIGIT_SIZE=30 but the necessary integer types aren't available.

Thanks for all the benchmarking.

I'd probably better check on python-dev before pushing this in, since it's a new feature. I hope no-one wants a PEP. :-)

pitrou commented 15 years ago

The last patch (30bit_longdigit14.patch) is obviously missing some stuff, but other than that I think everything's fine and you could commit.

mdickinson commented 15 years ago

Oops. Here's the correct patch.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 15 years ago

Has any conclusion been reached wrt. overhead of 30-bit multiplication on 32-bit systems? IIUC, the single-digit multiplication is equivalent to the C program

unsigned long long m(unsigned long long a, unsigned long b)
{
        return a*b;
}

(i.e. one digit is cast into two digits, and multiplied with the other one). gcc 4.3.3, on x86, compiles this into

    movl    12(%esp), %eax
    movl    8(%esp), %ecx
    imull   %eax, %ecx
    mull    4(%esp)
    leal    (%ecx,%edx), %edx
    ret

In pseudo-code, this is

        tmp = high_a * b;
        high_res:low_res = low_a * b;
        high_res += tmp

So it does use two multiply instructions (plus an add), since one argument got cast into 64 bits.

VS2008 compiles it into

push    eax
push    ecx
push    0
push    edx
call    \_\_allmu

i.e. it widens both arguments to 64 bits, then calls a library routine.

mdickinson commented 15 years ago

unsigned long long m(unsigned long long a, unsigned long b) { return a*b; }

I think that's doing a 32 x 64 -> 64 multiplication; what's being used is more like this:

unsigned long long m(unsigned long a, unsigned long b) { return (unsigned long long)a*b; }

which gcc -O3 compiles to:

pushl   %ebp
movl    %esp, %ebp
movl    12(%ebp), %eax
mull    8(%ebp)
leave
ret

mdickinson commented 15 years ago

Patch uploaded to Rietveld (assuming that I did it right):

http://codereview.appspot.com/14105

mdickinson commented 15 years ago

It looks as though Visual Studio 2008 does the 'right' thing, too, at least in some circumstances. Here's some assembler output (MSVC Express Edition, 32-bit Windows XP / Macbook Pro).

; 3 : unsigned long long mul(unsigned long x, unsigned long y) {

push    ebp
mov ebp, esp

; 4 : return (unsigned long long)x * y;

mov eax, DWORD PTR \_x$[ebp]
mov ecx, DWORD PTR \_y$[ebp]
mul ecx

; 5 : }

pop ebp
ret 0

vstinner commented 15 years ago

Patch uploaded to Rietveld (assuming that I did it right): http://codereview.appspot.com/14105

Hehe, your configure's patch is too huge for Rietveld which displays a "MemoryError" :-) Bug reported at: http://code.google.com/p/rietveld/issues/detail?id=87

vstinner commented 15 years ago

Ok, let's try 30bit_longdigit14.patch:

patch -p0 \< 30bit_longdigit14.patch autoconf && autoheader ./configure && make

I'm using two computers:

marge: Pentium4, 32 bits, 3 GHz (32 bits)
lisa: Core Quad (Q9300), 64 bits, 2.5 GHz (64 bits)

Both uses 32 bits digit (and 64 bits twodigits), py3k trunk and Linux.

My bench_int.py:
- 32-bit without patch: 1670.7 ms
- 32-bit with patch: 1547.8 ms (+7.4%)
- 64-bit without patch: 885.2 ms
- 64-bit with patch: 627.1 ms (+29.2%)
pidigits 2000 (I removed the calls to print): lowest result on 5 runs:
- 32-bit without patch: 2991.5 ms
- 32-bit with patch: 3445.4 ms (-15.2%) SLOWER!
- 64-bit without patch: 1949.9 ms
- 64-bit with patch: 973.0 ms (+50.1%)
pybench.py (minimum total)
- 32-bit without patch: 9209 ms
- 32-bit with patch:
- 64-bit without patch: 4430 ms
- 64-bit with patch: 4330 ms (=)

pybench details:

Test 32 bits (without,patch) | 64 bits (without,patch) ----------------------------------------------------------------- CompareFloatsIntegers: 293ms 325ms | 113ms 96ms CompareIntegers: 188ms 176ms | 129ms 98ms DictWithIntegerKeys: 117ms 119ms | 73ms 69ms SimpleIntFloatArithmetic: 192ms 204ms | 84ms 80ms SimpleIntegerArithmetic: 188ms 196ms | 84ms 80ms -----------------------------------------------------------------

On 64 bits, all integer related tests are faster. On 32 bits, some tests are slower.

Sum up: on 64 bits, your patch is between cool (30%) and awesome (50%) :-) On 32 bits, it's not a good idea to use 32 bits digit because it's a little bit slower.

=> I would suggest to use 2^30 base only if sizeof(long)>=8 (64 bits CPU).

Note: I already get similar result (2^30 is slower on 32 bits CPU) in older tests.

vstinner commented 15 years ago

New attachment: pidigits_noprint.py, hacked version of pidigits.py to remove the print and add the computation time in millisecond. Print was useless because we don't want to benchmark int->str conversion, especially when the integer is in [0; 9].

mdickinson commented 15 years ago

Thanks very much for the timings, Victor.

Just out of interest, could you try the pydigits script with the +optimizations patch on 32-bit?

As mentioned above, there's a significant (for 30-bit digits) problem with x_divrem: the inner loop does a 32 x 64-bit multiply when it should be doing a 32 x 32-bit multiply (the variable q is declared as twodigits, but always fits into a digit). This is fixed in the +optimizations patch, and pi_digits is heavy on the divisions, so I wonder whether this might make a difference.

gpshead commented 15 years ago

I would suggest to use 2^30 base only if sizeof(long)>=8 (64 bits CPU).

Thats not the correct test. Test for an actual 64-bit build target. sizeof(long) and sizeof(long long) are not usefully related to that in any sort of cross platform manner. On windows, we'd define the flag for 15 vs 30 bit longs in the build configs for the various build targets. On every thing else (autoconf), we can use a configure test to check the same things that platform.architecture() checks to return '32bit' vs '64bit'.

mdickinson commented 15 years ago

Reviewers: Martin v. Löwis,

http://codereview.appspot.com/14105/diff/1/11 File Doc/library/sys.rst (right):

http://codereview.appspot.com/14105/diff/1/11#newcode418 Line 418: A struct sequence that holds information about Python's Agreed. All that's important here is the attribute access.

http://codereview.appspot.com/14105/diff/1/6 File Include/longintrepr.h (right):

http://codereview.appspot.com/14105/diff/1/6#newcode24 Line 24: Furthermore, NSMALLNEGINTS and NSMALLPOSINTS should fit in a digit. */ On 2009/02/17 22:39:18, Martin v. Löwis wrote:

Merge the comments into a single on. There is no need to preserve the evolution of the code in the comment structure.

Done, along with a general rewrite of this set of comments.

http://codereview.appspot.com/14105/diff/1/9 File Objects/longobject.c (right):

http://codereview.appspot.com/14105/diff/1/9#newcode2872 Line 2872: / XXX benchmark this! Is is worth keeping? \/ On 2009/02/17 22:39:18, Martin v. Löwis wrote:

Why not PyLong_FromLongLong if available (no special case if not)?

Yes, PyLong_FromLongLong would make sense. If this is not available, we still need to make sure that CHECK_SMALL_INT gets called.

http://codereview.appspot.com/14105/diff/1/10 File PC/pyconfig.h (right):

http://codereview.appspot.com/14105/diff/1/10#newcode318 Line 318: #define PY_UINT64T unsigned \_int64 On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I think this should use PY_LONGLONG, to support MingW32; likewise, \_int32 shouldn't be used, as it is MSC specific

Ok. I'll use PY_LONG_LONG for 64-bit, and try int and long for 32-bit.

http://codereview.appspot.com/14105/diff/1/2 File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160 Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p); On 2009/02/17 22:39:18, Martin v. Löwis wrote:

This needs to deal with overflow (sizeof(size_t) > sizeof(long))

Hmm. It looks as though there are many places in this file, particularly in w_object, that do "w_long((long)n, p), where n has type Py_ssize_t. Presumably all of these should be fixed.

http://codereview.appspot.com/14105/diff/1/2#newcode540 Line 540: if (n \< -INT_MAX || n > INT_MAX) On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I think this is obsolete now; longs can have up to ssize_t_max digits.

Agreed. Again, this needs to be fixed throughout marshal.c (many occurrences in r_object).

http://codereview.appspot.com/14105/diff/1/8 File configure.in (right):

http://codereview.appspot.com/14105/diff/1/8#newcode3132 Line 3132: # determine what size digit to use for Python's longs On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I'm skeptical (-0) that we really need to have such a configure option.

I think it's potentially useful to be able to do --disable-big-digits on platforms where the compiler isn't smart enough to translate a 32-bit by 32-bit multiply into the appropriate CPU instruction, so that using 30-bit digits might hurt performance.

I've also found it handy during debugging and testing. But I guess I'm only +0.5 on the configure option; if others think that it's just unnecessary clutter then I'll remove it.

http://codereview.appspot.com/14105/diff/1/14 File pyconfig.h.in (left):

http://codereview.appspot.com/14105/diff/1/14#oldcode9 Line 9: #undef AC_APPLE_UNIVERSAL_BUILD On 2009/02/17 22:39:18, Martin v. Löwis wrote:

We should find out why this is gone.

Looks like an autoconf 2.63/autoconf 2.61 difference. Whoever previously ran autoconf and autoheader used 2.63; I used 2.61. (Which explains the huge configure diff as well.)

Description: This patchset makes it possible for Python to use base 2**30 instead of base 2**15 for its internal representation of arbitrary-precision integers.

The aim is both to improve performance of integer arithmetic, and to make possible some additional optimizations (not currently included in this patchset).

The patchset includes:

a new configure option --enable-big-digits
a new structseq sys.int_info giving information about the internal representation

See http://bugs.python.org/issue4258 for the related tracker discussion.

Please review this at http://codereview.appspot.com/14105

Affected files: M Doc/library/sys.rst M Include/longintrepr.h M Include/longobject.h M Include/pyport.h M Lib/test/test_long.py M Lib/test/test_sys.py M Objects/longobject.c M PC/pyconfig.h M Python/marshal.c M Python/sysmodule.c M configure M configure.in M pyconfig.h.in

python / cpython

Use 30-bit digits instead of 15-bit digits for Python integers. #48508