Closed: vstinner closed this issue 7 years ago.
Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers ("int" type) if the operation does not overflow.
We lost these fast paths in Python 3 when we dropped the int type in favor of the long type.
Antoine Pitrou proposed a fast path, but only for the int singletons (integers in the range [-5; 255]): issue bpo-10044. His patch was rejected because it introduces undefined behaviour.
I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers.
Pseudo-code for BINARY_ADD:
```c
if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
    && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
{
    stwodigits a = ..., b = ...;
    stwodigits c;
    if (... a+b will not overflow ...) {
        c = a + b;
        return PyLong_FromLongLong(c);
    }
}
/* fall back to PyNumber_Add() */
```
The code can be copied from longobject.c; there are already fast paths for single-digit numbers. See for example long_mul():
```c
/* fast path for single-digit multiplication */
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
    ....
}
```
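For illustration, here is a rough sketch of how the pseudo-code above could be fleshed out, borrowing the single-digit extraction idea from longobject.c. This is only my reading of the proposal, not a patch; the helper name single_digit_add() is made up and error handling is simplified:

```c
#include "Python.h"
#include "longintrepr.h"   /* PyLongObject internals: ob_digit */

/* Hypothetical helper: add two exact ints that each have at most one
   digit.  Returns a new reference, or NULL meaning "not eligible,
   use the generic slow path" (NULL is not an error here). */
static PyObject *
single_digit_add(PyObject *left, PyObject *right)
{
    if (!PyLong_CheckExact(left) || !PyLong_CheckExact(right)
        || Py_ABS(Py_SIZE(left)) > 1 || Py_ABS(Py_SIZE(right)) > 1)
        return NULL;

    /* value of a 0- or 1-digit int: the sign comes from ob_size */
    long long a = Py_SIZE(left) == 0 ? 0 :
        (long long)Py_SIZE(left) * ((PyLongObject *)left)->ob_digit[0];
    long long b = Py_SIZE(right) == 0 ? 0 :
        (long long)Py_SIZE(right) * ((PyLongObject *)right)->ob_digit[0];

    /* each operand fits in one digit, so a + b cannot overflow long long */
    return PyLong_FromLongLong(a + b);
}
```

In BINARY_ADD this would be tried first and the result used when it is non-NULL; otherwise the existing PyNumber_Add() call runs unchanged.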
As with any other optimization, it should be proven to be faster with benchmarks.
On:
> if (... a+b will not overflow ...) {

Since you limited the optimization to single-digit numbers, overflow is impossible here, at least for addition and subtraction. The signed twodigit type you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30 bit) digits, if you used a fixed-width 32 bit type for addition/subtraction, and a fixed-width 64 bit type for multiplication, overflow would be impossible regardless of whether you used 15 or 30 bit digits.
On a related note: presumably you should check whether abs(size) <= 1 like in longobject.c, not == 1, or you omit the fast path for 0. It doesn't come up much and isn't worth paying extra to optimize, but it costs nothing to handle it.
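A quick way to convince yourself of that bound (my own illustration, not from the thread; it assumes 30-bit digits, the worst case, since 15-bit digits leave even more headroom):

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* largest magnitude a single 30-bit digit can hold */
    const int64_t max_digit = ((int64_t)1 << 30) - 1;

    assert(max_digit + max_digit <= INT32_MAX);   /* add/sub fits in 32 bits */
    assert(max_digit * max_digit <= INT64_MAX);   /* mul fits in 64 bits */
    return 0;
}
```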
Let's try. As I understand, bpo-10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.
Serhiy Storchaka added the comment:
> Let's try. As I understand, bpo-10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.
I read that Mark rejected issue bpo-10044 because it introduces undefined behaviour.
I'm not interested in working on this issue right now. If anyone is interested, please go ahead!
There also used to be a fast path for binary subscripts with integer indexes. I would like to see that performance regression fixed if it can be done cleanly.
So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups: timeit('x+x', 'x=10', number=10000000)

Before:
1.1934231410000393
1.1988609210002323
1.1998214110003573
1.206968028999654
1.2065417159997196

After:
1.1698650090002047
1.1705158909999227
1.1752884750003432
1.1744818619999933
1.1741297110002051
1.1760422649999782
Small improvement. Haven't looked at optimizing BINARY_SUBSCR yet.
Thank you Zach. I even found a small regression.
Before:
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.51 usec per loop
After:
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.6 usec per loop
bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.
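For reference, the kind of inlining meant here (a sketch, not the actual inline.patch; the helper name is made up) just reads the single digit directly instead of going through the generic overflow-checking path of PyLong_AsLong():

```c
#include "Python.h"
#include "longintrepr.h"

/* Hypothetical inlined replacement for PyLong_AsLong() when the caller
   already knows the int has 0 or 1 digit: no overflow check is needed. */
static inline long
single_digit_as_long(PyObject *v)
{
    Py_ssize_t size = Py_SIZE(v);    /* -1, 0 or 1 on this fast path */
    if (size == 0)
        return 0;
    return size < 0 ? -(long)((PyLongObject *)v)->ob_digit[0]
                    :  (long)((PyLongObject *)v)->ob_digit[0];
}
```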
Benchmark result (patch=21955.patch, inline=inline.patch):
Common platform:
Platform: Linux-3.14.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP-393
Timer: time.perf_counter

Platform of campaign orig:
Date: 2014-07-16 10:04:27
Python version: 3.5.0a0 (default:08b3ee523577, Jul 16 2014, 10:04:23) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577 tag=tip branch=default date="2014-07-15 13:23 +0300"
Timer precision: 40 ns

Platform of campaign patch:
Timer precision: 40 ns
Date: 2014-07-16 10:04:01
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:02:12) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

Platform of campaign inline:
Timer precision: 31 ns
Date: 2014-07-16 10:11:21
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:10:48) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

--------------------+-------------+----------------+----------------
Tests               | orig        | patch          | inline
--------------------+-------------+----------------+----------------
1+2                 | 23 ns (*)   | 24 ns          | 21 ns (-12%)
"1+2" ran 100 times | 1.61 us (*) | 1.74 us (+7%)  | 1.39 us (-14%)
--------------------+-------------+----------------+----------------
Total               | 1.64 us (*) | 1.76 us (+7%)  | 1.41 us (-14%)
--------------------+-------------+----------------+----------------
(I removed my message because I posted the wrong benchmark output, inline column was missing.)
Confirmed speedup of about 20%. Surprisingly it affects even integers outside of the range of preallocated small integers (-5...255).
Before:
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.143 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.247 usec per loop
After:
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.117 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.209 usec per loop
All measures are made with modified timeit (bpo-21988).
Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test...
I did something similar to BINARY_SUBSCR after looking at the 2.7 source as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.
Platform of campaign orig:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 15 2014, 18:11:28) [GCC 4.6.3]
Timer precision: 6 ns
Date: 2014-07-21 20:28:30

Platform of campaign patch:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 21 2014, 20:21:20) [GCC 4.6.3]
Timer precision: 20 ns
Date: 2014-07-21 20:28:39

---------------------+-------------+---------------
Tests                | orig        | patch
---------------------+-------------+---------------
1+2                  | 118 ns (*)  | 103 ns (-13%)
"1+2" ran 100 times  | 7.28 us (*) | 5.93 us (-19%)
x[1]                 | 120 ns (*)  | 98 ns (-19%)
"x[1]" ran 100 times | 7.35 us (*) | 5.31 us (-28%)
---------------------+-------------+---------------
Total                | 14.9 us (*) | 11.4 us (-23%)
---------------------+-------------+---------------
Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks
I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.
Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 Total CPU cores: 1
### 2to3 ### 24.789549 -> 24.809551: 1.00x slower
### call_method_slots ### Min: 1.743554 -> 1.780807: 1.02x slower Avg: 1.751735 -> 1.792814: 1.02x slower Significant (t=-26.32) Stddev: 0.00576 -> 0.01823: 3.1660x larger
### call_method_unknown ### Min: 1.828094 -> 1.739625: 1.05x faster Avg: 1.852225 -> 1.806721: 1.03x faster Significant (t=2.28) Stddev: 0.01874 -> 0.24320: 12.9783x larger
### call_simple ### Min: 1.353581 -> 1.263386: 1.07x faster Avg: 1.397946 -> 1.302046: 1.07x faster Significant (t=24.28) Stddev: 0.03667 -> 0.03154: 1.1629x smaller
### chaos ### Min: 1.199377 -> 1.115550: 1.08x faster Avg: 1.230859 -> 1.146573: 1.07x faster Significant (t=16.24) Stddev: 0.02663 -> 0.02525: 1.0544x smaller
### django_v2 ### Min: 2.682884 -> 2.633110: 1.02x faster Avg: 2.747521 -> 2.690486: 1.02x faster Significant (t=9.90) Stddev: 0.02744 -> 0.03010: 1.0970x larger
### fastpickle ### Min: 1.751475 -> 1.597340: 1.10x faster Avg: 1.771805 -> 1.613533: 1.10x faster Significant (t=64.81) Stddev: 0.01177 -> 0.01263: 1.0727x larger
### float ### Min: 1.254858 -> 1.293067: 1.03x slower Avg: 1.336045 -> 1.365787: 1.02x slower Significant (t=-3.30) Stddev: 0.04851 -> 0.04135: 1.1730x smaller
### json_dump_v2 ### Min: 17.871819 -> 16.968647: 1.05x faster Avg: 18.428747 -> 17.483397: 1.05x faster Significant (t=4.10) Stddev: 1.60617 -> 0.27655: 5.8078x smaller
### mako ### Min: 0.241614 -> 0.231678: 1.04x faster Avg: 0.253730 -> 0.240585: 1.05x faster Significant (t=8.93) Stddev: 0.01912 -> 0.01327: 1.4417x smaller
### mako_v2 ### Min: 0.225664 -> 0.213179: 1.06x faster Avg: 0.234850 -> 0.225984: 1.04x faster Significant (t=10.12) Stddev: 0.01379 -> 0.01391: 1.0090x larger
### meteor_contest ### Min: 0.777612 -> 0.758924: 1.02x faster Avg: 0.799580 -> 0.780897: 1.02x faster Significant (t=3.97) Stddev: 0.02482 -> 0.02212: 1.1221x smaller
### nbody ### Min: 0.969724 -> 0.883935: 1.10x faster Avg: 0.996416 -> 0.918375: 1.08x faster Significant (t=12.65) Stddev: 0.02426 -> 0.03627: 1.4951x larger
### nqueens ### Min: 1.142745 -> 1.128195: 1.01x faster Avg: 1.296659 -> 1.162443: 1.12x faster Significant (t=2.75) Stddev: 0.34462 -> 0.02680: 12.8578x smaller
### pickle_dict ### Min: 1.433264 -> 1.467394: 1.02x slower Avg: 1.468122 -> 1.506908: 1.03x slower Significant (t=-7.20) Stddev: 0.02695 -> 0.02691: 1.0013x smaller
### raytrace ### Min: 5.454853 -> 5.538799: 1.02x slower Avg: 5.530943 -> 5.676983: 1.03x slower Significant (t=-8.64) Stddev: 0.05152 -> 0.10791: 2.0947x larger
### regex_effbot ### Min: 0.205875 -> 0.194776: 1.06x faster Avg: 0.211118 -> 0.198759: 1.06x faster Significant (t=5.10) Stddev: 0.01305 -> 0.01112: 1.1736x smaller
### regex_v8 ### Min: 0.141628 -> 0.133819: 1.06x faster Avg: 0.147024 -> 0.140053: 1.05x faster Significant (t=2.72) Stddev: 0.01163 -> 0.01388: 1.1933x larger
### richards ### Min: 0.734472 -> 0.727501: 1.01x faster Avg: 0.760795 -> 0.743484: 1.02x faster Significant (t=3.50) Stddev: 0.02778 -> 0.02127: 1.3061x smaller
### silent_logging ### Min: 0.344678 -> 0.336087: 1.03x faster Avg: 0.357982 -> 0.347361: 1.03x faster Significant (t=2.76) Stddev: 0.01992 -> 0.01852: 1.0755x smaller
### simple_logging ### Min: 1.104831 -> 1.072921: 1.03x faster Avg: 1.146844 -> 1.117068: 1.03x faster Significant (t=4.02) Stddev: 0.03552 -> 0.03848: 1.0833x larger
### spectral_norm ### Min: 1.710336 -> 1.688910: 1.01x faster Avg: 1.872578 -> 1.738698: 1.08x faster Significant (t=2.35) Stddev: 0.40095 -> 0.03331: 12.0356x smaller
### tornado_http ### Min: 0.849374 -> 0.852209: 1.00x slower Avg: 0.955472 -> 0.916075: 1.04x faster Significant (t=4.82) Stddev: 0.07059 -> 0.04119: 1.7139x smaller
### unpack_sequence ### Min: 0.000030 -> 0.000020: 1.52x faster Avg: 0.000164 -> 0.000174: 1.06x slower Significant (t=-13.11) Stddev: 0.00011 -> 0.00013: 1.2256x larger
### unpickle_list ### Min: 1.333952 -> 1.212805: 1.10x faster Avg: 1.373228 -> 1.266677: 1.08x faster Significant (t=16.32) Stddev: 0.02894 -> 0.03597: 1.2428x larger
What's the status of this issue?
I haven't looked at it since I posted the benchmark results for 21955_2.patch.
Anybody still looking at this? I can take another stab at it if it's still in scope.
> Anybody still looking at this? I can take another stab at it if it's still in scope.
There were some visible speedups from your patch -- I think we should merge this optimization. Can you figure out why unpack_sequence and other benchmarks were slower?

> Can you figure out why unpack_sequence and other benchmarks were slower?

I didn't look really closely. A few of the slower ones were floating-point heavy, which would incur the slow-path penalty, but I can dig into unpack_sequence.
I'm assigning this patch to myself to commit it in 3.6 later.
I took another look at this, and tried applying it to 3.6 and running the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. Spectral norm makes sense, it's doing lots of FP addition. The unpack_sequence instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to unpack_sequence, load_fast, and store_fast, so I still don't know what's going on there.
Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here.
unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack". This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.
Zach, could you please run your benchmarks in rigorous mode (perf.py -r)? I'd also suggest experimenting with putting the baseline cpython as the first arg and as the second -- maybe your machine runs the second interpreter slightly faster.
I ran 6 benchmarks on my work machine (not the same one as the last set) overnight: two with just the BINARY_ADD change, two with the BINARY_SUBSCR change, and two with both. I'm attaching the output from all my benchmark runs, but here are the highlights. In this table I've flipped the results for the runs where the modified build was the reference, but in the new attachment, "slower" in the right column means faster, I think :)

BINARY_ADD
  Baseline reference:
    Faster: chameleon_v2, chaos, django, etree_generate, fannkuch, formatted_logging, go, json_load, regex_compile, simple_logging, spectral_norm
    Slower: etree_parse, nbody, normal_startup, pickle_dict, pickle_list, regex_effbot, unpickle_list
  Modified reference:
    Faster: chameleon_v2, fannkuch, normal_startup, nqueens, regex_compile, spectral_norm
    Slower: call_simple, nbody, pickle_dict, regex_v8

BINARY_SUBSCR
  Baseline reference:
    Faster: chameleon_v2, chaos, etree_generate, fannkuch, fastpickle, hexiom2, json_load, mako_v2, meteor_contest, nbody, regex_v8, spectral_norm
    Slower: call_simple, go, pickle_list, telco
  Modified reference:
    Faster: 2to3, call_method_slots, chaos, fannkuch, formatted_logging, go, hexiom2, mako_v2, meteor_contest, nbody, normal_startup, nqueens, pickle_list, simple_logging, spectral_norm, telco
    Slower: etree_parse, json_dump_v2, pickle_dict

BOTH
  Baseline reference:
    Faster: chameleon_v2, chaos, etree_generate, etree_process, fannkuch, fastunpickle, float, formatted_logging, hexiom2, nbody, nqueens, regex_v8, spectral_norm, unpickle_list
    Slower: call_simple, etree_parse, pathlib, pickle_list
  Modified reference:
    Faster: chameleon_v2, chaos, etree_generate, etree_process, fannkuch, float, formatted_logging, go, hexiom2, nbody, normal_startup, nqueens, simple_logging, spectral_norm
    Slower: fastpickle, pickle_dict, pickle_list, telco
unpack_sequence is nowhere to be seen and spectral_norm is faster now...
Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=. I also removed the optimization of [] operator -- that should be done in a separate patch and in a separate issue.
Some nano-benchmarks (best of 3):
python -m timeit "sum([x + x + 1 for x in range(100)])"
2.7: 7.71   3.5: 8.54   3.6: 7.33

python -m timeit "sum([x - x - 1 for x in range(100)])"
2.7: 7.81   3.5: 8.59   3.6: 7.57

python -m timeit "sum([x * x * 1 for x in range(100)])"
2.7: 9.28   3.5: 10.6   3.6: 9.44
Python 3.6 vs 3.5 (spectral_norm, rigorous run): Min: 0.315917 -> 0.276785: 1.14x faster Avg: 0.321006 -> 0.284909: 1.13x faster
Zach, thanks a lot for the research! I'm glad that unpack_sequence finally proved to be irrelevant. Could you please take a look at the updated patch?
> python -m timeit "sum([x * x * 1 for x in range(100)])"
If you only want to benchmark x*y, x+y and list-comprehension, you should use a tuple for the iterator.
> In this table I've flipped the results for the runs where the modified build was the reference, but in the new attachment, "slower" in the right column means faster, I think :)
I don't understand what this table means (why 4 columns?). Can you explain what you did?
> I don't understand what this table means (why 4 columns?). Can you explain what you did?
Yury suggested running perf.py twice with the binaries swapped. So "faster" and "slower" underneath "Baseline Reference" are runs where the unmodified python binary was the first argument to perf.py, and "Modified Reference" is where the patched binary is the first argument.
ie. "perf.py -r -b all python patched_python" vs "perf.py -r -b all patched_python python"
bench_results.txt has the actual output in it, and the "slower in the right column" comment was referring to the contents of that file, not the table. Sorry for the confusion.
> Yury suggested running perf.py twice with the binaries swapped
Yeah, I had some experience with perf.py where its results were skewed depending on what you test first. Hopefully Victor's new patch will fix that: http://bugs.python.org/issue26275
> Could you please take a look at the updated patch?

Looks ok to me, for whatever that's worth.
On 03/02/2016 18:21, Yury Selivanov wrote:
> > Yury suggested running perf.py twice with the binaries swapped
> Yeah, I had some experience with perf.py where its results were skewed depending on what you test first.
Have you tried disabling turbo on your CPU? (or any kind of power management that would change the CPU clock depending on the current workload)
On 03.02.2016 18:05, STINNER Victor wrote:
> python -m timeit "sum([x * x * 1 for x in range(100)])"
> If you only want to benchmark x*y, x+y and list-comprehension, you should use a tuple for the iterator.

... and precalculate that in the setup:

python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"

# python -m timeit "sum([x * x * 1 for x in range(100)])"
100000 loops, best of 3: 5.74 usec per loop
# python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
100000 loops, best of 3: 5.56 usec per loop
(python = Python 2.7)
Antoine, yeah, it's probably turbo boost related. There is no easy way to turn it off on mac os x, though. I hope Victor's patch to perf.py will help to mitigate this.
Victor, Marc-Andre,
Updated results of nano-bench (best of 10):
-m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
2.7: 8.5    3.5: 10.1   3.6: 8.91

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7: 7.27   3.5: 8.2    3.6: 7.13

-m timeit -s "loops=tuple(range(100))" "sum([x - x - 1 for x in loops])"
2.7: 7.01   3.5: 8.1    3.6: 6.95
Antoine, Serhiy, I'll upload a new patch soon. Probably Serhiy's idea of using a switch statement will make it slightly faster. I'll also add a fast path for integer division.
A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.
> A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.

Agree.
BTW, what do you think about using __int128 when available? That way we can also optimize twodigit PyLongs.
I don't think. I run benchmarks (for __int128) :-)
> I don't think. I run benchmarks (for __int128) :-)
Never mind... Seems that __int128 is still an experimental feature and some versions of clang even had bugs with it.
> BTW, what do you think about using __int128 when available? That way we can also optimize twodigit PyLongs.
__int128 is not always available, and it would add too much complexity for possibly less gain. There are many ways to optimize the code and we should choose those that have the best gain/complexity ratio.
Let's split the patch into smaller parts: 1) directly use the long-specialized functions in ceval.c, and 2) optimize the fast path in these functions; then test them separately and combined. Maybe only one of the two will add a gain.
Attaching a second version of the patch. (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).
I decided to add fast paths for floats & single-digit longs and their combinations. +, -, *, /, //, and their in-place versions are optimized now.
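The overall shape of such a fast path is roughly the following (a sketch only: fast_add() is a hypothetical name, the real patch also covers -, *, / and //, and single_digit_add() refers to the hypothetical helper sketched earlier above):

```c
static PyObject *
fast_add(PyObject *left, PyObject *right)
{
    /* exact single-digit ints first: the most common operands */
    PyObject *res = single_digit_add(left, right);
    if (res != NULL)
        return res;

    /* exact floats: the branch exercised by float-heavy benchmarks
       such as spectral_norm */
    if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right)) {
        return PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +
                                  PyFloat_AS_DOUBLE(right));
    }

    /* everything else (including int/float mixes in this simplified
       sketch) falls back to the generic protocol, so operator
       overloading keeps working */
    return PyNumber_Add(left, right);
}
```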
I'll have a full result of macro-benchmarks run tomorrow morning, but here's a result for spectral_norm (rigorous run, best of 3):
### spectral_norm ### Min: 0.300269 -> 0.233037: 1.29x faster Avg: 0.301700 -> 0.234282: 1.29x faster Significant (t=399.89) Stddev: 0.00147 -> 0.00083: 1.7619x smaller
Some nano-benchmarks (best of 3):
-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7: 7.23   3.5: 8.17   3.6: 7.57

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1.0 for x in loops])"
2.7: 9.08   3.5: 11.7   3.6: 7.22

-m timeit -s "loops=tuple(range(100))" "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])"
2.7: 17.9   3.5: 24.3   3.6: 11.8
On 04.02.2016 07:02, Yury Selivanov wrote:
> Attaching a second version of the patch. (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).
It would be better to consistently have the fast_*() helpers return -1 in case of an error, instead of -1 or 1.
Overall, I see two problems with doing too many of these fast paths:

- the ceval loop may no longer fit into the CPU cache on systems with small cache sizes, since the compiler will likely inline all the fast_*() functions (I guess it would be possible to simply eliminate all fast paths using a compile-time flag)
- maintenance will get more difficult

In a numerics-heavy application it's likely that all fast paths will trigger somewhere, but those applications will likely be better off using numpy or numba. For a text-heavy application such as a web server, only a few fast paths will trigger, so the various checks only add overhead.
Since 'a'+'b' is a very often used instruction type in the latter type of applications, please make sure that this fast path gets more priority in your patch.
Please also check the effects of the fast paths for cases where they don't trigger, e.g. 'a'+'b' or 'a'*2.
Thanks, -- Marc-Andre Lemburg eGenix.com
"In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."
Hum, I disagree. See benchmark results in other messages. Examples:
### django_v2 ### Min: 2.682884 -> 2.633110: 1.02x faster
### unpickle_list ### Min: 1.333952 -> 1.212805: 1.10x faster
These benchmarks are not written for numerics, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, like the str type.
```
+            if (Py_SIZE(left) != 0) {
+                if (Py_SIZE(right) != 0) {
+
+#ifdef HAVE_LONG_LONG
+                    mul = PyLong_FromLongLong(
+                        (long long)SINGLE_DIGIT_LONG_AS_LONG(left) *
+                        SINGLE_DIGIT_LONG_AS_LONG(right));
+#else
+                    mul = PyNumber_Multiply(left, right);
+#endif
```
Why don't you use the same code as long_mul()? (You need #include "longintrepr.h".)
```c
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
```
I guess that long_mul() was always well optimized; no need to experiment with something new.
On 04.02.2016 09:01, STINNER Victor wrote:
"In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."
Hum, I disagree. See benchmark results in other messages. Examples:
django_v2
Min: 2.682884 -> 2.633110: 1.02x faster
unpickle_list
Min: 1.333952 -> 1.212805: 1.10x faster
These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.
Sure, some integer math is used in text applications as well, e.g. for indexing, counting and slicing, but the patch puts more emphasis on numeric operations, e.g. fast_add() tests for integers and floats before then coming back to check for Unicode.
It would be interesting to know how often these paths trigger or not in the various benchmarks.
BTW: The django_v2 benchmark result does not really say anything much. Values of +/- 2% do not have much meaning in benchmark results :-)
I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.
The point of optimizing integers is that they are used for many purposes, not only "math" (e.g. indexing).
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.
Oh wait, maybe I misunderstood Marc-Andre's comment. If the question is only about floats: I'm ok with dropping the fast path for float. By the way, I would prefer to see PyLong_CheckExact() in the main loop, and to only call fast_mul() if both operands are Python ints.
fastint2.patch adds a small regression for string multiplication:
$ ./python -m timeit -s "x = 'x'" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched: 1.46 usec per loop
Patched: 1.54 usec per loop
Here is an alternative patch. It just uses the existing specialized functions for integers: long_add, long_sub and long_mul. It doesn't add a regression for the above example with string multiplication, and it looks faster than fastint2.patch for integer multiplication.
$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched: 0.887 usec per loop
fastint2.patch: 0.841 usec per loop
fastint_alt.patch: 0.804 usec per loop
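One way to picture the fastint_alt approach (my sketch, not the patch itself: long_add() and friends are static in longobject.c, so this sketch reaches them through the type's number slots, which resolve to those same functions):

```c
#include "Python.h"

/* Sketch: when both operands are exact ints, call the int type's own
   nb_add slot (i.e. long_add, which already has a single-digit fast
   path) and skip the generic cross-type dispatch in PyNumber_Add(). */
static PyObject *
fast_add_alt(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && PyLong_CheckExact(right)) {
        return PyLong_Type.tp_as_number->nb_add(left, right);
    }
    return PyNumber_Add(left, right);
}
```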
I prefer the fastint_alt.patch design, it's simpler. I added a comment on the review.
My numbers, best of 5 timeit runs:
$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.
I disagree.
30% faster floats (sic!) is a serious improvement, that shouldn't just be discarded. Many applications have floating point calculations one way or another, but don't use numpy because it's an overkill.
Python 2 is much faster than Python 3 on any kind of numeric calculations. This point is being frequently brought up in every python2 vs 3 debate. I think it's unacceptable.
> - the ceval loop may no longer fit into the CPU cache on systems with small cache sizes, since the compiler will likely inline all the fast_*() functions (I guess it would be possible to simply eliminate all fast paths using a compile-time flag)
That's speculation. It may still fit. Or it never really fitted; it's already huge. I tested the patch on an 8 year old desktop CPU, no performance degradation on our benchmarks.
### raytrace ### Avg: 1.858527 -> 1.652754: 1.12x faster
### nbody ### Avg: 0.310281 -> 0.285179: 1.09x faster
### float ### Avg: 0.392169 -> 0.358989: 1.09x faster
### chaos ### Avg: 0.355519 -> 0.326400: 1.09x faster
### spectral_norm ### Avg: 0.377147 -> 0.303928: 1.24x faster
### telco ### Avg: 0.012845 -> 0.013006: 1.01x slower
The last benchmark (telco) is especially interesting. It uses decimals for calculation, that means that it uses overloaded numeric operators. Still no significant performance degradation.
> - maintenance will get more difficult
The fast path for floats is easy to understand for every core dev who works with ceval; there is no rocket science there (if you want rocket science that is hard to maintain, look at generators/yield from). If you don't like inlining the floating-point calculations, we can make float_add, float_sub, float_div, and float_mul exported and use them in ceval.
Why not combine my patch and Serhiy's? First we check if left & right are both longs. Then we check if they are unicode (for +). And then we have a fastpath for floats.
> Why not combine my patch and Serhiy's? First we check if left & right are both longs. Then we check if they are unicode (for +). And then we have a fastpath for floats.
See my comment on Serhiy's patch. Maybe we can start by checking that the types of both operands are the same, and then use PyLong_CheckExact and PyUnicode_CheckExact.
With such a design, we could add a _PyFloat_Add(). But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.
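A sketch of that "same type first" ordering (illustrative only; the int and str branches are left as comments because they were covered earlier, and an empty branch simply falls through to the generic call):

```c
static PyObject *
fast_add_same_type(PyObject *left, PyObject *right)
{
    /* one pointer comparison rules out every mixed-type case
       before any CheckExact test runs */
    if (Py_TYPE(left) == Py_TYPE(right)) {
        if (PyLong_CheckExact(left)) {
            /* single-digit int fast path, as sketched earlier */
        }
        else if (PyUnicode_CheckExact(left)) {
            /* str + str fast path */
        }
        else if (PyFloat_CheckExact(left)) {
            return PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +
                                      PyFloat_AS_DOUBLE(right));
        }
    }
    /* subclasses, mixed types, and the branches not handled above */
    return PyNumber_Add(left, right);
}
```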
On 04/02/2016 14:54, Yury Selivanov wrote:
> 30% faster floats (sic!) is a serious improvement that shouldn't just be discarded. Many applications have floating point calculations one way or another, but don't use numpy because it's overkill.
Can you give an example of such an application and how it would actually benefit from "faster floats"? I'm curious why anyone who wants fast FP calculations would use pure Python with CPython...
Discarding Numpy because it's "overkill" sounds misguided to me. That's like discarding asyncio because it's "less overkill" to write your own select() loop. It's often far more productive to use the established, robust, optimized library rather than tweak your own low-level code.
(and Numpy is easier to learn than asyncio ;-))
I'm not violently opposing the patch, but I think maintenance effort devoted to such micro-optimizations is a bit wasted. And once you add such a micro-optimization, trying to remove it often faces a barrage of FUD about making Python slower, even if the micro-optimization is practically worthless.
> Python 2 is much faster than Python 3 on any kind of numeric calculations.
Actually, it shouldn't really be faster on FP calculations, since the float object hasn't changed (as opposed to int/long). So I'm skeptical of FP-heavy code that would have been made slower by Python 3 (unless there's also integer handling in that, e.g. indexing).
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/1st1'
closed_at =
created_at =
labels = ['interpreter-core', 'performance']
title = 'ceval.c: implement fast path for integers with a single digit'
updated_at =
user = 'https://github.com/vstinner'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'yselivanov'
closed = True
closed_date =
closer = 'vstinner'
components = ['Interpreter Core']
creation =
creator = 'vstinner'
dependencies = []
files = ['35961', '35964', '35965', '36021', '41795', '41796', '41799', '41801', '41807', '41811', '41815', '41821', '41822', '41823', '41829', '41830', '41831', '41832', '41843', '41882', '45150']
hgrepos = []
issue_num = 21955
keywords = ['patch']
message_count = 111.0
messages = ['222731', '222804', '222824', '222829', '222830', '222985', '223162', '223177', '223180', '223186', '223214', '223623', '223711', '223726', '238437', '238455', '258057', '258060', '258062', '259417', '259428', '259429', '259431', '259490', '259491', '259493', '259494', '259495', '259496', '259497', '259499', '259500', '259502', '259503', '259505', '259506', '259508', '259509', '259530', '259540', '259541', '259542', '259545', '259549', '259552', '259554', '259560', '259562', '259563', '259564', '259565', '259567', '259568', '259571', '259573', '259574', '259577', '259578', '259601', '259605', '259607', '259612', '259614', '259626', '259663', '259664', '259666', '259667', '259668', '259669', '259670', '259671', '259672', '259673', '259675', '259678', '259695', '259702', '259706', '259707', '259712', '259713', '259714', '259722', '259729', '259730', '259733', '259734', '259743', '259790', '259791', '259792', '259793', '259800', '259801', '259804', '259832', '259859', '259860', '259918', '259919', '259948', '259999', '264018', '264019', '279021', '279022', '279023', '279026', '279027', '377777']
nosy_count = 13.0
nosy_names = ['lemburg', 'rhettinger', 'mark.dickinson', 'pitrou', 'vstinner', 'casevh', 'skrah', 'Yury.Selivanov', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'zbyrne']
pr_nums = ['22481']
priority = 'normal'
resolution = 'rejected'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue21955'
versions = ['Python 3.6']
```