python / cpython

The Python programming language
https://www.python.org/

ceval.c: implement fast path for integers with a single digit #66154

Closed vstinner closed 7 years ago

vstinner commented 9 years ago
BPO 21955
Nosy @malemburg, @rhettinger, @mdickinson, @pitrou, @vstinner, @skrah, @serhiy-storchaka, @1st1, @MojoVampire
PRs
  • python/cpython#22481
Files
  • 21955.patch
  • bench_long.py
  • inline.patch
  • 21955_2.patch
  • bench_results.txt
  • fastint1.patch
  • fastint2.patch
  • fastint_alt.patch
  • fastintfloat_alt.patch
  • fastint4.patch
  • fastint5.patch
  • bench_long2.py
  • compare.txt
  • compare_to.txt
  • fastint5_2.patch
  • fastint5_3.patch
  • fastint5_4.patch
  • inline-2.patch
  • fastint6.patch
  • mpmath_bench.py
  • fastint6_inline2_json.tar.gz
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = 'https://github.com/1st1'
    closed_at =
    created_at =
    labels = ['interpreter-core', 'performance']
    title = 'ceval.c: implement fast path for integers with a single digit'
    updated_at =
    user = 'https://github.com/vstinner'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'vstinner'
    assignee = 'yselivanov'
    closed = True
    closed_date =
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation =
    creator = 'vstinner'
    dependencies = []
    files = ['35961', '35964', '35965', '36021', '41795', '41796', '41799', '41801', '41807', '41811', '41815', '41821', '41822', '41823', '41829', '41830', '41831', '41832', '41843', '41882', '45150']
    hgrepos = []
    issue_num = 21955
    keywords = ['patch']
    message_count = 111.0
    messages = ['222731', '222804', '222824', '222829', '222830', '222985', '223162', '223177', '223180', '223186', '223214', '223623', '223711', '223726', '238437', '238455', '258057', '258060', '258062', '259417', '259428', '259429', '259431', '259490', '259491', '259493', '259494', '259495', '259496', '259497', '259499', '259500', '259502', '259503', '259505', '259506', '259508', '259509', '259530', '259540', '259541', '259542', '259545', '259549', '259552', '259554', '259560', '259562', '259563', '259564', '259565', '259567', '259568', '259571', '259573', '259574', '259577', '259578', '259601', '259605', '259607', '259612', '259614', '259626', '259663', '259664', '259666', '259667', '259668', '259669', '259670', '259671', '259672', '259673', '259675', '259678', '259695', '259702', '259706', '259707', '259712', '259713', '259714', '259722', '259729', '259730', '259733', '259734', '259743', '259790', '259791', '259792', '259793', '259800', '259801', '259804', '259832', '259859', '259860', '259918', '259919', '259948', '259999', '264018', '264019', '279021', '279022', '279023', '279026', '279027', '377777']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'rhettinger', 'mark.dickinson', 'pitrou', 'vstinner', 'casevh', 'skrah', 'Yury.Selivanov', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'zbyrne']
    pr_nums = ['22481']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue21955'
    versions = ['Python 3.6']
    ```

    vstinner commented 9 years ago

    Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers ("int" type) if the operation does not overflow.

    We lost this fast path in Python 3 when we dropped the int type in favor of the long type.

    Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]): issue bpo-10044. His patch was rejected because it introduces undefined behaviour.

    I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers.

    Pseudo-code for BINARY_ADD:

    if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
        && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
    {
       stwodigits a = ..., b = ...;
       stwodigits c;
       if (... a+b will not overflow ...) { 
          c = a + b;
          return PyLong_FromLongLong(c);
       }
    }
    /* fall back to PyNumber_Add() */
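
    Fleshed out a bit, the fast path could look like the sketch below (illustrative only, not an actual patch; medium_value() is a helper modelled on the private MEDIUM_VALUE() macro in Objects/longobject.c):

    ```c
    #include "Python.h"
    #include "longintrepr.h"

    /* Return the value of a long known to have at most one digit. */
    static stwodigits
    medium_value(PyObject *op)              /* assumes Py_ABS(Py_SIZE(op)) <= 1 */
    {
        PyLongObject *v = (PyLongObject *)op;
        Py_ssize_t size = Py_SIZE(v);
        if (size == 0)
            return 0;
        return size < 0 ? -(stwodigits)v->ob_digit[0]
                        : (stwodigits)v->ob_digit[0];
    }

    /* What BINARY_ADD could do before falling back to PyNumber_Add(). */
    static PyObject *
    binary_add_fast(PyObject *x, PyObject *y)
    {
        if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) <= 1 &&
            PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) <= 1)
        {
            /* The sum of two single-digit values always fits in stwodigits,
               so addition itself cannot overflow here. */
            stwodigits c = medium_value(x) + medium_value(y);
            return PyLong_FromLongLong((long long)c);
        }
        return PyNumber_Add(x, y);          /* generic slow path */
    }
    ```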

    The code can be copied from longobject.c, which already has fast paths for single-digit numbers. See for example long_mul():

        /* fast path for single-digit multiplication */
        if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
            ....
        }

    As with any other optimization, it should be proven to be faster with benchmarks.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

    On: if (... a+b will not overflow ...) {

    Since you limited the optimization for addition to single digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigit you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30 bit) digits, if you used a fixed width 32 bit type for addition/subtraction, and a fixed width 64 bit type for multiplication, overflow would be impossible regardless of whether you used 15 or 30 bit digits.
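
    To make that concrete (an illustrative check, not part of any patch here), the bounds work out even for 30-bit digits:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* With 30-bit digits the largest single-digit magnitude is 2**30 - 1,
       so a sum of two such values fits in a signed 32-bit type and a
       product fits comfortably in a signed 64-bit type. */
    #define MAX_SINGLE_DIGIT ((int64_t)(1 << 30) - 1)

    static_assert(MAX_SINGLE_DIGIT + MAX_SINGLE_DIGIT <= INT32_MAX,
                  "single-digit sum fits in 32 bits");
    static_assert(MAX_SINGLE_DIGIT * MAX_SINGLE_DIGIT <= INT64_MAX,
                  "single-digit product fits in 64 bits");
    ```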

    On a related note: Presumably you should check if the abs(size) <= 1 like in longobject.c, not == 1, or you omit the fast path for 0. Doesn't come up much, not worth paying extra to optimize, but it costs nothing to handle it.

    serhiy-storchaka commented 9 years ago

    Let's try. As I understand, bpo-10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.

    vstinner commented 9 years ago

    Serhiy Storchaka added the comment:

    Let's try. As I understand, bpo-10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.

    I read that Mark rejected issue bpo-10044 because it introduces undefined behaviour.

    vstinner commented 9 years ago

    I'm not interested in working on this issue right now. If anyone is interested, please go ahead!

    rhettinger commented 9 years ago

    There also used to be a fast path for binary subscripts with integer indexes. I would like to see that performance regression fixed if it can be done cleanly.
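
    For reference, a hedged sketch of what such a subscript fast path could look like in Python 3, modelled on the Python 2 eval loop (names here are illustrative, not an actual patch):

    ```c
    #include "Python.h"

    static PyObject *
    binary_subscr_fast(PyObject *container, PyObject *sub)
    {
        if (PyList_CheckExact(container) && PyLong_CheckExact(sub) &&
            Py_ABS(Py_SIZE(sub)) <= 1)            /* index fits in one digit */
        {
            Py_ssize_t i = PyLong_AsSsize_t(sub); /* cannot fail for one digit */
            if (i < 0)
                i += PyList_GET_SIZE(container);
            if (i >= 0 && i < PyList_GET_SIZE(container)) {
                PyObject *res = PyList_GET_ITEM(container, i);
                Py_INCREF(res);
                return res;
            }
            /* out of range: fall through so the generic path raises IndexError */
        }
        return PyObject_GetItem(container, sub);  /* generic slow path */
    }
    ```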

    c213c52e-0df0-4e83-9769-cd3941352320 commented 9 years ago

    So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups: timeit('x+x', 'x=10', number=10000000)

    Before:
    1.1934231410000393
    1.1988609210002323
    1.1998214110003573
    1.206968028999654
    1.2065417159997196

    After:
    1.1698650090002047
    1.1705158909999227
    1.1752884750003432
    1.1744818619999933
    1.1741297110002051
    1.1760422649999782

    Small improvement. Haven't looked at optimizing BINARY_SUBSCR yet.

    serhiy-storchaka commented 9 years ago

    Thank you Zach. I found even a small regression.

    Before:

    $ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
    1000000 loops, best of 3: 1.51 usec per loop

    After:

    $ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
    1000000 loops, best of 3: 1.6 usec per loop
    vstinner commented 9 years ago

    bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digits. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.
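
    Roughly, the inlined single-digit read amounts to something like this (a sketch; the helper name is illustrative and inline.patch may differ in detail):

    ```c
    #include "Python.h"
    #include "longintrepr.h"

    /* For an int known to have at most one digit, PyLong_AsLong() collapses
       to reading that digit and applying the sign. */
    static long
    single_digit_as_long(PyLongObject *v)   /* assumes Py_ABS(Py_SIZE(v)) <= 1 */
    {
        switch (Py_SIZE(v)) {
        case -1: return -(long)v->ob_digit[0];
        case  0: return 0;
        default: return (long)v->ob_digit[0];   /* size == 1 */
        }
    }
    ```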

    Benchmark result (patch=21955.patch, inline=inline.patch):

    Common platform:
    Platform: Linux-3.14.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
    CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Python unicode implementation: PEP-393
    Timer: time.perf_counter

    Platform of campaign orig:
    Date: 2014-07-16 10:04:27
    Python version: 3.5.0a0 (default:08b3ee523577, Jul 16 2014, 10:04:23) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577 tag=tip branch=default date="2014-07-15 13:23 +0300"
    Timer precision: 40 ns

    Platform of campaign patch:
    Timer precision: 40 ns
    Date: 2014-07-16 10:04:01
    Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:02:12) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

    Platform of campaign inline:
    Timer precision: 31 ns
    Date: 2014-07-16 10:11:21
    Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:10:48) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

    --------------------+-------------+---------------+---------------
    Tests               |        orig |         patch |         inline
    --------------------+-------------+---------------+---------------
    1+2                 |   23 ns (*) |         24 ns |   21 ns (-12%)
    "1+2" ran 100 times | 1.61 us (*) | 1.74 us (+7%) | 1.39 us (-14%)
    --------------------+-------------+---------------+---------------
    Total               | 1.64 us (*) | 1.76 us (+7%) | 1.41 us (-14%)
    --------------------+-------------+---------------+---------------

    (I removed my message because I posted the wrong benchmark output, inline column was missing.)

    serhiy-storchaka commented 9 years ago

    Confirmed a speed up of about 20%. Surprisingly, it affects even integers outside of the range of preallocated small integers (-5...255).

    Before:

    $ ./python -m timeit -s "x=10"  "x+x"
    10000000 loops, best of 3: 0.143 usec per loop
    $ ./python -m timeit -s "x=1000"  "x+x"
    1000000 loops, best of 3: 0.247 usec per loop

    After:

    $ ./python -m timeit -s "x=10"  "x+x"
    10000000 loops, best of 3: 0.117 usec per loop
    $ ./python -m timeit -s "x=1000"  "x+x"
    1000000 loops, best of 3: 0.209 usec per loop

    All measures are made with modified timeit (bpo-21988).

    c213c52e-0df0-4e83-9769-cd3941352320 commented 9 years ago

    Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test...

    c213c52e-0df0-4e83-9769-cd3941352320 commented 9 years ago

    I did something similar for BINARY_SUBSCR after looking at the 2.7 source, as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

    Platform of campaign orig:
    Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 15 2014, 18:11:28) [GCC 4.6.3]
    Timer precision: 6 ns
    Date: 2014-07-21 20:28:30

    Platform of campaign patch:
    Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 21 2014, 20:21:20) [GCC 4.6.3]
    Timer precision: 20 ns
    Date: 2014-07-21 20:28:39

    ---------------------+-------------+---------------
    Tests                |        orig |          patch
    ---------------------+-------------+---------------
    1+2                  |  118 ns (*) |  103 ns (-13%)
    "1+2" ran 100 times  | 7.28 us (*) | 5.93 us (-19%)
    x[1]                 |  120 ns (*) |   98 ns (-19%)
    "x[1]" ran 100 times | 7.35 us (*) | 5.31 us (-28%)
    ---------------------+-------------+---------------
    Total                | 14.9 us (*) | 11.4 us (-23%)
    ---------------------+-------------+---------------

    pitrou commented 9 years ago

    Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks

    c213c52e-0df0-4e83-9769-cd3941352320 commented 9 years ago

    I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

    Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 Total CPU cores: 1

    ### 2to3 ### 24.789549 -> 24.809551: 1.00x slower

    ### call_method_slots ### Min: 1.743554 -> 1.780807: 1.02x slower Avg: 1.751735 -> 1.792814: 1.02x slower Significant (t=-26.32) Stddev: 0.00576 -> 0.01823: 3.1660x larger

    ### call_method_unknown ### Min: 1.828094 -> 1.739625: 1.05x faster Avg: 1.852225 -> 1.806721: 1.03x faster Significant (t=2.28) Stddev: 0.01874 -> 0.24320: 12.9783x larger

    ### call_simple ### Min: 1.353581 -> 1.263386: 1.07x faster Avg: 1.397946 -> 1.302046: 1.07x faster Significant (t=24.28) Stddev: 0.03667 -> 0.03154: 1.1629x smaller

    ### chaos ### Min: 1.199377 -> 1.115550: 1.08x faster Avg: 1.230859 -> 1.146573: 1.07x faster Significant (t=16.24) Stddev: 0.02663 -> 0.02525: 1.0544x smaller

    ### django_v2 ### Min: 2.682884 -> 2.633110: 1.02x faster Avg: 2.747521 -> 2.690486: 1.02x faster Significant (t=9.90) Stddev: 0.02744 -> 0.03010: 1.0970x larger

    ### fastpickle ### Min: 1.751475 -> 1.597340: 1.10x faster Avg: 1.771805 -> 1.613533: 1.10x faster Significant (t=64.81) Stddev: 0.01177 -> 0.01263: 1.0727x larger

    ### float ### Min: 1.254858 -> 1.293067: 1.03x slower Avg: 1.336045 -> 1.365787: 1.02x slower Significant (t=-3.30) Stddev: 0.04851 -> 0.04135: 1.1730x smaller

    ### json_dump_v2 ### Min: 17.871819 -> 16.968647: 1.05x faster Avg: 18.428747 -> 17.483397: 1.05x faster Significant (t=4.10) Stddev: 1.60617 -> 0.27655: 5.8078x smaller

    ### mako ### Min: 0.241614 -> 0.231678: 1.04x faster Avg: 0.253730 -> 0.240585: 1.05x faster Significant (t=8.93) Stddev: 0.01912 -> 0.01327: 1.4417x smaller

    ### mako_v2 ### Min: 0.225664 -> 0.213179: 1.06x faster Avg: 0.234850 -> 0.225984: 1.04x faster Significant (t=10.12) Stddev: 0.01379 -> 0.01391: 1.0090x larger

    ### meteor_contest ### Min: 0.777612 -> 0.758924: 1.02x faster Avg: 0.799580 -> 0.780897: 1.02x faster Significant (t=3.97) Stddev: 0.02482 -> 0.02212: 1.1221x smaller

    ### nbody ### Min: 0.969724 -> 0.883935: 1.10x faster Avg: 0.996416 -> 0.918375: 1.08x faster Significant (t=12.65) Stddev: 0.02426 -> 0.03627: 1.4951x larger

    ### nqueens ### Min: 1.142745 -> 1.128195: 1.01x faster Avg: 1.296659 -> 1.162443: 1.12x faster Significant (t=2.75) Stddev: 0.34462 -> 0.02680: 12.8578x smaller

    ### pickle_dict ### Min: 1.433264 -> 1.467394: 1.02x slower Avg: 1.468122 -> 1.506908: 1.03x slower Significant (t=-7.20) Stddev: 0.02695 -> 0.02691: 1.0013x smaller

    ### raytrace ### Min: 5.454853 -> 5.538799: 1.02x slower Avg: 5.530943 -> 5.676983: 1.03x slower Significant (t=-8.64) Stddev: 0.05152 -> 0.10791: 2.0947x larger

    ### regex_effbot ### Min: 0.205875 -> 0.194776: 1.06x faster Avg: 0.211118 -> 0.198759: 1.06x faster Significant (t=5.10) Stddev: 0.01305 -> 0.01112: 1.1736x smaller

    ### regex_v8 ### Min: 0.141628 -> 0.133819: 1.06x faster Avg: 0.147024 -> 0.140053: 1.05x faster Significant (t=2.72) Stddev: 0.01163 -> 0.01388: 1.1933x larger

    ### richards ### Min: 0.734472 -> 0.727501: 1.01x faster Avg: 0.760795 -> 0.743484: 1.02x faster Significant (t=3.50) Stddev: 0.02778 -> 0.02127: 1.3061x smaller

    ### silent_logging ### Min: 0.344678 -> 0.336087: 1.03x faster Avg: 0.357982 -> 0.347361: 1.03x faster Significant (t=2.76) Stddev: 0.01992 -> 0.01852: 1.0755x smaller

    ### simple_logging ### Min: 1.104831 -> 1.072921: 1.03x faster Avg: 1.146844 -> 1.117068: 1.03x faster Significant (t=4.02) Stddev: 0.03552 -> 0.03848: 1.0833x larger

    ### spectral_norm ### Min: 1.710336 -> 1.688910: 1.01x faster Avg: 1.872578 -> 1.738698: 1.08x faster Significant (t=2.35) Stddev: 0.40095 -> 0.03331: 12.0356x smaller

    ### tornado_http ### Min: 0.849374 -> 0.852209: 1.00x slower Avg: 0.955472 -> 0.916075: 1.04x faster Significant (t=4.82) Stddev: 0.07059 -> 0.04119: 1.7139x smaller

    ### unpack_sequence ### Min: 0.000030 -> 0.000020: 1.52x faster Avg: 0.000164 -> 0.000174: 1.06x slower Significant (t=-13.11) Stddev: 0.00011 -> 0.00013: 1.2256x larger

    ### unpickle_list ### Min: 1.333952 -> 1.212805: 1.10x faster Avg: 1.373228 -> 1.266677: 1.08x faster Significant (t=16.32) Stddev: 0.02894 -> 0.03597: 1.2428x larger

    vstinner commented 9 years ago

    What's the status of this issue?

    c213c52e-0df0-4e83-9769-cd3941352320 commented 9 years ago

    I haven't looked at it since I posted the benchmark results for 21955_2.patch.

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago

    Anybody still looking at this? I can take another stab at it if it's still in scope.

    1st1 commented 8 years ago

    Anybody still looking at this? I can take another stab at it if it's still in scope.

    There were some visible speedups from your patch -- I think we should merge this optimization. Can you figure out why unpack_sequence and other benchmarks were slower?

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago

    > Can you figure out why unpack_sequence and other benchmarks were slower?

    I didn't look really closely. A few of the slower ones were floating point heavy, which would incur the slow path penalty, but I can dig into unpack_sequence.

    1st1 commented 8 years ago

    I'm assigning this patch to myself to commit it in 3.6 later.

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago

    I took another look at this, and tried applying it to 3.6 and running the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. Spectral norm makes sense, it's doing lots of FP addition. The unpack_sequence instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to unpack_sequence, load_fast, and store_fast, so I still don't know what's going on there.

    pitrou commented 8 years ago

    Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here.

    1st1 commented 8 years ago

    unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack". This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.

    Zach, could you please run your benchmarks in rigorous mode (perf.py -r)? I'd also suggest to experiment with putting the baseline cpython as a first arg and as a second -- maybe your machine runs the second interpreter slightly faster.

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago
    I ran 6 benchmarks on my work machine (not the same one as the last set) overnight: two with just the BINARY_ADD change, two with the BINARY_SUBSCR change, and two with both. I'm attaching the output from all my benchmark runs, but here are the highlights. In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)

    ------------------ -------------------- -------------------- -------------------- --------------------
    Build              Baseline Reference                         Modified Reference
                       Faster               Slower               Faster               Slower
    ------------------ -------------------- -------------------- -------------------- --------------------
    BINARY_ADD         chameleon_v2         etree_parse          chameleon_v2         call_simple
                       chaos                nbody                fannkuch             nbody
                       django               normal_startup       normal_startup       pickle_dict
                       etree_generate       pickle_dict          nqueens              regex_v8
                       fannkuch             pickle_list          regex_compile
                       formatted_logging    regex_effbot         spectral_norm
                       go                   unpickle_list
                       json_load
                       regex_compile
                       simple_logging
                       spectral_norm
    ------------------ -------------------- -------------------- -------------------- --------------------
    BINARY_SUBSCR      chameleon_v2         call_simple          2to3                 etree_parse
                       chaos                go                   call_method_slots    json_dump_v2
                       etree_generate       pickle_list          chaos                pickle_dict
                       fannkuch             telco                fannkuch
                       fastpickle                                formatted_logging
                       hexiom2                                   go
                       json_load                                 hexiom2
                       mako_v2                                   mako_v2
                       meteor_contest                            meteor_contest
                       nbody                                     nbody
                       regex_v8                                  normal_startup
                       spectral_norm                             nqueens
                                                                 pickle_list
                                                                 simple_logging
                                                                 spectral_norm
                                                                 telco
    ------------------ -------------------- -------------------- -------------------- --------------------
    BOTH               chameleon_v2         call_simple          chameleon_v2         fastpickle
                       chaos                etree_parse          chaos                pickle_dict
                       etree_generate       pathlib              etree_generate       pickle_list
                       etree_process        pickle_list          etree_process        telco
                       fannkuch                                  fannkuch
                       fastunpickle                              float
                       float                                     formatted_logging
                       formatted_logging                         go
                       hexiom2                                   hexiom2
                       nbody                                     nbody
                       nqueens                                   normal_startup
                       regex_v8                                  nqueens
                       spectral_norm                             simple_logging
                       unpickle_list                             spectral_norm
    ------------------ -------------------- -------------------- -------------------- --------------------

    unpack_sequence is nowhere to be seen and spectral_norm is faster now...

    1st1 commented 8 years ago

    Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=. I also removed the optimization of the [] operator -- that should be done in a separate patch and in a separate issue.

    Some nano-benchmarks (best of 3):

    python -m timeit "sum([x + x + 1 for x in range(100)])" 2.7: 7.71 3.5: 8.54 3.6: 7.33

    python -m timeit "sum([x - x - 1 for x in range(100)])" 2.7: 7.81 3.5: 8.59 3.6: 7.57

    python -m timeit "sum([x x 1 for x in range(100)])" 2.7: 9.28 3.5: 10.6 3.6: 9.44

    Python 3.6 vs 3.5 (spectral_norm, rigorous run): Min: 0.315917 -> 0.276785: 1.14x faster Avg: 0.321006 -> 0.284909: 1.13x faster

    Zach, thanks a lot for the research! I'm glad that unpack_sequence finally proved to be irrelevant. Could you please take a look at the updated patch?

    vstinner commented 8 years ago

    python -m timeit "sum([x x 1 for x in range(100)])"

    If you only want to benchmark x*y, x+y and list-comprehension, you should use a tuple for the iterator.

    pitrou commented 8 years ago

    > In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)

    I don't understand what this table means (why 4 columns?). Can you explain what you did?

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago

    I don't understand what this table means (why 4 columns?). Can you explain what you did?

    Yury suggested running perf.py twice with the binaries swapped. So "faster" and "slower" underneath "Baseline Reference" are runs where the unmodified python binary was the first argument to perf, and the "Modified Reference" is where the patched binary is the first argument.

    ie. "perf.py -r -b all python patched_python" vs "perf.py -r -b all patched_python python"

    bench_results.txt has the actual output in it, and the "slower in the right column" comment was referring to the contents of that file, not the table. Sorry for the confusion.

    1st1 commented 8 years ago

    Yury suggested running perf.py twice with the binaries swapped

    Yeah, I had some experience with perf.py when its results were skewed depending on what you test first. Hopefully Victor's new patch will fix that http://bugs.python.org/issue26275

    c213c52e-0df0-4e83-9769-cd3941352320 commented 8 years ago

    > Could you please take a look at the updated patch?

    Looks ok to me, for whatever that's worth.

    pitrou commented 8 years ago

    On 03/02/2016 18:21, Yury Selivanov wrote:

    Yury Selivanov added the comment:

    > Yury suggested running perf.py twice with the binaries swapped

    Yeah, I had some experience with perf.py when its results were skewed depending on what you test first.

    Have you tried disabling turbo on your CPU? (or any kind of power management that would change the CPU clock depending on the current workload)

    malemburg commented 8 years ago

    On 03.02.2016 18:05, STINNER Victor wrote:

    > python -m timeit "sum([x * x * 1 for x in range(100)])"

    If you only want to benchmark x*y, x+y and list-comprehension, you should use a tuple for the iterator.

    ... and precalculate that in the setup:

    python -m timeit -s "loops=tuple(range(100))" "sum([x x 1 for x in loops])"

    # python -m timeit "sum([x * x * 1 for x in range(100)])"
    100000 loops, best of 3: 5.74 usec per loop
    # python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
    100000 loops, best of 3: 5.56 usec per loop

    (python = Python 2.7)

    1st1 commented 8 years ago

    Antoine, yeah, it's probably turbo boost related. There is no easy way to turn it off on mac os x, though. I hope Victor's patch to perf.py will help to mitigate this.

    Victor, Marc-Andre,

    Updated results of nano-bench (best of 10):

    -m timeit -s "loops=tuple(range(100))" "sum([x x 1 for x in loops])" 2.7 8.5 3.5 10.1 3.6 8.91

    -m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])" 2.7 7.27 3.5 8.2 3.6 7.13

    -m timeit -s "loops=tuple(range(100))" "sum([x - x - 1 for x in loops])" 2.7 7.01 3.5 8.1 3.6 6.95

    Antoine, Serhiy, I'll upload a new patch soon. Probably Serhiy's idea of using a switch statement will make it slightly faster. I'll also add a fast path for integer division.

    serhiy-storchaka commented 8 years ago

    Fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact int, and apply the switch optimization inside.
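
    A minimal sketch of that dispatch, assuming long_mul() is made callable from ceval.c under a hypothetical name _PyLong_Multiply() (the name is illustrative only):

    ```c
    #include "Python.h"

    /* Hypothetical export of long_mul() from Objects/longobject.c. */
    PyObject *_PyLong_Multiply(PyObject *left, PyObject *right);

    static PyObject *
    binary_multiply_dispatch(PyObject *left, PyObject *right)
    {
        if (PyLong_CheckExact(left) && PyLong_CheckExact(right)) {
            /* long_mul() already contains the single-digit fast path,
               so ceval only needs to route exact ints to it. */
            return _PyLong_Multiply(left, right);
        }
        return PyNumber_Multiply(left, right);   /* every other type */
    }
    ```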

    1st1 commented 8 years ago

    Fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact int, and apply the switch optimization inside.

    Agree.

    BTW, what do you think about using __int128 when available? That way we can also optimize twodigit PyLongs.

    vstinner commented 8 years ago

    I don't think. I run benchmarks (for __int128) :-)

    1st1 commented 8 years ago

    I don't think. I run benchmarks (for __int128) :-)

    Never mind... Seems that __int128 is still an experimental feature and some versions of clang even had bugs with it.

    serhiy-storchaka commented 8 years ago

    BTW, what do you think about using __int128 when available? That way we can also optimize twodigit PyLongs.

    __int128 is not always available and it would add too much complexity for possibly less gain. There are many ways to optimize the code and we should choose those that have the best gain/complexity ratio.

    Let's split the patch into smaller parts: 1) directly use the long-specialized functions in ceval.c, and 2) optimize the fast path in these functions, and test them separately and combined. Maybe only one way will add a gain.

    1st1 commented 8 years ago

    Attaching a second version of the patch. (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).

    I decided to add fast paths for floats & single-digit longs and their combinations. +, -, *, /, //, and their inplace versions are optimized now.
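
    For instance, the float/float case reduces to something like this (a sketch of the idea only; fastint2.patch also covers the mixed int/float combinations and more operators):

    ```c
    #include "Python.h"

    static PyObject *
    binary_add_float_fast(PyObject *left, PyObject *right)
    {
        if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right)) {
            /* Both operands are exact floats: add the raw doubles. */
            return PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +
                                      PyFloat_AS_DOUBLE(right));
        }
        return PyNumber_Add(left, right);    /* everything else */
    }
    ```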

    I'll have a full result of macro-benchmarks run tomorrow morning, but here's a result for spectral_norm (rigorous run, best of 3):

    ### spectral_norm ### Min: 0.300269 -> 0.233037: 1.29x faster Avg: 0.301700 -> 0.234282: 1.29x faster Significant (t=399.89) Stddev: 0.00147 -> 0.00083: 1.7619x smaller

    Some nano-benchmarks (best of 3):

    -m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])" 2.7 7.23 3.5 8.17 3.6 7.57

    -m timeit -s "loops=tuple(range(100))" "sum([x + x + 1.0 for x in loops])" 2.7 9.08 3.5 11.7 3.6 7.22

    -m timeit -s "loops=tuple(range(100))" "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])" 2.7 17.9 3.5 24.3 3.6 11.8

    malemburg commented 8 years ago

    On 04.02.2016 07:02, Yury Selivanov wrote:

    Attaching a second version of the patch. (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).

    It would be better to consistently have the fast_*() helpers return -1 in case of an error, instead of -1 or 1.

    Overall, I see two problems with doing too many of these fast paths:

    In a numerics heavy application it's likely that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only a few fast paths will trigger and so the various checks only add overhead.

    Since 'a'+'b' is a very often used instruction type in the latter type of applications, please make sure that this fast path gets more priority in your patch.

    Please also check the effects of the fast paths for cases where they don't trigger, e.g. 'a'+'b' or 'a'*2.

    Thanks, -- Marc-Andre Lemburg eGenix.com

    vstinner commented 8 years ago

    "In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."

    Hum, I disagree. See benchmark results in other messages. Examples:

    ### django_v2 ### Min: 2.682884 -> 2.633110: 1.02x faster

    ### unpickle_list ### Min: 1.333952 -> 1.212805: 1.10x faster

    These benchmarks are not written for numerics, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, like the str type.

    vstinner commented 8 years ago

    + if (Py_SIZE(left) != 0) {
    +     if (Py_SIZE(right) != 0) {
    +
    +#ifdef HAVE_LONG_LONG
    +         mul = PyLong_FromLongLong(
    +             (long long)SINGLE_DIGIT_LONG_AS_LONG(left) *
    +             SINGLE_DIGIT_LONG_AS_LONG(right));
    +#else
    +         mul = PyNumber_Multiply(left, right);
    +#endif

    Why don't you use the same code as long_mul() (you need #include "longintrepr.h")?

            stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
    #ifdef HAVE_LONG_LONG
            return PyLong_FromLongLong((PY_LONG_LONG)v);
    #else
            /* if we don't have long long then we're almost certainly
               using 15-bit digits, so v will fit in a long.  In the
               unlikely event that we're using 30-bit digits on a platform
               without long long, a large v will just cause us to fall
               through to the general multiplication code below. */
            if (v >= LONG_MIN && v <= LONG_MAX)
                return PyLong_FromLong((long)v);
    #endif

    I guess that long_mul() was always well optimized, so there is no need to experiment with something new.

    malemburg commented 8 years ago

    On 04.02.2016 09:01, STINNER Victor wrote:

    "In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."

    Hum, I disagree. See benchmark results in other messages. Examples:

    django_v2

    Min: 2.682884 -> 2.633110: 1.02x faster

    unpickle_list

    Min: 1.333952 -> 1.212805: 1.10x faster

    These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.

    Sure, some integer math is used in text applications as well, e.g. for indexing, counting and slicing, but the patch puts more emphasis on numeric operations, e.g. fast_add() tests for integers and floats before then coming back to check for Unicode.

    It would be interesting to know how often these paths trigger or not in the various benchmarks.

    BTW: The django_v2 benchmark result does not really say anything much. Values of +/- 2% do not have much meaning in benchmark results :-)

    pitrou commented 8 years ago

    I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

    The point of optimizing integers is that they are used for many purposes, not only "math" (e.g. indexing).

    vstinner commented 8 years ago

    I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

    Oh wait, maybe I misunderstood Marc-Andre's comment. If the question is only about float: I'm ok to drop the fast path for float. By the way, I would prefer to see PyLong_CheckExact() in the main loop, and only call fast_mul() if both operands are Python ints.

    serhiy-storchaka commented 8 years ago

    fastint2.patch adds a small regression for string multiplication:

    $ ./python -m timeit -s "x = 'x'" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
    Unpatched:  1.46 usec per loop
    Patched:    1.54 usec per loop

    Here is an alternative patch. It just uses the existing specialized functions for integers: long_add, long_sub and long_mul. It doesn't add a regression for the above example with string multiplication, and it looks faster than fastint2.patch for integer multiplication.

    $ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
    Unpatched:          0.887 usec per loop
    fastint2.patch:     0.841 usec per loop
    fastint_alt.patch:  0.804 usec per loop
    vstinner commented 8 years ago

    I prefer the fastint_alt.patch design; it's simpler. I added a comment on the review.

    My numbers, best of 5 timeit runs:

    $ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
    1st1 commented 8 years ago

    I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

    I disagree.

    30% faster floats (sic!) is a serious improvement that shouldn't just be discarded. Many applications have floating point calculations one way or another, but don't use numpy because it's overkill.

    Python 2 is much faster than Python 3 on any kind of numeric calculations. This point is frequently brought up in every Python 2 vs 3 debate. I think it's unacceptable.

    • the ceval loop may no longer fit into the CPU cache on systems with small cache sizes, since the compiler will likely inline all the fast_*() functions (I guess it would be possible to simply eliminate all fast paths using a compile time flag)

    That's speculation. It may still fit. Or it never really fitted; it's already huge. I tested the patch on an 8 year old desktop CPU: no performance degradation on our benchmarks.

    ### raytrace ### Avg: 1.858527 -> 1.652754: 1.12x faster

    ### nbody ### Avg: 0.310281 -> 0.285179: 1.09x faster

    ### float ### Avg: 0.392169 -> 0.358989: 1.09x faster

    ### chaos ### Avg: 0.355519 -> 0.326400: 1.09x faster

    ### spectral_norm ### Avg: 0.377147 -> 0.303928: 1.24x faster

    ### telco ### Avg: 0.012845 -> 0.013006: 1.01x slower

    The last benchmark (telco) is especially interesting. It uses decimals for calculation, which means that it uses overloaded numeric operators. Still no significant performance degradation.

    • maintenance will get more difficult

    The fast path for floats is easy to understand for every core dev who works with ceval; there is no rocket science there (if you want rocket science that is hard to maintain, look at generators/yield from). If you don't like inlining floating point calculations, we can export float_add, float_sub, float_div, and float_mul and use them in ceval.

    Why not combine my patch and Serhiy's? First we check if left & right are both longs. Then we check if they are unicode (for +). And then we have a fastpath for floats.

    vstinner commented 8 years ago

    Why not combine my patch and Serhiy's? First we check if left & right are both longs. Then we check if they are unicode (for +). And then we have a fastpath for floats.

    See my comment on Serhiy's patch. Maybe we can start by checking that the types of both operands are the same, and then use PyLong_CheckExact and PyUnicode_CheckExact.

    Using such a design, we may add a _PyFloat_Add(). But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.
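
    A sketch of that dispatch order (the helpers _PyLong_Add() and _PyFloat_Add() are hypothetical names, not existing APIs):

    ```c
    #include "Python.h"

    /* Hypothetical int- and float-specialized adders exported to ceval.c. */
    PyObject *_PyLong_Add(PyObject *left, PyObject *right);
    PyObject *_PyFloat_Add(PyObject *left, PyObject *right);

    static PyObject *
    binary_add_dispatch(PyObject *left, PyObject *right)
    {
        if (Py_TYPE(left) == Py_TYPE(right)) {
            if (PyLong_CheckExact(left))
                return _PyLong_Add(left, right);        /* int fast path */
            if (PyUnicode_CheckExact(left))
                return PyUnicode_Concat(left, right);   /* str fast path */
            if (PyFloat_CheckExact(left))
                return _PyFloat_Add(left, right);       /* possible float fast path */
        }
        /* Different types, or subclasses of int/str/float: generic path. */
        return PyNumber_Add(left, right);
    }
    ```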

    pitrou commented 8 years ago

    On 04/02/2016 14:54, Yury Selivanov wrote:

    30% faster floats (sic!) is a serious improvement, that shouldn't just be discarded. Many applications have floating point calculations one way or another, but don't use numpy because it's an overkill.

    Can you give any example of such an application and how they would actually benefit from "faster floats"? I'm curious why anyone who wants fast FP calculations would use pure Python with CPython...

    Discarding Numpy because it's "overkill" sounds misguided to me. That's like discarding asyncio because it's "less overkill" to write your own select() loop. It's often far more productive to use the established, robust, optimized library rather than tweak your own low-level code.

    (and Numpy is easier to learn than asyncio ;-))

    I'm not violently opposing the patch, but I think maintenance effort devoted to such micro-optimizations is a bit wasted. And once you add such a micro-optimization, trying to remove it often faces a barrage of FUD about making Python slower, even if the micro-optimization is practically worthless.

    Python 2 is much faster than Python 3 on any kind of numeric calculations.

    Actually, it shouldn't really be faster on FP calculations, since the float object hasn't changed (as opposed to int/long). So I'm skeptical of FP-heavy code that would have been made slower by Python 3 (unless there's also integer handling in that, e.g. indexing).