python / cpython

The Python programming language
https://www.python.org

Caching infrastructure for the evaluation loop: specialised opcodes #86281

Closed pablogsal closed 2 years ago

pablogsal commented 3 years ago
BPO 42115
Nosy @gvanrossum, @warsaw, @nascheme, @methane, @markshannon, @1st1, @corona10, @pablogsal, @jdahlin, @isidentical, @Fidget-Spinner
PRs
  • python/cpython#22907
  • python/cpython#23503

Files
  • results.zip
  • result.zip
  • patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    pablogsal commented 3 years ago

    After https://bugs.python.org/issue42093 and https://bugs.python.org/issue26219 it has become clear that we can leverage a cache for different kinds of information in the evaluation loop to speed up CPython. This observation is also based on the fact that, although Python is dynamic, there is plenty of code that does not exercise said dynamism, so factoring out the "dynamic" parts of the execution via a cache mechanism can yield excellent results.

    So far we have two big performance improvements from caching LOAD_ATTR and LOAD_GLOBAL (in some cases up to 10-14%), but I think we can do much more. Here are some observations of what I think we can do:

    Obviously, mutating code objects is scary, so we could keep a "specialized" version of the bytecode in the cache and use that when present. Ideas for what we could do with this cached bytecode:

    The plan will be:

    pablogsal commented 3 years ago

    To clarify what I mean with:

    • We could also do the same for operations like "some_container[]" if the container is some builtin. We can substitute/specialize the opcode for one that directly uses built-in operations instead of the generic BINARY_SUBSCR.

    If a given function has a BINARY_SUBSCR opcode and, after executing it a given number of times, we see that the object being subscripted is always a list, we change the BINARY_SUBSCR to BINARY_SUBSCR_FOR_LISTS. In that opcode we do a quick PyList_CheckExact; if it passes, we call "Objects/listobject.c:list_subscript" directly, and if it fails, we put back a BINARY_SUBSCR.
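
    A minimal sketch of that specialize/deoptimize cycle in pure Python (the real mechanism would live in the C eval loop; the threshold, the SubscrSite class, and the re-specialization policy here are all made up for illustration):

    import operator

    SPECIALIZE_AFTER = 8  # hypothetical hotness threshold

    class SubscrSite:
        """Models one BINARY_SUBSCR instruction that can rewrite itself."""

        def __init__(self):
            self.handler = self.generic  # start with the generic opcode
            self.hits = 0

        def generic(self, container, index):
            self.hits += 1
            if self.hits >= SPECIALIZE_AFTER and type(container) is list:
                # Quicken: BINARY_SUBSCR -> BINARY_SUBSCR_FOR_LISTS
                self.handler = self.for_lists
            return operator.getitem(container, index)

        def for_lists(self, container, index):
            if type(container) is not list:
                # Guard failed: put the generic BINARY_SUBSCR back.
                self.handler = self.generic
                return operator.getitem(container, index)
            # Fast path: call the list implementation directly.
            return list.__getitem__(container, index)

    site = SubscrSite()
    for _ in range(20):
        site.handler([10, 20, 30], 1)  # warms up, then takes the fast path
    print(site.handler.__name__)       # -> for_lists
    site.handler({"k": "v"}, "k")      # guard fails -> deoptimize
    print(site.handler.__name__)       # -> generic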

    pablogsal commented 3 years ago

    Also, many of these ideas are not new; many are inspired by or taken from Yury's email (https://mail.python.org/pipermail/python-dev/2016-January/142945.html). But I wanted to add that I think that with some coordination between us we can achieve some excellent speedups for Python 3.10!

    1st1 commented 3 years ago

    A few thoughts in no particular order:

    pablogsal commented 3 years ago
    Rewriting code objects in place is wrong, IMO: you always need a way to deoptimize the entire thing, so you need to keep the original one. It might be that you have well-defined and static types for the first 10000 invocations and something entirely different on invocation 10001. So IMO we need a SpecializedCode object with the necessary bailout guards.

    Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference from the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised one to optimize, and change it back to deoptimize. The advantage is that BINARY_SUBSCR, for example, won't be a gigantic block of code that does different things depending on whether it is specialising for dicts, lists, or tuples; we will have a different opcode for each of them, which I think is much easier to manage.

    1st1 commented 3 years ago

    Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference from the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised one to optimize, and change it back to deoptimize.

    Yeah, I follow. As long as we keep the original list of opcodes we're good ;)

    methane commented 3 years ago

    FWIW, php7 is about 5x faster than Python on the spectral-norm benchmark. https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/php-python3.html

    There are two major reasons:

    Source code is here:
    php: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-php-1.html
    Python: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-python3-8.html

    The hottest function is eval_A():

    def eval_A(i, j): # i and j are int.
        ij = i + j
        return ij * (ij + 1) // 2 + i + 1

    And its bytecode:

    Disassembly of <code object eval_A at 0x107fd8500, file "x.py", line 1>:
      2           0 LOAD_FAST                0 (i)
                  2 LOAD_FAST                1 (j)
                  4 BINARY_ADD
                  6 STORE_FAST               2 (ij)
    
      3           8 LOAD_FAST                2 (ij)
                 10 LOAD_FAST                2 (ij)
                 12 LOAD_CONST               1 (1)
                 14 BINARY_ADD
                 16 BINARY_MULTIPLY
                 18 LOAD_CONST               2 (2)
                 20 BINARY_FLOOR_DIVIDE
                 22 LOAD_FAST                0 (i)
                 24 BINARY_ADD
                 26 LOAD_CONST               1 (1)
                 28 BINARY_ADD
                 30 RETURN_VALUE

    My thoughts:

    methane commented 3 years ago

    One more idea: BINARY_ADD_INT, whose operand would be an int immediate. This idea can be implemented without the opcode cache. I will try it.
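
    A rough, runnable way to see where such an opcode would apply is to scan bytecode for a LOAD_CONST of an int followed by BINARY_ADD (this assumes a 3.10-era interpreter; 3.11 later folded BINARY_ADD into BINARY_OP):

    import dis

    def eval_A(i, j):
        ij = i + j
        return ij * (ij + 1) // 2 + i + 1

    instrs = list(dis.get_instructions(eval_A))
    for a, b in zip(instrs, instrs[1:]):
        if (a.opname == "LOAD_CONST" and isinstance(a.argval, int)
                and b.opname == "BINARY_ADD"):
            print(f"offset {b.offset}: candidate for BINARY_ADD_INT {a.argval}")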

    1st1 commented 3 years ago

    This idea can be implemented without opcode cache. I will try it.

    I'd actually encourage trying to use the opcode cache, because this way the optimization will be more generic. E.g. decimal + decimal would also be specialized via the cache, because you'd cache a pointer to the specific + operator implementation. I'm really not sure that adding specialized bytecode is a good idea.
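
    A toy Python model of that per-call-site cache (illustrative only: real CPython would cache a C-level function pointer such as the type's nb_add slot, and BinaryAddCache is a made-up name):

    import operator

    class BinaryAddCache:
        """Caches the "+" implementation resolved for one call site."""

        def __init__(self):
            self.key = None    # (type(lhs), type(rhs)) guard
            self.impl = None   # cached callable

        def __call__(self, a, b):
            key = (type(a), type(b))
            if key == self.key:
                res = self.impl(a, b)
                if res is not NotImplemented:
                    return res  # cache hit: skip generic dispatch
            # Cache miss (or NotImplemented): resolve generically, re-cache.
            self.key = key
            self.impl = getattr(type(a), "__add__", operator.add)
            return operator.add(a, b)

    add_site = BinaryAddCache()
    print(add_site(1, 2))      # caches int.__add__
    print(add_site(1.5, 2.5))  # type guard fails; re-caches float.__add__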

    • PHP uses scalar type for float and int

    While at it, maybe we push the number of bits per int digit to 60?
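
    For reference, the digit size of the running build is visible at runtime; common 64-bit builds currently use 30 bits per digit:

    import sys
    print(sys.int_info)  # e.g. sys.int_info(bits_per_digit=30, sizeof_digit=4)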

    vstinner commented 3 years ago

    because you'd cache a pointer to the specific + operator implementation

    You should have a look at "INCA: Inline Caching meets Quickening in Python 3.3": https://bugs.python.org/issue14757

    Stefan Brunthaler wrote a paper on his work: "Inline Caching Meets Quickening" (Published in ECOOP 2010) https://publications.sba-research.org/publications/ecoop10.pdf

    pablogsal commented 3 years ago

    I may not know the average airspeed velocity of a laden swallow, but I know the average speed of adding a LOAD_METHOD opcode cache as in PR 23503 (measured with PGO + LTO + CPU isolation):

    +-------------------------+--------------------------------------+-------------------------------------------+
    | Benchmark               | 2020-11-24_18-08-master-0ec34cab9dd4 | 2020-11-24_19-41-load_method-0c43ca99644b |
    +=========================+======================================+===========================================+
    | pidigits                | 266 ms                               | 233 ms: 1.14x faster (-12%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_v8                | 32.7 ms                              | 30.4 ms: 1.07x faster (-7%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | unpickle                | 17.7 us                              | 17.1 us: 1.04x faster (-4%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | scimark_sparse_mat_mult | 6.21 ms                              | 5.99 ms: 1.04x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | go                      | 283 ms                               | 274 ms: 1.03x faster (-3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | mako                    | 20.4 ms                              | 19.9 ms: 1.03x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | pickle_pure_python      | 546 us                               | 531 us: 1.03x faster (-3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | logging_simple          | 9.58 us                              | 9.34 us: 1.03x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | telco                   | 8.21 ms                              | 8.03 ms: 1.02x faster (-2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_effbot            | 3.97 ms                              | 3.88 ms: 1.02x faster (-2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_dna               | 267 ms                               | 261 ms: 1.02x faster (-2%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | pathlib                 | 21.2 ms                              | 21.7 ms: 1.02x slower (+2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | xml_etree_iterparse     | 129 ms                               | 132 ms: 1.02x slower (+2%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | xml_etree_parse         | 186 ms                               | 193 ms: 1.03x slower (+3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+

    Not significant (46): scimark_fft; nbody; pickle; pickle_list; pickle_dict; logging_format; scimark_lu; deltablue; chaos; meteor_contest; sqlalchemy_imperative; scimark_sor; raytrace; scimark_monte_carlo; sympy_integrate; unpickle_pure_python; dulwich_log; logging_silent; nqueens; json_loads; crypto_pyaes; hexiom; django_template; sympy_str; regex_compile; sympy_expand; xml_etree_generate; unpack_sequence; spectral_norm; xml_etree_process; sqlite_synth; genshi_xml; pyflate; json_dumps; python_startup_no_site; 2to3; python_startup; genshi_text; unpickle_list; fannkuch; sympy_sum; sqlalchemy_declarative; chameleon; float; richards; tornado_http;

    methane commented 3 years ago

    Are pidigits and regex_v8 LOAD_ATTR-heavy benchmarks? I will run the benchmarks on my machine to confirm the results.

    pablogsal commented 3 years ago

    Are pidigits and regex_v8 LOAD_ATTR-heavy benchmarks?

    The PR is for the LOAD_METHOD infrastructure, not for LOAD_ATTR (the PR initially had an incorrect title, which I corrected, but the contents are correct).

    I will run the benchmarks on my machine to confirm the results.

    Thanks a lot, Inada-san

    methane commented 3 years ago
    $ ./python -m pyperf compare_to master.json load_method.json -G --min-speed=1
    Slower (15):
    - unpack_sequence: 63.2 ns +- 0.8 ns -> 68.5 ns +- 14.8 ns: 1.08x slower (+8%)
    - pathlib: 23.1 ms +- 0.3 ms -> 24.4 ms +- 0.4 ms: 1.05x slower (+5%)
    - scimark_fft: 418 ms +- 2 ms -> 433 ms +- 3 ms: 1.04x slower (+4%)
    - scimark_monte_carlo: 113 ms +- 2 ms -> 116 ms +- 3 ms: 1.03x slower (+3%)
    - unpickle_list: 5.30 us +- 0.03 us -> 5.44 us +- 0.28 us: 1.03x slower (+3%)
    - nqueens: 107 ms +- 1 ms -> 109 ms +- 3 ms: 1.02x slower (+2%)
    - spectral_norm: 154 ms +- 1 ms -> 157 ms +- 1 ms: 1.02x slower (+2%)
    - float: 136 ms +- 1 ms -> 138 ms +- 1 ms: 1.02x slower (+2%)
    - unpickle_pure_python: 324 us +- 3 us -> 329 us +- 4 us: 1.02x slower (+2%)
    - genshi_xml: 71.9 ms +- 1.9 ms -> 73.1 ms +- 1.0 ms: 1.02x slower (+2%)
    - scimark_sparse_mat_mult: 5.54 ms +- 0.03 ms -> 5.63 ms +- 0.03 ms: 1.02x slower (+2%)
    - regex_effbot: 3.45 ms +- 0.14 ms -> 3.49 ms +- 0.04 ms: 1.01x slower (+1%)
    - sqlite_synth: 3.37 us +- 0.06 us -> 3.41 us +- 0.07 us: 1.01x slower (+1%)
    - regex_v8: 27.0 ms +- 0.2 ms -> 27.3 ms +- 0.1 ms: 1.01x slower (+1%)
    - pickle_dict: 29.7 us +- 0.1 us -> 30.1 us +- 0.2 us: 1.01x slower (+1%)

    Faster (10):

    Benchmark hidden because not significant (35): 2to3, chameleon, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, genshi_text, hexiom, json_dumps, nbody, pickle, pickle_list, pidigits, pyflate, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, richards, scimark_lu, scimark_sor, sqlalchemy_declarative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

    Geometric mean: 1.00 (slower)

    pablogsal commented 3 years ago

    Oh, I am quite confused about what's going on with pidigits and regex_v8. I will try to run the benchmarks again. Did you compare the current master against the PR? If that's the case, we should rebase the PR first to make sure we are comparing it correctly.

    pablogsal commented 3 years ago

    OK, I have repeated the benchmarks after rebasing and this is what I get:

    venv ❯ python -m pyperf compare_to json/2020-12-16_11-20-master-8203c73f3bb1.json.gz json/2020-12-16_11-22-load_method-21b1566125b3.json.gz -G --min-speed=1

    Slower (13):

    Faster (15):

    Benchmark hidden because not significant (32): 2to3, chaos, crypto_pyaes, deltablue, dulwich_log, float, genshi_text, genshi_xml, hexiom, json_dumps, json_loads, logging_silent, nbody, pickle_dict, pyflate, python_startup, python_startup_no_site, scimark_lu, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

    pablogsal commented 3 years ago

    I think I am closing the PR, as it seems that the gains are not good enough (and there is quite a lot of noise when comparing the benchmark runs).

    1st1 commented 3 years ago

    I think I am closing the PR, as it seems that the gains are not good enough (and there is quite a lot of noise when comparing the benchmark runs).

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    pablogsal commented 3 years ago

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    I will try to do some prototyping around that to see how much we can gain via that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself, even without caching, no?

    1st1 commented 3 years ago

    I will try to do some prototyping around that to see how much we can gain via that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself, even without caching, no?

    Exactly.

    One argument for generalizing LOAD_METHOD is that I, for example, try not to use kwargs in hot paths because I know they will be slower, which feels very wrong. I shouldn't have to care about internal implementation details of CPython; I should be able to focus on writing readable code.

    gvanrossum commented 3 years ago

    I've skimmed several papers by Stefan Brunthaler about something called Quickening that I first found via the Pyston blog, which linked to an ancient bpo issue with a patch created by Stefan (https://bugs.python.org/issue14757).

    The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

    I think the idea is that it's pretty cheap to keep an extra pointer to an "alternate bytecode" array.

    1st1 commented 3 years ago

    The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

    This is also roughly what I suggested in https://bugs.python.org/msg379333. Except that I don't think it's necessary to add new opcodes.

    gvanrossum commented 3 years ago

    Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common. (For example, IIRC at my previous employer the hottest code was some kind of web template expansion and another bottleneck was marshalling RPC requests and results.)

    Optimizers usually evolve to be excellent at optimizing the benchmark du jour -- what benchmark do we want to use?

    1st1 commented 3 years ago

    Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common.

    IMO, we shouldn't focus too much on optimizing binops. Regular webapp/network kinds of code wouldn't benefit from that, and scientific code that uses numpy/ML libraries already offloads most of the expensive computation outside the CPython eval loop. Instead, I think we should continue focusing on lowering the cost of function/method calls and attribute access.

    pablogsal commented 3 years ago

    I subscribe to that, but it is also a matter of which optimizations have the highest impact/(cost+complexity) ratio. Caching the function pointers of binary operations or (better, IMHO) comparisons is likely a good candidate.

    gvanrossum commented 3 years ago

    Undoubtedly -- but impact should be measured on what typical users would see, not on narrow benchmarks.

    pablogsal commented 3 years ago

    Agreed. In that regard, the most standard thing we have is the pyperformance suite, which consists almost entirely of macro benchmarks. It is also what is displayed on speed.python.org.

    gvanrossum commented 3 years ago

    I had a simpler idea for an inline cache for LOAD_METHOD than python/cpython#67691. The essential part goes like this (sorry for the hybrid C/Python):

    if <optimized etc.>:
        type = Py_TYPE(obj)
        if type == lm->type and type->tp_version_tag == lm->tp_version_tag:
            # Cache hit: reuse the previously resolved method.
            meth = lm->meth
            SET_TOP(meth)
            PUSH(obj)
            DISPATCH()
    
    name = GETITEM(names, oparg)
    meth_found = _PyObject_GetMethod(obj, name, &meth)
    <error check>
    
    if meth_found:
        SET_TOP(meth)
        PUSH(obj)
        if <optimizing etc.>:
            # Fill the cache entry for the next execution of this opcode.
            lm = ...
            lm->type = type
            lm->tp_version_tag = type->tp_version_tag
            lm->meth = meth
    <etc.>

    What am I missing? Why is the hash of the name needed?

    Oh, it's probably because there could still be an overriding value in obj.__dict__. But certainly we could check for type == lm->type before the other checks (HasFeature and tp_version_tag).

    But what if we only did this for classes without an instance dict? That could work for things like list.append and str.find.
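
    A quick check of why that restriction helps: instances of most builtins have no __dict__, so a method resolved on the type can never be shadowed per instance:

    lst = []
    print(hasattr(lst, "__dict__"))          # False: nothing can shadow list.append
    print(type(lst).append is list.append)   # the method always comes from the type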

    pablogsal commented 3 years ago

    What am I missing? Why is the hash of the name needed?

    To speed up the call that gets the method from the dictionary, using _PyDict_GetItem_KnownHash. The reason I was not caching the method itself is that, as you mention, there could still be an overriding value in the dictionary.

    But what if we only did this for classes without an instance dict?

    This is an interesting idea. For PR 23503 it seems that the machinery was costly enough to kill the advantages, but maybe we could simplify it for classes without dictionaries so we can still gain overall.

    pablogsal commented 3 years ago

    I am attaching to this issue a patch with PR 23503 restricted to classes without a dict. I can measure a speedup, but it is quite small (5%) in a function that repeatedly calls a method of a builtin:

    def foo(obj):
        for _ in range(10000):
            res = obj.index(42)
        return res
    gvanrossum commented 3 years ago

    Thanks!

    The loop overhead is presumably dramatic in that example. I think I measured a somewhat bigger speedup using timeit, which IIRC subtracts the cost of an empty loop. But you're right that 5% on a micro-benchmark is not very exciting.
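
    For instance, something along these lines times the call with less loop overhead (the setup values here are arbitrary):

    import timeit

    setup = "obj = list(range(100)); obj[50] = 42"
    print(timeit.timeit("obj.index(42)", setup=setup, number=1_000_000))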

    I wonder if there are cases where the speedup is larger, e.g. in a class with __slots__ where the method is defined in a base class (also with __slots__). In that case LOAD_METHOD would require two dict lookups, both of which we'd avoid here.
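
    A hypothetical shape of that case (names made up): both classes use __slots__, so there is no instance __dict__, and the method lives only in the base class:

    class Base:
        __slots__ = ()

        def ping(self):
            return 42

    class Child(Base):
        __slots__ = ("x",)

    obj = Child()
    # Resolving obj.ping walks the MRO: Child.__dict__ misses, Base.__dict__ hits.
    # A type-versioned cache entry could skip both dictionary lookups.
    print(obj.ping())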

    This could be relevant if we had a hidden class mechanism (I'm playing with that but it's too soon to show anything). Let's keep your patch around for that.

    Has anyone looked at storing the cache literally inline? If we had a way to rewrite bytecode that allowed us to move code (requiring us to fix up jumps) the cache data could be stored inline in the bytecode array (I guess that's why they call it "inline cache" :-). So e.g. LOAD_METHOD could have a layout like this:

    LOAD_METHOD
        opcode
        oparg
        (padding)
        optimized
        type
        tp_version_tag
        meth

    This would avoid the pointer chasing needed to find the opcache struct, and increase memory locality. Obviously the bytecode would have to be copied to mutable memory.

    We could even make it so that "LOAD_METHOD with opcache entry" is a new opcode, so there would be less overhead for unoptimized code (and for optimized code :-).

    We'd have to do something funky for concurrent invocations of the same function (due to threading or recursion), but that can be dealt with; e.g., the frame could hold a pointer to the bytecode array in use.

    Fidget-Spinner commented 3 years ago

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    Recently I played around with that idea and extended LOAD/CALL_METHOD to keyword arguments (so CALL_FUNCTION_KW is replaced). I then reapplied Pablo's patch. Here's the pyperformance results (LTO + PGO):

    pyperf compare_to 2021-05-08_03-25-general_load_method-42fcad26a487.json.gz 2021-05-08_08-50-general_load_method-dd24d58ce940.json.gz -G --min-speed=1

    Slower (5):

    Faster (23):

    Benchmark hidden because not significant (30): (...)

    Geometric mean: 1.01x faster

    Most benchmarks don't use keyword arguments. I'm going to try to extend LOAD_METHOD to *args and **kwargs next.

    For anyone interested, here's the diff (unstable) https://github.com/python/cpython/compare/main...Fidget-Spinner:general_load_method

    I hope this helps :).

    gvanrossum commented 3 years ago

    Moving the needle on the pyperformance benchmarks is really hard!

    pablogsal commented 2 years ago

    Closing this, as we have already been implementing this idea with the adaptive interpreter.

    Fidget-Spinner commented 2 years ago

    For future reference, the following opcodes were specialized via the PEP 659 specializing adaptive interpreter:

    Combined instructions: bpo-44900 (2% faster pyperformance)