python / cpython

The Python programming language
https://www.python.org

Caching infrastructure for the evaluation loop: specialised opcodes #86281

Closed pablogsal closed 2 years ago

pablogsal commented 3 years ago
BPO 42115
Nosy @gvanrossum, @warsaw, @nascheme, @methane, @markshannon, @1st1, @corona10, @pablogsal, @jdahlin, @isidentical, @Fidget-Spinner
PRs
  • python/cpython#22907
  • python/cpython#23503

Files
  • results.zip
  • result.zip
  • patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    pablogsal commented 3 years ago

    After https://bugs.python.org/issue42093 and https://bugs.python.org/issue26219 it has become clear that we can leverage a cache for different kinds of information in the evaluation loop to speed up CPython. This observation is also based on the fact that, although Python is dynamic, there is plenty of code that does not exercise said dynamism, so factoring out the "dynamic" parts of the execution via a cache mechanism can yield excellent results.

    So far we have two big performance improvements from caching LOAD_ATTR and LOAD_GLOBAL (in some cases up to 10-14%), but I think we can do much more. Here are some observations of what I think we can do:

    Obviously, mutating code objects is scary, so we could keep a "specialized" version of the bytecode in the cache and use that when present. Ideas for what we could do with this cached bytecode:

    The plan will be:

    pablogsal commented 3 years ago

    To clarify what I mean with:

    • We could also do the same for operations like "some_container[]" if the container is some builtin. We can substitute/specialize the opcode for one that directly uses built-in operations instead of the generic BINARY_SUBSCR.

    If a given function has a BINARY_SUBSCR opcode and, after executing it a given number of times, we see that the object being subscripted is always a list, we change the BINARY_SUBSCR to BINARY_SUBSCR_FOR_LISTS. In that opcode we do a quick PyList_CheckExact; if it passes, we call "Objects/listobject.c:list_subscript" directly, and if it fails, we put back a BINARY_SUBSCR.
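
    A minimal sketch of that specialize/deoptimize cycle in pure Python (the real mechanism would live in the C eval loop; the threshold, the SubscrSite class, and the re-specialization policy here are all made up for illustration):

    import operator

    SPECIALIZE_AFTER = 8  # hypothetical hotness threshold

    class SubscrSite:
        """Models one BINARY_SUBSCR instruction that can rewrite itself."""

        def __init__(self):
            self.handler = self.generic  # start with the generic opcode
            self.hits = 0

        def generic(self, container, index):
            self.hits += 1
            if self.hits >= SPECIALIZE_AFTER and type(container) is list:
                # Quicken: BINARY_SUBSCR -> BINARY_SUBSCR_FOR_LISTS
                self.handler = self.for_lists
            return operator.getitem(container, index)

        def for_lists(self, container, index):
            if type(container) is not list:
                # Guard failed: put the generic BINARY_SUBSCR back.
                self.handler = self.generic
                return operator.getitem(container, index)
            # Fast path: call the list implementation directly.
            return list.__getitem__(container, index)

    site = SubscrSite()
    for _ in range(20):
        site.handler([10, 20, 30], 1)  # warms up, then takes the fast path
    print(site.handler.__name__)       # -> for_lists
    site.handler({"k": "v"}, "k")      # guard fails -> deoptimize
    print(site.handler.__name__)       # -> generic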

    pablogsal commented 3 years ago

    Also, many of these ideas are not new; many are inspired by or taken from Yury's email (https://mail.python.org/pipermail/python-dev/2016-January/142945.html). But I wanted to add that I think that with some coordination between us we can achieve some excellent speedups for Python 3.10!

    1st1 commented 3 years ago

    A few thoughts in no particular order:

    pablogsal commented 3 years ago
    Rewriting code objects in place is wrong, IMO: you always need a way to deoptimize the entire thing, so you need to keep the original one. It might be that you have well-defined and static types for the first 10000 invocations and something entirely different on invocation 10001. So IMO we need a SpecializedCode object with the necessary bailout guards.

    Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference from the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised one to optimize, and change it back to deoptimize. The advantage is that BINARY_SUBSCR, for example, won't be a gigantic block of code that does different things depending on whether it is specialising for dicts, lists, or tuples; we will have a different opcode for each of them, which I think is much easier to manage.

    1st1 commented 3 years ago

    Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference from the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised one to optimize, and change it back to deoptimize.

    Yeah, I follow. As long as we keep the original list of opcodes we're good ;)

    methane commented 3 years ago

    FWIW, php7 is about 5x faster than Python on the spectral-norm benchmark. https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/php-python3.html

    There are two major reasons:

    Source code is here:
    php: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-php-1.html
    Python: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-python3-8.html

    The hottest function is eval_A():

    def eval_A(i, j): # i and j are int.
        ij = i + j
        return ij * (ij + 1) // 2 + i + 1

    And its bytecode:

    Disassembly of <code object eval_A at 0x107fd8500, file "x.py", line 1>:
      2           0 LOAD_FAST                0 (i)
                  2 LOAD_FAST                1 (j)
                  4 BINARY_ADD
                  6 STORE_FAST               2 (ij)
    
      3           8 LOAD_FAST                2 (ij)
                 10 LOAD_FAST                2 (ij)
                 12 LOAD_CONST               1 (1)
                 14 BINARY_ADD
                 16 BINARY_MULTIPLY
                 18 LOAD_CONST               2 (2)
                 20 BINARY_FLOOR_DIVIDE
                 22 LOAD_FAST                0 (i)
                 24 BINARY_ADD
                 26 LOAD_CONST               1 (1)
                 28 BINARY_ADD
                 30 RETURN_VALUE

    My thoughts:

    methane commented 3 years ago

    One more idea: BINARY_ADD_INT, whose operand would be an int immediate. This idea can be implemented without the opcode cache. I will try it.
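
    A rough, runnable way to see where such an opcode would apply is to scan bytecode for a LOAD_CONST of an int followed by BINARY_ADD (this assumes a 3.10-era interpreter; 3.11 later folded BINARY_ADD into BINARY_OP):

    import dis

    def eval_A(i, j):
        ij = i + j
        return ij * (ij + 1) // 2 + i + 1

    instrs = list(dis.get_instructions(eval_A))
    for a, b in zip(instrs, instrs[1:]):
        if (a.opname == "LOAD_CONST" and isinstance(a.argval, int)
                and b.opname == "BINARY_ADD"):
            print(f"offset {b.offset}: candidate for BINARY_ADD_INT {a.argval}")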

    1st1 commented 3 years ago

    This idea can be implemented without opcode cache. I will try it.

    I'd actually encourage trying to use the opcode cache, because this way the optimization will be more generic. E.g. decimal + decimal would also be specialized via the cache, because you'd cache a pointer to the specific + operator implementation. I'm really not sure that adding specialized bytecode is a good idea.
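
    A toy Python model of that per-call-site cache (illustrative only: real CPython would cache a C-level function pointer such as the type's nb_add slot, and BinaryAddCache is a made-up name):

    import operator

    class BinaryAddCache:
        """Caches the "+" implementation resolved for one call site."""

        def __init__(self):
            self.key = None    # (type(lhs), type(rhs)) guard
            self.impl = None   # cached callable

        def __call__(self, a, b):
            key = (type(a), type(b))
            if key == self.key:
                res = self.impl(a, b)
                if res is not NotImplemented:
                    return res  # cache hit: skip generic dispatch
            # Cache miss (or NotImplemented): resolve generically, re-cache.
            self.key = key
            self.impl = getattr(type(a), "__add__", operator.add)
            return operator.add(a, b)

    add_site = BinaryAddCache()
    print(add_site(1, 2))      # caches int.__add__
    print(add_site(1.5, 2.5))  # type guard fails; re-caches float.__add__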

    • PHP uses scalar type for float and int

    While at it, maybe we push the number of bits per int digit to 60?
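
    For reference, the digit size of the running build is visible at runtime; common 64-bit builds currently use 30 bits per digit:

    import sys
    print(sys.int_info)  # e.g. sys.int_info(bits_per_digit=30, sizeof_digit=4)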

    vstinner commented 3 years ago

    because you'd cache a pointer to the specific + operator implementation

    You should have a look at "INCA: Inline Caching meets Quickening in Python 3.3": https://bugs.python.org/issue14757

    Stefan Brunthaler wrote a paper on his work: "Inline Caching Meets Quickening" (Published in ECOOP 2010) https://publications.sba-research.org/publications/ecoop10.pdf

    pablogsal commented 3 years ago

    I may not know the average airspeed velocity of a laden swallow, but I know the average speed of adding a LOAD_METHOD opcode cache as in PR 23503 (measured with PGO + LTO + CPU isolation):

    +-------------------------+--------------------------------------+-------------------------------------------+
    | Benchmark               | 2020-11-24_18-08-master-0ec34cab9dd4 | 2020-11-24_19-41-load_method-0c43ca99644b |
    +=========================+======================================+===========================================+
    | pidigits                | 266 ms                               | 233 ms: 1.14x faster (-12%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_v8                | 32.7 ms                              | 30.4 ms: 1.07x faster (-7%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | unpickle                | 17.7 us                              | 17.1 us: 1.04x faster (-4%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | scimark_sparse_mat_mult | 6.21 ms                              | 5.99 ms: 1.04x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | go                      | 283 ms                               | 274 ms: 1.03x faster (-3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | mako                    | 20.4 ms                              | 19.9 ms: 1.03x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | pickle_pure_python      | 546 us                               | 531 us: 1.03x faster (-3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | logging_simple          | 9.58 us                              | 9.34 us: 1.03x faster (-3%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | telco                   | 8.21 ms                              | 8.03 ms: 1.02x faster (-2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_effbot            | 3.97 ms                              | 3.88 ms: 1.02x faster (-2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | regex_dna               | 267 ms                               | 261 ms: 1.02x faster (-2%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | pathlib                 | 21.2 ms                              | 21.7 ms: 1.02x slower (+2%)               |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | xml_etree_iterparse     | 129 ms                               | 132 ms: 1.02x slower (+2%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+
    | xml_etree_parse         | 186 ms                               | 193 ms: 1.03x slower (+3%)                |
    +-------------------------+--------------------------------------+-------------------------------------------+

    Not significant (46): scimark_fft; nbody; pickle; pickle_list; pickle_dict; logging_format; scimark_lu; deltablue; chaos; meteor_contest; sqlalchemy_imperative; scimark_sor; raytrace; scimark_monte_carlo; sympy_integrate; unpickle_pure_python; dulwich_log; logging_silent; nqueens; json_loads; crypto_pyaes; hexiom; django_template; sympy_str; regex_compile; sympy_expand; xml_etree_generate; unpack_sequence; spectral_norm; xml_etree_process; sqlite_synth; genshi_xml; pyflate; json_dumps; python_startup_no_site; 2to3; python_startup; genshi_text; unpickle_list; fannkuch; sympy_sum; sqlalchemy_declarative; chameleon; float; richards; tornado_http;

    methane commented 3 years ago

    Are pidigits and regex_v8 LOAD_ATTR-heavy benchmarks? I will run the benchmarks on my machine to confirm the results.

    pablogsal commented 3 years ago

    Are pidigits and regex_v8 LOAD_ATTR-heavy benchmarks?

    The PR is for the LOAD_METHOD infrastructure, not for LOAD_ATTR (the PR initially had an incorrect title, which I corrected, but the contents are correct).

    I will run the benchmarks on my machine to confirm the results.

    Thanks a lot, Inada-san

    methane commented 3 years ago
    $ ./python -m pyperf compare_to master.json load_method.json -G --min-speed=1
    Slower (15):
    - unpack_sequence: 63.2 ns +- 0.8 ns -> 68.5 ns +- 14.8 ns: 1.08x slower (+8%)
    - pathlib: 23.1 ms +- 0.3 ms -> 24.4 ms +- 0.4 ms: 1.05x slower (+5%)
    - scimark_fft: 418 ms +- 2 ms -> 433 ms +- 3 ms: 1.04x slower (+4%)
    - scimark_monte_carlo: 113 ms +- 2 ms -> 116 ms +- 3 ms: 1.03x slower (+3%)
    - unpickle_list: 5.30 us +- 0.03 us -> 5.44 us +- 0.28 us: 1.03x slower (+3%)
    - nqueens: 107 ms +- 1 ms -> 109 ms +- 3 ms: 1.02x slower (+2%)
    - spectral_norm: 154 ms +- 1 ms -> 157 ms +- 1 ms: 1.02x slower (+2%)
    - float: 136 ms +- 1 ms -> 138 ms +- 1 ms: 1.02x slower (+2%)
    - unpickle_pure_python: 324 us +- 3 us -> 329 us +- 4 us: 1.02x slower (+2%)
    - genshi_xml: 71.9 ms +- 1.9 ms -> 73.1 ms +- 1.0 ms: 1.02x slower (+2%)
    - scimark_sparse_mat_mult: 5.54 ms +- 0.03 ms -> 5.63 ms +- 0.03 ms: 1.02x slower (+2%)
    - regex_effbot: 3.45 ms +- 0.14 ms -> 3.49 ms +- 0.04 ms: 1.01x slower (+1%)
    - sqlite_synth: 3.37 us +- 0.06 us -> 3.41 us +- 0.07 us: 1.01x slower (+1%)
    - regex_v8: 27.0 ms +- 0.2 ms -> 27.3 ms +- 0.1 ms: 1.01x slower (+1%)
    - pickle_dict: 29.7 us +- 0.1 us -> 30.1 us +- 0.2 us: 1.01x slower (+1%)

    Faster (10):

    Benchmark hidden because not significant (35): 2to3, chameleon, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, genshi_text, hexiom, json_dumps, nbody, pickle, pickle_list, pidigits, pyflate, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, richards, scimark_lu, scimark_sor, sqlalchemy_declarative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

    Geometric mean: 1.00 (slower)

    pablogsal commented 3 years ago

    Oh, I am quite confused about what's going on with pidigits and regex_v8. I will try to run the benchmarks again. Did you compare the current master against the PR? If that's the case, we should rebase the PR first to make sure we are comparing it correctly.

    pablogsal commented 3 years ago

    OK, I have repeated the benchmarks after rebasing and this is what I get:

    venv ❯ python -m pyperf compare_to json/2020-12-16_11-20-master-8203c73f3bb1.json.gz json/2020-12-16_11-22-load_method-21b1566125b3.json.gz -G --min-speed=1

    Slower (13):

    Faster (15):

    Benchmark hidden because not significant (32): 2to3, chaos, crypto_pyaes, deltablue, dulwich_log, float, genshi_text, genshi_xml, hexiom, json_dumps, json_loads, logging_silent, nbody, pickle_dict, pyflate, python_startup, python_startup_no_site, scimark_lu, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

    pablogsal commented 3 years ago

    I think I am closing the PR, as it seems that the gains are not good enough (and there is quite a lot of noise when comparing the benchmark runs).

    1st1 commented 3 years ago

    I think I am closing the PR, as it seems that the gains are not good enough (and there is quite a lot of noise when comparing the benchmark runs).

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    pablogsal commented 3 years ago

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    I will try to do some prototyping around that to see how much we can gain via that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself, even without caching, no?

    1st1 commented 3 years ago

    I will try to do some prototyping around that to see how much we can gain via that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself, even without caching, no?

    Exactly.

    One argument for generalizing LOAD_METHOD is that I, for example, try not to use kwargs in hot paths because I know they will be slower, which feels very wrong. I shouldn't have to care about internal implementation details of CPython; I should be able to focus on writing readable code.

    gvanrossum commented 3 years ago

    I've skimmed several papers by Stefan Brunthaler about something called Quickening that I first found via the Pyston blog, which linked to an ancient bpo issue with a patch created by Stefan (https://bugs.python.org/issue14757).

    The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

    I think the idea is that it's pretty cheap to keep an extra pointer to an "alternate bytecode" array.

    1st1 commented 3 years ago

    The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

    This is also roughly what I suggested in https://bugs.python.org/msg379333. Except that I don't think it's necessary to add new opcodes.

    gvanrossum commented 3 years ago

    Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common. (For example, IIRC at my previous employer the hottest code was some kind of web template expansion and another bottleneck was marshalling RPC requests and results.)

    Optimizers usually evolve to be excellent at optimizing the benchmark du jour -- what benchmark do we want to use?

    1st1 commented 3 years ago

    Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common.

    IMO, we shouldn't focus too much on optimizing binops. Regular webapp/network kinds of code wouldn't benefit from that, and scientific code that uses numpy/ML libraries already offloads most of the expensive computation outside the CPython eval loop. Instead, I think we should continue focusing on lowering the cost of function/method calls and attribute access.

    pablogsal commented 3 years ago

    I subscribe to that, but it is also a matter of which optimizations have the highest impact/(cost+complexity) ratio. Caching the function pointers of binary operations or (better, IMHO) comparisons is likely a good candidate.

    gvanrossum commented 3 years ago

    Undoubtedly -- but impact should be measured on what typical users would see, not on narrow benchmarks.

    pablogsal commented 3 years ago

    Agreed. In that regard, the most standard thing we have is the pyperformance suite, which consists almost entirely of macro benchmarks. It is also what is displayed on speed.python.org.

    gvanrossum commented 3 years ago

    I had a simpler idea for an inline cache for LOAD_METHOD than python/cpython#67691. The essential part goes like this (sorry for the hybrid C/Python):

    if <optimized etc.>:
        type = Py_TYPE(obj)
        if type == lm->type and type->tp_version_tag == lm->tp_version_tag:
            # Cache hit: reuse the previously resolved method.
            meth = lm->meth
            SET_TOP(meth)
            PUSH(obj)
            DISPATCH()
    
    name = GETITEM(names, oparg)
    meth_found = _PyObject_GetMethod(obj, name, &meth)
    <error check>
    
    if meth_found:
        SET_TOP(meth)
        PUSH(obj)
        if <optimizing etc.>:
            # Fill the cache entry for the next execution of this opcode.
            lm = ...
            lm->type = type
            lm->tp_version_tag = type->tp_version_tag
            lm->meth = meth
    <etc.>

    What am I missing? Why is the hash of the name needed?

    Oh, it's probably because there could still be an overriding value in obj.__dict__. But certainly we could check for type == lm->type before the other checks (HasFeature and tp_version_tag).

    But what if we only did this for classes without an instance dict? That could work for things like list.append and str.find.
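
    A quick check of why that restriction helps: instances of most builtins have no __dict__, so a method resolved on the type can never be shadowed per instance:

    lst = []
    print(hasattr(lst, "__dict__"))          # False: nothing can shadow list.append
    print(type(lst).append is list.append)   # the method always comes from the type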

    pablogsal commented 3 years ago

    What am I missing? Why is the hash of the name needed?

    To speed up the call that gets the method from the dictionary, using _PyDict_GetItem_KnownHash. The reason I was not caching the method itself is that, as you mention, there could still be an overriding value in the dictionary.

    But what if we only did this for classes without an instance dict?

    This is an interesting idea. For PR 23503 it seems that the machinery was costly enough to kill the advantages, but maybe we could simplify it for classes without dictionaries so we can still gain overall.

    pablogsal commented 3 years ago

    I am attaching to this issue a patch with PR 23503 restricted to classes without a dict. I can measure a speedup, but it is quite small (5%) in a function that repeatedly calls a method of a builtin:

    def foo(obj):
        for _ in range(10000):
            res = obj.index(42)
        return res
    gvanrossum commented 3 years ago

    Thanks!

    The loop overhead is presumably dramatic in that example. I think I measured a somewhat bigger speedup using timeit, which IIRC subtracts the cost of an empty loop. But you're right that 5% on a micro-benchmark is not very exciting.
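
    For instance, something along these lines times the call with less loop overhead (the setup values here are arbitrary):

    import timeit

    setup = "obj = list(range(100)); obj[50] = 42"
    print(timeit.timeit("obj.index(42)", setup=setup, number=1_000_000))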

    I wonder if there are cases where the speedup is larger, e.g. in a class with __slots__ where the method is defined in a base class (also with __slots__). In that case LOAD_METHOD would require two dict lookups, both of which we'd avoid here.
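
    A hypothetical shape of that case (names made up): both classes use __slots__, so there is no instance __dict__, and the method lives only in the base class:

    class Base:
        __slots__ = ()

        def ping(self):
            return 42

    class Child(Base):
        __slots__ = ("x",)

    obj = Child()
    # Resolving obj.ping walks the MRO: Child.__dict__ misses, Base.__dict__ hits.
    # A type-versioned cache entry could skip both dictionary lookups.
    print(obj.ping())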

    This could be relevant if we had a hidden class mechanism (I'm playing with that but it's too soon to show anything). Let's keep your patch around for that.

    Has anyone looked at storing the cache literally inline? If we had a way to rewrite bytecode that allowed us to move code (requiring us to fix up jumps) the cache data could be stored inline in the bytecode array (I guess that's why they call it "inline cache" :-). So e.g. LOAD_METHOD could have a layout like this:

    LOAD_METHOD
        opcode
        oparg
        (padding)
        optimized
        type
        tp_version_tag
        meth

    This would avoid the pointer chasing needed to find the opcache struct, and increase memory locality. Obviously the bytecode would have to be copied to mutable memory.

    We could even make it so that "LOAD_METHOD with opcache entry" is a new opcode, so there would be less overhead for unoptimized code (and for optimized code :-).

    We'd have to do something funky for concurrent invocations of the same function (due to threading or recursion), but that can be dealt with; e.g., the frame could hold a pointer to the bytecode array in use.

    Fidget-Spinner commented 3 years ago

    IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

    Recently I played around with that idea and extended LOAD/CALL_METHOD to keyword arguments (so CALL_FUNCTION_KW is replaced). I then reapplied Pablo's patch. Here's the pyperformance results (LTO + PGO):

    pyperf compare_to 2021-05-08_03-25-general_load_method-42fcad26a487.json.gz 2021-05-08_08-50-general_load_method-dd24d58ce940.json.gz -G --min-speed=1

    Slower (5):

    Faster (23):

    Benchmark hidden because not significant (30): (...)

    Geometric mean: 1.01x faster

    Most benchmarks don't use keyword arguments. I'm going to try to extend LOAD_METHOD to *args and **kwargs next.

    For anyone interested, here's the diff (unstable) https://github.com/python/cpython/compare/main...Fidget-Spinner:general_load_method

    I hope this helps :).

    gvanrossum commented 3 years ago

    Moving the needle on the pyperformance benchmarks is really hard!

    pablogsal commented 2 years ago

    Closing this, as we have already been implementing this idea with the adaptive interpreter.

    Fidget-Spinner commented 2 years ago

    For future reference, the following opcodes were specialized via the PEP 659 specializing adaptive interpreter:

    Combined instructions: bpo-44900 (2% faster pyperformance)