python / cpython


Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty #78577

Open df79943f-4aee-4531-a00d-c6b12816eb70 opened 6 years ago

df79943f-4aee-4531-a00d-c6b12816eb70 commented 6 years ago
BPO 34396
Nosy @rhettinger, @scoder, @benjaminp, @methane, @serhiy-storchaka, @jdemeyer, @mr-nfamous, @Marco-Sulla
PRs
  • python/cpython#14863
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['expert-C-API', '3.9', 'performance']
    title = 'Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty'
    updated_at =
    user = 'https://github.com/mr-nfamous'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'methane'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['C API']
    creation =
    creator = 'bup'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34396
    keywords = ['patch']
    message_count = 8.0
    messages = ['323490', '323498', '328187', '339449', '339453', '339477', '348185', '362715']
    nosy_count = 8.0
    nosy_names = ['rhettinger', 'scoder', 'benjamin.peterson', 'methane', 'serhiy.storchaka', 'jdemeyer', 'bup', 'Marco Sulla']
    pr_nums = ['14863']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue34396'
    versions = ['Python 3.9']
    ```

    df79943f-4aee-4531-a00d-c6b12816eb70 commented 6 years ago

    The ones I know of are list.__getitem__, dict.__contains__ and __getitem__, and (frozen)set.__contains__, but from what I can tell so far, dict.__getitem__ takes the worst hit. For dicts, I've spent a few hours trying to figure out what's being put into the heap type's mp_subscript slot to slow it down so drastically, but I just can't figure it out.

    So I just forced dict_subscript into the heap type's mp_subscript slot to confirm that doing so obviously doesn't break anything. Here's a "little" timeit script to demonstrate how horribly inefficient dict.__getitem__ is when called on anything other than an extension type:

    if __name__ == '__main__':
    
        import timeit
        from collections import OrderedDict as FastDictSub
        from ctypes import *
    
        class mappingmethods(Structure):
            _fields_ = [
                ('mp_length', c_void_p),
                ('mp_subscript', PYFUNCTYPE(py_object,py_object,py_object)),
                ('mp_ass_subscript', c_void_p)]
    
        class _type_object(Structure):
            _fields_ = [('junk', (c_uint8*56)), # <- brevity
                        ('tp_as_mapping', POINTER(mappingmethods))]
        py_type = POINTER(_type_object)
    
        class SlowDictSub(dict):
            pass
        assert SlowDictSub.__base__ is dict and FastDictSub.__base__ is dict
        assert SlowDictSub.__getitem__ is dict.__getitem__
        assert FastDictSub.__getitem__ is dict.__getitem__
        mp = dict.fromkeys(range(15))
        setup = 'from __main__ import mp, %s as Dict; d=Dict(mp)'
        print(f'Comparing execution time of heap allocated dict subtype '
              f'versus PyODict_Type (who also inherits dict.__getitem__)...\n')
        slown, slowt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
        print(f'avg. exec {slown/slowt:_.0f} SlowDictSub[x] statements per second')
        fastn, fastt = timeit.Timer('d[10]', setup%'FastDictSub').autorange()
        print(f'avg. exec {fastn/fastt:_.0f} OrderedDict[x] statements per second')
        print()
        print(f'SlowDictSub was {1/(fastt/slowt):.2f}x slower than OrderedDict... '
              f"Let's see what happens when we fix SlowDictSub's 'broken' "
              "mp_subscript slot:\n")
        slow_tp = cast(id(SlowDictSub), py_type).contents
        fast_tp = cast(id(FastDictSub), py_type).contents
    
        slow_tp.tp_as_mapping.contents.mp_subscript = (
            fast_tp.tp_as_mapping.contents.mp_subscript)
        postn, postt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
        print(f'avg. exec {postn/postt:_.0f} SlowDictSub[x] after slot change')
        print(f'which is now {1/(fastt/postt):.2}x the speed of dict_item')

    Comparing execution time of heap allocated dict subtype versus PyODict_Type (who also inherits dict.__getitem__)...

    avg. exec 6_220_709 SlowDictSub[x] statements per second
    avg. exec 21_894_971 OrderedDict[x] statements per second

    SlowDictSub was 3.52x slower than OrderedDict... Let's see what happens when we fix SlowDictSub's 'broken' mp_subscript slot:

    avg. exec 21_665_595 SlowDictSub[x] after slot change
    which is now 1.0x the speed of dict_item
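
The same effect can be reproduced without the ctypes slot surgery; a minimal sketch using only timeit (the ratio depends on the CPython version, and on versions where this issue applies the subclass is several times slower):

```python
import timeit

class SlowDictSub(dict):
    """Plain heap-allocated dict subtype; inherits dict.__getitem__."""
    pass

plain = dict.fromkeys(range(15))
sub = SlowDictSub(plain)

# Best-of-5 timing of one million / one hundred thousand lookups each
plain_t = min(timeit.repeat('d[10]', globals={'d': plain}, number=100_000, repeat=5))
sub_t = min(timeit.repeat('d[10]', globals={'d': sub}, number=100_000, repeat=5))
print(f'subclass indexing took {sub_t / plain_t:.2f}x as long as plain dict')
```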

    benjaminp commented 6 years ago

    Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db

    Reverting the dictobject.c changes of the commit, i.e., the following patch, fixes your example.

    diff --git a/Objects/dictobject.c b/Objects/dictobject.c
    index 413557d667..d85834977d 100644
    --- a/Objects/dictobject.c
    +++ b/Objects/dictobject.c
    @@ -3032,8 +3032,6 @@ dict_sizeof(PyDictObject *mp, PyObject *Py_UNUSED(ignored))
         return PyLong_FromSsize_t(_PyDict_SizeOf(mp));
     }
    
    -PyDoc_STRVAR(getitem__doc__, "x.__getitem__(y) <==> x[y]");
    -
     PyDoc_STRVAR(sizeof__doc__,
     "D.__sizeof__() -> size of D in memory, in bytes");
    
    @@ -3071,8 +3069,6 @@ PyDoc_STRVAR(values__doc__,
    
     static PyMethodDef mapp_methods[] = {
         DICT___CONTAINS___METHODDEF
    -    {"__getitem__", (PyCFunction)dict_subscript,        METH_O | METH_COEXIST,
    -     getitem__doc__},
         {"__sizeof__",      (PyCFunction)dict_sizeof,       METH_NOARGS,
          sizeof__doc__},
         DICT_GET_METHODDEF
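
As of current CPython that METH_COEXIST entry is still present, and its effect is visible from pure Python: dict's own __getitem__ shows up as a method descriptor rather than the slot wrapper that comparable dunders get (an illustrative check, not part of the original report):

```python
from types import MethodDescriptorType, WrapperDescriptorType

# __getitem__ comes from the METH_COEXIST entry in mapp_methods,
# so it is a built-in method descriptor...
print(type(vars(dict)['__getitem__']))
# ...while __setitem__ is the ordinary wrapper generated from the
# mp_ass_subscript slot.
print(type(vars(dict)['__setitem__']))
```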
    df79943f-4aee-4531-a00d-c6b12816eb70 commented 5 years ago

    Well, I found another mystery. Calling tuple.__getitem__ directly, on an actual tuple instance, is 50-80% slower than calling list.__getitem__, even though in theory they should be virtually identical, with tuple access maybe being infinitesimally faster.

    The only difference I found between the two __getitem__s, surprisingly, is that tuple *doesn't* have __getitem__ in its method list, while list does.

    >>> [].__getitem__
    <built-in method __getitem__ of list object at 0x012EE2D8>
    >>> ().__getitem__
    <method-wrapper '__getitem__' of tuple object at 0x00130030>

    Since list is using the entry in tp_methods and tuple is using the slot wrapper, it *looks* like METH_COEXIST *is* helping, but since we know that in both the dict case and now the tuple/list case it is unnecessary, there's clearly a bug somewhere causing extra argument tuple allocations and/or argument parsing. Unfortunately, it takes me several hours to "catch up" from CALL_FUNCTION in ceval.c to something like call_method in typeobject.c, and I don't have that kind of time at the moment, but hopefully someone notices this again.
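
That list/tuple asymmetry can be confirmed from pure Python; a quick check against current CPython, not from the original thread:

```python
from types import MethodDescriptorType, WrapperDescriptorType

# list defines __getitem__ in its tp_methods table with METH_COEXIST,
# while tuple only gets the slot wrapper synthesized from mp_subscript.
list_gi = vars(list)['__getitem__']
tuple_gi = vars(tuple)['__getitem__']
print(type(list_gi).__name__)   # method_descriptor
print(type(tuple_gi).__name__)  # wrapper_descriptor
```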

    jdemeyer commented 5 years ago

    Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db

    But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

    If PEP-580 or PEP-590 is accepted, it should be easy to simply make wrapper descriptors such as somedict.__getitem__ faster, removing the need for METH_COEXIST.

    serhiy-storchaka commented 5 years ago

    But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

    Using them in itertools functions. somedict.__getitem__ is faster than lambda x: somedict[x].

    As for map() and filter(), comprehensions are almost as fast as map() and filter() with somedict.__getitem__ and are more readable.
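
For instance (illustrative values, not from the thread), the bound method can be passed straight to map(), avoiding a Python-level frame per element:

```python
d = {k: k * k for k in range(5)}
keys = [3, 1, 4]

# Bound built-in method: dispatches directly to the C implementation
via_method = list(map(d.__getitem__, keys))
# Equivalent lambda: pays for a Python frame on every call
via_lambda = list(map(lambda k: d[k], keys))
print(via_method)  # [9, 1, 16]
```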

    jdemeyer commented 5 years ago

    OK, makes sense. Also super() calls I guess: you can write super().__getitem__(x) but not super()[x] (although the latter *could* be implemented if we wanted to).
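
A minimal sketch of that super() pattern — DefaultingDict is a hypothetical name used here for illustration:

```python
class DefaultingDict(dict):
    def __getitem__(self, key):
        # super()[key] is not valid syntax, so the explicit
        # super().__getitem__ call is the only way to reach dict's version
        try:
            return super().__getitem__(key)
        except KeyError:
            return None

d = DefaultingDict(a=1)
print(d['a'])        # 1
print(d['missing'])  # None
```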

    I see two ways of fixing this:

    1. Make wrapper descriptors faster, removing the need for the METH_COEXIST hack in the first place.

    2. Deal better with METH_COEXIST methods when inheriting.

    Since both of these solutions are closely related to the topic of PEP 580/PEP 590, I suggest trying to fix this after either is implemented.

    rhettinger commented 5 years ago

    Some thoughts:

    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    I asked why on StackOverflow, and a user seems to have found the reason. According to them, the problem is in update_one_slot().

    dict implements __contains__() and __getitem__() directly. Usually, sq_contains and mp_subscript are wrapped to implement __contains__() and __getitem__(), but implementing them directly makes dict a little faster.

    The problem is that update_one_slot() searches for the wrappers. If it does not find them, it does not inherit dict's __contains__() and __getitem__(), but instead creates __contains__() and __getitem__() functions that do an MRO search and call the superclass method. This is why __contains__() and __getitem__() of dict subclasses are slower.

    Is it possible to modify update_one_slot() so that, if no wrapper is found, the explicit implementation is inherited?

    SO answer: https://stackoverflow.com/a/59914459/1763602
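
The MRO lookup described in that answer is observable from pure Python: because the heap type's mp_subscript dispatches through the class, a later override on the class changes plain indexing too. A sketch of the mechanism (with a hypothetical class D), which also shows why the slot cannot simply be copied from the base type:

```python
class D(dict):
    pass

d = D(a=1)
print(d['a'])  # 1

# The dispatching mp_subscript installed for the heap type resolves
# __getitem__ through the MRO on every call, so assigning the method
# on the class retroactively changes the indexing syntax as well:
D.__getitem__ = lambda self, key: 'patched'
print(d['a'])  # patched
```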