python / cpython


Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty #78577

Open df79943f-4aee-4531-a00d-c6b12816eb70 opened 6 years ago

df79943f-4aee-4531-a00d-c6b12816eb70 commented 6 years ago
BPO 34396
Nosy @rhettinger, @scoder, @benjaminp, @methane, @serhiy-storchaka, @jdemeyer, @mr-nfamous, @Marco-Sulla
PRs
  • python/cpython#14863
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['expert-C-API', '3.9', 'performance']
    title = 'Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty'
    updated_at =
    user = 'https://github.com/mr-nfamous'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'methane'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['C API']
    creation =
    creator = 'bup'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34396
    keywords = ['patch']
    message_count = 8.0
    messages = ['323490', '323498', '328187', '339449', '339453', '339477', '348185', '362715']
    nosy_count = 8.0
    nosy_names = ['rhettinger', 'scoder', 'benjamin.peterson', 'methane', 'serhiy.storchaka', 'jdemeyer', 'bup', 'Marco Sulla']
    pr_nums = ['14863']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue34396'
    versions = ['Python 3.9']
    ```

    df79943f-4aee-4531-a00d-c6b12816eb70 commented 6 years ago

    The ones I know of are list.__getitem__, dict.__contains__ and __getitem__, and (frozen)set.__contains__, but from what I can tell so far, dict.__getitem__ takes the worst hit. For dicts, I've spent a few hours trying to figure out what's being put into the heap type's mp_subscript slot to slow it down so drastically, but I just can't figure it out.

    So I just forced dict_subscript into the heap type's mp_subscript slot to confirm that doing so obviously doesn't break anything. Here's a "little" timeit script to demonstrate how horribly inefficient dict.__getitem__ is when called on anything other than an extension type:

    if __name__ == '__main__':
    
        import timeit
        from collections import OrderedDict as FastDictSub
        from ctypes import *
    
        class mappingmethods(Structure):
            _fields_ = [
                ('mp_length', c_void_p),
                ('mp_subscript', PYFUNCTYPE(py_object,py_object,py_object)),
                ('mp_ass_subscript', c_void_p)]
    
        class _type_object(Structure):
            _fields_ = [('junk', (c_uint8*56)), # <- brevity
                        ('tp_as_mapping', POINTER(mappingmethods))]
        py_type = POINTER(_type_object)
    
        class SlowDictSub(dict):
            pass
        assert SlowDictSub.__base__ is dict and FastDictSub.__base__ is dict
        assert SlowDictSub.__getitem__ is dict.__getitem__
        assert FastDictSub.__getitem__ is dict.__getitem__
        mp = dict.fromkeys(range(15))
        setup = 'from __main__ import mp, %s as Dict; d=Dict(mp)'
        print(f'Comparing execution time of heap allocated dict subtype '
              f'versus PyODict_Type (who also inherits dict.__getitem__)...\n')
        slown, slowt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
        print(f'avg. exec {slown/slowt:_.0f} SlowDictSub[x] statements per second')
        fastn, fastt = timeit.Timer('d[10]', setup%'FastDictSub').autorange()
        print(f'avg. exec {fastn/fastt:_.0f} OrderedDict[x] statements per second')
        print()
        print(f'SlowDictSub was {1/(fastt/slowt):.2f}x slower than OrderedDict... '
              f"Let's see what happens when we fix SlowDictSub's 'broken' "
              "mp_subscript slot:\n")
        slow_tp = cast(id(SlowDictSub), py_type).contents
        fast_tp = cast(id(FastDictSub), py_type).contents
    
        slow_tp.tp_as_mapping.contents.mp_subscript = (
            fast_tp.tp_as_mapping.contents.mp_subscript)
        postn, postt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
        print(f'avg. exec {postn/postt:_.0f} SlowDictSub[x] after slot change')
        print(f'which is now {1/(fastt/postt):.2}x the speed of dict_item')

    Comparing execution time of heap allocated dict subtype versus PyODict_Type (who also inherits dict.__getitem__)...

    avg. exec 6_220_709 SlowDictSub[x] statements per second
    avg. exec 21_894_971 OrderedDict[x] statements per second

    SlowDictSub was 3.52x slower than OrderedDict... Let's see what happens when we fix SlowDictSub's 'broken' mp_subscript slot:

    avg. exec 21_665_595 SlowDictSub[x] after slot change
    which is now 1.0x the speed of dict_item
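
The same effect can be reproduced without the ctypes slot surgery; a minimal sketch using only timeit (the ratio depends on the CPython version, and on versions where this issue applies the subclass is several times slower):

```python
import timeit

class SlowDictSub(dict):
    """Plain heap-allocated dict subtype; inherits dict.__getitem__."""
    pass

plain = dict.fromkeys(range(15))
sub = SlowDictSub(plain)

# Best-of-5 timing of one million / one hundred thousand lookups each
plain_t = min(timeit.repeat('d[10]', globals={'d': plain}, number=100_000, repeat=5))
sub_t = min(timeit.repeat('d[10]', globals={'d': sub}, number=100_000, repeat=5))
print(f'subclass indexing took {sub_t / plain_t:.2f}x as long as plain dict')
```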

    benjaminp commented 6 years ago

    Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db

    Reverting the dictobject.c changes of the commit, i.e., the following patch, fixes your example.

    diff --git a/Objects/dictobject.c b/Objects/dictobject.c
    index 413557d667..d85834977d 100644
    --- a/Objects/dictobject.c
    +++ b/Objects/dictobject.c
    @@ -3032,8 +3032,6 @@ dict_sizeof(PyDictObject *mp, PyObject *Py_UNUSED(ignored))
         return PyLong_FromSsize_t(_PyDict_SizeOf(mp));
     }
    
    -PyDoc_STRVAR(getitem__doc__, "x.__getitem__(y) <==> x[y]");
    -
     PyDoc_STRVAR(sizeof__doc__,
     "D.__sizeof__() -> size of D in memory, in bytes");
    
    @@ -3071,8 +3069,6 @@ PyDoc_STRVAR(values__doc__,
    
     static PyMethodDef mapp_methods[] = {
         DICT___CONTAINS___METHODDEF
    -    {"__getitem__", (PyCFunction)dict_subscript,        METH_O | METH_COEXIST,
    -     getitem__doc__},
         {"__sizeof__",      (PyCFunction)dict_sizeof,       METH_NOARGS,
          sizeof__doc__},
         DICT_GET_METHODDEF
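
As of current CPython that METH_COEXIST entry is still present, and its effect is visible from pure Python: dict's own __getitem__ shows up as a method descriptor rather than the slot wrapper that comparable dunders get (an illustrative check, not part of the original report):

```python
from types import MethodDescriptorType, WrapperDescriptorType

# __getitem__ comes from the METH_COEXIST entry in mapp_methods,
# so it is a built-in method descriptor...
print(type(vars(dict)['__getitem__']))
# ...while __setitem__ is the ordinary wrapper generated from the
# mp_ass_subscript slot.
print(type(vars(dict)['__setitem__']))
```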
    df79943f-4aee-4531-a00d-c6b12816eb70 commented 5 years ago

    Well, I found another mystery. Calling tuple.__getitem__ directly, on an actual tuple instance, is 50-80% slower than calling list.__getitem__, even though in theory they should be virtually identical, with tuple access maybe being infinitesimally faster.

    The only difference I found between the two __getitem__s, surprisingly, is that tuple *doesn't* have __getitem__ in its method list, while list does.

    >>> [].__getitem__
    <built-in method __getitem__ of list object at 0x012EE2D8>
    >>> ().__getitem__
    <method-wrapper '__getitem__' of tuple object at 0x00130030>

    Since list is using the entry in tp_methods and tuple is using the slot wrapper, it *looks* like METH_COEXIST *is* helping, but since we know that in both the dict case and now the tuple/list case it is unnecessary, there's clearly a bug somewhere causing extra argument tuple allocations and/or argument parsing. Unfortunately, it takes me several hours to "catch up" from CALL_FUNCTION in ceval.c to something like call_method in typeobject.c, and I don't have that kind of time at the moment, but hopefully someone notices this again.
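
That list/tuple asymmetry can be confirmed from pure Python; a quick check against current CPython, not from the original thread:

```python
from types import MethodDescriptorType, WrapperDescriptorType

# list defines __getitem__ in its tp_methods table with METH_COEXIST,
# while tuple only gets the slot wrapper synthesized from mp_subscript.
list_gi = vars(list)['__getitem__']
tuple_gi = vars(tuple)['__getitem__']
print(type(list_gi).__name__)   # method_descriptor
print(type(tuple_gi).__name__)  # wrapper_descriptor
```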

    jdemeyer commented 5 years ago

    Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db

    But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

    If PEP-580 or PEP-590 is accepted, it should be easy to simply make wrapper descriptors such as somedict.__getitem__ faster, removing the need for METH_COEXIST.

    serhiy-storchaka commented 5 years ago

    But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

    Using them in itertools functions. somedict.__getitem__ is faster than lambda x: somedict[x].

    As for map() and filter(), comprehensions are almost as fast as map() and filter() with somedict.__getitem__ and are more readable.
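
For instance (illustrative values, not from the thread), the bound method can be passed straight to map(), avoiding a Python-level frame per element:

```python
d = {k: k * k for k in range(5)}
keys = [3, 1, 4]

# Bound built-in method: dispatches directly to the C implementation
via_method = list(map(d.__getitem__, keys))
# Equivalent lambda: pays for a Python frame on every call
via_lambda = list(map(lambda k: d[k], keys))
print(via_method)  # [9, 1, 16]
```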

    jdemeyer commented 5 years ago

    OK, makes sense. Also super() calls I guess: you can write super().__getitem__(x) but not super()[x] (although the latter *could* be implemented if we wanted to).
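
A minimal sketch of that super() pattern — DefaultingDict is a hypothetical name used here for illustration:

```python
class DefaultingDict(dict):
    def __getitem__(self, key):
        # super()[key] is not valid syntax, so the explicit
        # super().__getitem__ call is the only way to reach dict's version
        try:
            return super().__getitem__(key)
        except KeyError:
            return None

d = DefaultingDict(a=1)
print(d['a'])        # 1
print(d['missing'])  # None
```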

    I see two ways of fixing this:

    1. Make wrapper descriptors faster, removing the need for the METH_COEXIST hack in the first place.

    2. Deal better with METH_COEXIST methods when inheriting.

    Since both of these solutions are closely related to the topic of PEP 580/PEP 590, I suggest trying to fix this after either is implemented.

    rhettinger commented 5 years ago

    Some thoughts:

    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    I asked why on StackOverflow, and a user seems to have found the reason. According to them, the problem is in update_one_slot().

    dict implements __contains__() and __getitem__() directly. Usually, sq_contains and mp_subscript are wrapped to implement __contains__() and __getitem__(), but implementing them directly makes dict a little faster.

    The problem is that update_one_slot() searches for the wrappers. If it does not find them, it does not inherit dict's __contains__() and __getitem__(), but instead creates __contains__() and __getitem__() functions that do an MRO search and call the superclass method. This is why __contains__() and __getitem__() of dict subclasses are slower.

    Is it possible to modify update_one_slot() so that, if no wrapper is found, the explicit implementation is inherited?

    SO answer: https://stackoverflow.com/a/59914459/1763602
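
The MRO lookup described in that answer is observable from pure Python: because the heap type's mp_subscript dispatches through the class, a later override on the class changes plain indexing too. A sketch of the mechanism (with a hypothetical class D), which also shows why the slot cannot simply be copied from the base type:

```python
class D(dict):
    pass

d = D(a=1)
print(d['a'])  # 1

# The dispatching mp_subscript installed for the heap type resolves
# __getitem__ through the MRO on every call, so assigning the method
# on the class retroactively changes the indexing syntax as well:
D.__getitem__ = lambda self, key: 'patched'
print(d['a'])  # patched
```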