python / cpython

The Python programming language
https://www.python.org
Other
63.51k stars 30.42k forks source link

Investigate slow writes to class variables #80193

Closed rhettinger closed 5 years ago

rhettinger commented 5 years ago
BPO 36012
Nosy @nascheme, @rhettinger, @scoder, @pablogsal, @0xYasser
PRs
  • python/cpython#11907
  • Files
  • perf-screenshot.png
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', '3.8', 'performance'] title = 'Investigate slow writes to class variables' updated_at = user = 'https://github.com/rhettinger' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'none' closed = True closed_date = closer = 'rhettinger' components = ['Interpreter Core'] creation = creator = 'rhettinger' dependencies = [] files = ['48157'] hgrepos = [] issue_num = 36012 keywords = ['patch'] message_count = 11.0 messages = ['335714', '335733', '335747', '335763', '335776', '335794', '335795', '335992', '335995', '336031', '336129'] nosy_count = 5.0 nosy_names = ['nascheme', 'rhettinger', 'scoder', 'pablogsal', 'Yasser Alshalaan'] pr_nums = ['11907'] priority = 'low' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue36012' versions = ['Python 3.8'] ```

    rhettinger commented 5 years ago

    Benchmark show what writes to class variables are anomalously slow.

        class A(object):
            pass
    
        A.x = 1   # This write is 3 to 5 times slower than other writes.

    FWIW, the same operation for old-style classes in Python 2.7 was several times faster.

    We should investigate to understand why the writes are so slow. There might be a good reason or there might be an opportunity for optimization.

    -------------------------------------------------

    $ python3.8 Tools/scripts/var_access_benchmark.py
    Variable and attribute read access:
       4.3 ns   read_local
       4.6 ns   read_nonlocal
      14.5 ns   read_global
      19.0 ns   read_builtin
      18.4 ns   read_classvar_from_class
      16.2 ns   read_classvar_from_instance
      24.7 ns   read_instancevar
      19.7 ns   read_instancevar_slots
      19.5 ns   read_namedtuple
      26.4 ns   read_boundmethod

    Variable and attribute write access: 4.4 ns write_local 5.1 ns write_nonlocal 18.2 ns write_global 103.9 ns write_classvar \<== Outlier 35.4 ns write_instancevar 25.6 ns write_instancevar_slots

    pablogsal commented 5 years ago

    It seems 50% of the overhead (50ns) is due to two reasons:

    res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);

    by

    PyObject* dictptr = _PyObject_GetDictPtr(type);
    res = _PyObjectDict_SetItem(type, dictptr, name, value);

    and delete the update_slot(type, name) call afterwards, the times are reduced to 50ns.

    5e3abc84-47aa-4662-964d-fc66af7a9f69 commented 5 years ago

    can you include your python 2.7 runs? for me it looks similar

    rhettinger commented 5 years ago

    can you include your python 2.7 runs? for me it looks similar

    It will give similar results unless you switch to old-style classes (edit out the inheritance from object).

        class A:
            pass
    
        A.x = 1

    $ python2.7 var_access_benchmark.py
    Variable and attribute read access:
       6.7 ns   read_local
      10.9 ns   read_global
      18.9 ns   read_builtin
      24.4 ns   read_classvar_from_class
      30.2 ns   read_classvar_from_instance
      22.7 ns   read_instancevar
      25.5 ns   read_instancevar_slots
      99.3 ns   read_namedtuple
      40.9 ns   read_boundmethod

    Variable and attribute write access: 8.2 ns write_local 18.7 ns write_global 32.9 ns write_classvar \<== Not formerly an outlier 32.5 ns write_instancevar 31.2 ns write_instancevar_slots

    scoder commented 5 years ago

    It turns out that "update_slot()" is always called, even when we are not updating a slot name (which is always a special dunder-name). The linear search for names in "updateslots()" is a huge waste of time here, and short-circuiting out of it when the name does not start with "\" cuts the overall update time by 50%. I pushed a PR.

    Another improvement would be a sub-linear algorithm for searching the slot name, but that's a bigger change.

    pablogsal commented 5 years ago

    This are the timings that I am measuring with PR 11907:

    Variable and attribute read access: 5.7 ns read_local 5.9 ns read_nonlocal 16.2 ns read_global 24.5 ns read_builtin 20.9 ns read_classvar_from_class 20.0 ns read_classvar_from_instance 29.7 ns read_instancevar 24.7 ns read_instancevar_slots 22.9 ns read_namedtuple 36.8 ns read_boundmethod

    Variable and attribute write access: 6.9 ns write_local 6.9 ns write_nonlocal 26.7 ns write_global 65.4 ns write_classvar \<----- Down from 120.6 ns 49.5 ns write_instancevar 34.5 ns write_instancevar_slots

    rhettinger commented 5 years ago

    Wow, I didn't expect to get an immediate win this of this magnitude :-)

    nascheme commented 5 years ago

    Some profiling using 'perf'. This is for cpython 63fa1cfece4912110ce3a0ff11fb3ade3ff5e756.

    children self [...] + 97.27% 0.00% run_mod (inlined) + 88.53% 6.33% PyObject_SetAttr + 79.34% 6.80% type_setattro + 43.92% 43.92% update_slot + 26.87% 5.84% _PyObject_GenericSetAttrWithDict + 14.62% 6.52% insertdict + 8.80% 8.80% lookdict_unicode_nodummy + 6.57% 0.00% _Py_DECREF (inlined) + 5.19% 5.19% PyUnicode_InternInPlace + 3.57% 3.57% _PyObjectDict_SetItem + 3.38% 3.38% _PyType_Lookup + 1.71% 0.00% _Py_INCREF (inlined) + 1.42% 0.00% _Py_XDECREF (inlined) [...]

    After applying PR 11907:

    children self [...] + 94.76% 0.00% run_mod (inlined) + 75.22% 6.77% PyObject_SetAttr + 64.65% 13.08% type_setattro + 47.51% 11.96% _PyObject_GenericSetAttrWithDict + 22.99% 13.95% insertdict + 10.10% 10.10% lookdict_unicode_nodummy + 9.47% 9.47% PyUnicode_InternInPlace + 7.10% 7.10% _PyObjectDict_SetItem + 7.02% 7.02% _PyType_Lookup + 6.52% 0.00% _Py_DECREF (inlined) + 3.17% 0.00% _Py_INCREF (inlined) + 3.16% 0.00% _Py_XDECREF (inlined) + 2.59% 0.00% PyDict_SetItem (inlined) + 1.50% 0.00% is_dunder_name (inlined) [...]

    The PyUnicode_InternInPlace() can mostly be eliminated by testing PyUnicode_CHECK_INTERNED() first (we already have called PyUnicode_Check() on it). That only gives a 7% speedup on my machine though. The is_dunder_name() is a much bigger optimization.

    nascheme commented 5 years ago

    BTW, 'perf report [...]' has a really neat annotated assembly view. Scroll to the function you are interested in and press 'a'. Press 't' to toggle the time units (left side numbers). I'm attaching a screenshot of the disassembly of the lookdict function. The time units are sample accounts. Each count is a point where the profiler woke up on that specific instruction.

    rhettinger commented 5 years ago

    Thanks Neil. The tooling does indeed look nice.

    I added your PyUnicode_InternInPlace() suggestion to the PR.

    At this point, the PR looks ready-to-go unless any of you think we've missed some low-hanging fruit.

    rhettinger commented 5 years ago

    New changeset d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de by Raymond Hettinger (Stefan Behnel) in branch 'master': bpo-36012: Avoid linear slot search for non-dunder methods (GH-11907) https://github.com/python/cpython/commit/d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de