Investigate slow writes to class variables

rhettinger commented 5 years ago

BPO	36012
Nosy	@nascheme, @rhettinger, @scoder, @pablogsal, @0xYasser
PRs	python/cpython#11907
Files	perf-screenshot.png

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', '3.8', 'performance'] title = 'Investigate slow writes to class variables' updated_at = user = 'https://github.com/rhettinger' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'none' closed = True closed_date = closer = 'rhettinger' components = ['Interpreter Core'] creation = creator = 'rhettinger' dependencies = [] files = ['48157'] hgrepos = [] issue_num = 36012 keywords = ['patch'] message_count = 11.0 messages = ['335714', '335733', '335747', '335763', '335776', '335794', '335795', '335992', '335995', '336031', '336129'] nosy_count = 5.0 nosy_names = ['nascheme', 'rhettinger', 'scoder', 'pablogsal', 'Yasser Alshalaan'] pr_nums = ['11907'] priority = 'low' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'performance' url = 'https://bugs.python.org/issue36012' versions = ['Python 3.8'] ```

rhettinger commented 5 years ago

Benchmark show what writes to class variables are anomalously slow.

    class A(object):
        pass

    A.x = 1   # This write is 3 to 5 times slower than other writes.

FWIW, the same operation for old-style classes in Python 2.7 was several times faster.

We should investigate to understand why the writes are so slow. There might be a good reason or there might be an opportunity for optimization.

-------------------------------------------------

$ python3.8 Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
   4.3 ns   read_local
   4.6 ns   read_nonlocal
  14.5 ns   read_global
  19.0 ns   read_builtin
  18.4 ns   read_classvar_from_class
  16.2 ns   read_classvar_from_instance
  24.7 ns   read_instancevar
  19.7 ns   read_instancevar_slots
  19.5 ns   read_namedtuple
  26.4 ns   read_boundmethod

Variable and attribute write access: 4.4 ns write_local 5.1 ns write_nonlocal 18.2 ns write_global 103.9 ns write_classvar \<== Outlier 35.4 ns write_instancevar 25.6 ns write_instancevar_slots

pablogsal commented 5 years ago

It seems 50% of the overhead (50ns) is due to two reasons:

30-40% is due to the call to update_slot(type, name) after the item is set in the class dictionary.
70-60% is due to all the extra work from _PyObject_GenericSetAttrWithDict until it ends calling _PyObjectDict_SetItem with the slot. If in the code of typeobject.c:type_setattro you change:

res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);

by

PyObject* dictptr = _PyObject_GetDictPtr(type);
res = _PyObjectDict_SetItem(type, dictptr, name, value);

and delete the update_slot(type, name) call afterwards, the times are reduced to 50ns.

5e3abc84-47aa-4662-964d-fc66af7a9f69 commented 5 years ago

can you include your python 2.7 runs? for me it looks similar

rhettinger commented 5 years ago

can you include your python 2.7 runs? for me it looks similar

It will give similar results unless you switch to old-style classes (edit out the inheritance from object).

    class A:
        pass

    A.x = 1

$ python2.7 var_access_benchmark.py
Variable and attribute read access:
   6.7 ns   read_local
  10.9 ns   read_global
  18.9 ns   read_builtin
  24.4 ns   read_classvar_from_class
  30.2 ns   read_classvar_from_instance
  22.7 ns   read_instancevar
  25.5 ns   read_instancevar_slots
  99.3 ns   read_namedtuple
  40.9 ns   read_boundmethod

Variable and attribute write access: 8.2 ns write_local 18.7 ns write_global 32.9 ns write_classvar \<== Not formerly an outlier 32.5 ns write_instancevar 31.2 ns write_instancevar_slots

scoder commented 5 years ago

It turns out that "update_slot()" is always called, even when we are not updating a slot name (which is always a special dunder-name). The linear search for names in "updateslots()" is a huge waste of time here, and short-circuiting out of it when the name does not start with "\" cuts the overall update time by 50%. I pushed a PR.

Another improvement would be a sub-linear algorithm for searching the slot name, but that's a bigger change.

pablogsal commented 5 years ago

This are the timings that I am measuring with PR 11907:

Variable and attribute read access: 5.7 ns read_local 5.9 ns read_nonlocal 16.2 ns read_global 24.5 ns read_builtin 20.9 ns read_classvar_from_class 20.0 ns read_classvar_from_instance 29.7 ns read_instancevar 24.7 ns read_instancevar_slots 22.9 ns read_namedtuple 36.8 ns read_boundmethod

Variable and attribute write access: 6.9 ns write_local 6.9 ns write_nonlocal 26.7 ns write_global 65.4 ns write_classvar \<----- Down from 120.6 ns 49.5 ns write_instancevar 34.5 ns write_instancevar_slots

rhettinger commented 5 years ago

Wow, I didn't expect to get an immediate win this of this magnitude :-)

nascheme commented 5 years ago

Some profiling using 'perf'. This is for cpython 63fa1cfece4912110ce3a0ff11fb3ade3ff5e756.

children self [...] + 97.27% 0.00% run_mod (inlined) + 88.53% 6.33% PyObject_SetAttr + 79.34% 6.80% type_setattro + 43.92% 43.92% update_slot + 26.87% 5.84% _PyObject_GenericSetAttrWithDict + 14.62% 6.52% insertdict + 8.80% 8.80% lookdict_unicode_nodummy + 6.57% 0.00% _Py_DECREF (inlined) + 5.19% 5.19% PyUnicode_InternInPlace + 3.57% 3.57% _PyObjectDict_SetItem + 3.38% 3.38% _PyType_Lookup + 1.71% 0.00% _Py_INCREF (inlined) + 1.42% 0.00% _Py_XDECREF (inlined) [...]

After applying PR 11907:

children self [...] + 94.76% 0.00% run_mod (inlined) + 75.22% 6.77% PyObject_SetAttr + 64.65% 13.08% type_setattro + 47.51% 11.96% _PyObject_GenericSetAttrWithDict + 22.99% 13.95% insertdict + 10.10% 10.10% lookdict_unicode_nodummy + 9.47% 9.47% PyUnicode_InternInPlace + 7.10% 7.10% _PyObjectDict_SetItem + 7.02% 7.02% _PyType_Lookup + 6.52% 0.00% _Py_DECREF (inlined) + 3.17% 0.00% _Py_INCREF (inlined) + 3.16% 0.00% _Py_XDECREF (inlined) + 2.59% 0.00% PyDict_SetItem (inlined) + 1.50% 0.00% is_dunder_name (inlined) [...]

The PyUnicode_InternInPlace() can mostly be eliminated by testing PyUnicode_CHECK_INTERNED() first (we already have called PyUnicode_Check() on it). That only gives a 7% speedup on my machine though. The is_dunder_name() is a much bigger optimization.

nascheme commented 5 years ago

BTW, 'perf report [...]' has a really neat annotated assembly view. Scroll to the function you are interested in and press 'a'. Press 't' to toggle the time units (left side numbers). I'm attaching a screenshot of the disassembly of the lookdict function. The time units are sample accounts. Each count is a point where the profiler woke up on that specific instruction.

rhettinger commented 5 years ago

Thanks Neil. The tooling does indeed look nice.

I added your PyUnicode_InternInPlace() suggestion to the PR.

At this point, the PR looks ready-to-go unless any of you think we've missed some low-hanging fruit.

rhettinger commented 5 years ago

New changeset d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de by Raymond Hettinger (Stefan Behnel) in branch 'master': bpo-36012: Avoid linear slot search for non-dunder methods (GH-11907) https://github.com/python/cpython/commit/d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de

python / cpython

Investigate slow writes to class variables #80193