Closed rhettinger closed 5 years ago
Benchmark show what writes to class variables are anomalously slow.
class A(object):
pass
A.x = 1 # This write is 3 to 5 times slower than other writes.
FWIW, the same operation for old-style classes in Python 2.7 was several times faster.
We should investigate to understand why the writes are so slow. There might be a good reason or there might be an opportunity for optimization.
-------------------------------------------------
$ python3.8 Tools/scripts/var_access_benchmark.py
Variable and attribute read access:
4.3 ns read_local
4.6 ns read_nonlocal
14.5 ns read_global
19.0 ns read_builtin
18.4 ns read_classvar_from_class
16.2 ns read_classvar_from_instance
24.7 ns read_instancevar
19.7 ns read_instancevar_slots
19.5 ns read_namedtuple
26.4 ns read_boundmethod
Variable and attribute write access: 4.4 ns write_local 5.1 ns write_nonlocal 18.2 ns write_global 103.9 ns write_classvar \<== Outlier 35.4 ns write_instancevar 25.6 ns write_instancevar_slots
It seems 50% of the overhead (50ns) is due to two reasons:
30-40% is due to the call to update_slot(type, name) after the item is set in the class dictionary.
70-60% is due to all the extra work from _PyObject_GenericSetAttrWithDict until it ends calling _PyObjectDict_SetItem with the slot. If in the code of typeobject.c:type_setattro you change:
res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);
by
PyObject* dictptr = _PyObject_GetDictPtr(type);
res = _PyObjectDict_SetItem(type, dictptr, name, value);
and delete the update_slot(type, name) call afterwards, the times are reduced to 50ns.
can you include your python 2.7 runs? for me it looks similar
can you include your python 2.7 runs? for me it looks similar
It will give similar results unless you switch to old-style classes (edit out the inheritance from object).
class A:
pass
A.x = 1
$ python2.7 var_access_benchmark.py
Variable and attribute read access:
6.7 ns read_local
10.9 ns read_global
18.9 ns read_builtin
24.4 ns read_classvar_from_class
30.2 ns read_classvar_from_instance
22.7 ns read_instancevar
25.5 ns read_instancevar_slots
99.3 ns read_namedtuple
40.9 ns read_boundmethod
Variable and attribute write access: 8.2 ns write_local 18.7 ns write_global 32.9 ns write_classvar \<== Not formerly an outlier 32.5 ns write_instancevar 31.2 ns write_instancevar_slots
It turns out that "update_slot()" is always called, even when we are not updating a slot name (which is always a special dunder-name). The linear search for names in "updateslots()" is a huge waste of time here, and short-circuiting out of it when the name does not start with "\" cuts the overall update time by 50%. I pushed a PR.
Another improvement would be a sub-linear algorithm for searching the slot name, but that's a bigger change.
This are the timings that I am measuring with PR 11907:
Variable and attribute read access: 5.7 ns read_local 5.9 ns read_nonlocal 16.2 ns read_global 24.5 ns read_builtin 20.9 ns read_classvar_from_class 20.0 ns read_classvar_from_instance 29.7 ns read_instancevar 24.7 ns read_instancevar_slots 22.9 ns read_namedtuple 36.8 ns read_boundmethod
Variable and attribute write access: 6.9 ns write_local 6.9 ns write_nonlocal 26.7 ns write_global 65.4 ns write_classvar \<----- Down from 120.6 ns 49.5 ns write_instancevar 34.5 ns write_instancevar_slots
Wow, I didn't expect to get an immediate win this of this magnitude :-)
Some profiling using 'perf'. This is for cpython 63fa1cfece4912110ce3a0ff11fb3ade3ff5e756.
children self [...] + 97.27% 0.00% run_mod (inlined) + 88.53% 6.33% PyObject_SetAttr + 79.34% 6.80% type_setattro + 43.92% 43.92% update_slot + 26.87% 5.84% _PyObject_GenericSetAttrWithDict + 14.62% 6.52% insertdict + 8.80% 8.80% lookdict_unicode_nodummy + 6.57% 0.00% _Py_DECREF (inlined) + 5.19% 5.19% PyUnicode_InternInPlace + 3.57% 3.57% _PyObjectDict_SetItem + 3.38% 3.38% _PyType_Lookup + 1.71% 0.00% _Py_INCREF (inlined) + 1.42% 0.00% _Py_XDECREF (inlined) [...]
After applying PR 11907:
children self [...] + 94.76% 0.00% run_mod (inlined) + 75.22% 6.77% PyObject_SetAttr + 64.65% 13.08% type_setattro + 47.51% 11.96% _PyObject_GenericSetAttrWithDict + 22.99% 13.95% insertdict + 10.10% 10.10% lookdict_unicode_nodummy + 9.47% 9.47% PyUnicode_InternInPlace + 7.10% 7.10% _PyObjectDict_SetItem + 7.02% 7.02% _PyType_Lookup + 6.52% 0.00% _Py_DECREF (inlined) + 3.17% 0.00% _Py_INCREF (inlined) + 3.16% 0.00% _Py_XDECREF (inlined) + 2.59% 0.00% PyDict_SetItem (inlined) + 1.50% 0.00% is_dunder_name (inlined) [...]
The PyUnicode_InternInPlace() can mostly be eliminated by testing PyUnicode_CHECK_INTERNED() first (we already have called PyUnicode_Check() on it). That only gives a 7% speedup on my machine though. The is_dunder_name() is a much bigger optimization.
BTW, 'perf report [...]' has a really neat annotated assembly view. Scroll to the function you are interested in and press 'a'. Press 't' to toggle the time units (left side numbers). I'm attaching a screenshot of the disassembly of the lookdict function. The time units are sample accounts. Each count is a point where the profiler woke up on that specific instruction.
Thanks Neil. The tooling does indeed look nice.
I added your PyUnicode_InternInPlace() suggestion to the PR.
At this point, the PR looks ready-to-go unless any of you think we've missed some low-hanging fruit.
New changeset d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de by Raymond Hettinger (Stefan Behnel) in branch 'master': bpo-36012: Avoid linear slot search for non-dunder methods (GH-11907) https://github.com/python/cpython/commit/d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at =
created_at =
labels = ['interpreter-core', '3.8', 'performance']
title = 'Investigate slow writes to class variables'
updated_at =
user = 'https://github.com/rhettinger'
```
bugs.python.org fields:
```python
activity =
actor = 'rhettinger'
assignee = 'none'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Interpreter Core']
creation =
creator = 'rhettinger'
dependencies = []
files = ['48157']
hgrepos = []
issue_num = 36012
keywords = ['patch']
message_count = 11.0
messages = ['335714', '335733', '335747', '335763', '335776', '335794', '335795', '335992', '335995', '336031', '336129']
nosy_count = 5.0
nosy_names = ['nascheme', 'rhettinger', 'scoder', 'pablogsal', 'Yasser Alshalaan']
pr_nums = ['11907']
priority = 'low'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue36012'
versions = ['Python 3.8']
```