Closed by skrah 7 months ago.
Can you try --disable-gil with current main?
The figures are almost the same, around 30% (--enable-gil) and 43% (--disable-gil).
The base slowdown of 30% is of course due to the module state changes. For comparison, and to see that the i5-8600K is not inherently underperforming with threads:
The ideal situation is implemented in libmpdec++: It has a C++11 thread local context, which is also called for every inline operation. The C++11 TLS adds just 4% overhead compared to pure C (where the context is passed directly to the functions).
This of course would be quite difficult to achieve in CPython. But the thread local context may not even be the primary culprit. In Martin von Loewis' first module state implementation the basic slowdown for _decimal was 20%, and a good part of that was due to the heap types.
> The ideal situation is implemented in libmpdec++: It has a C++11 thread local context, which is also called for every inline operation. The C++11 TLS adds just 4% overhead compared to pure C (where the context is passed directly to the functions). This of course would be quite difficult to achieve in CPython.

The wording was not ideal: the key point is that C++ achieves this without a cached context, which is of course now in the module state.
@encukou, any thoughts on this?
~~PyModule_GetState()~~ Getting a module state via PyType_GetModuleByDef() seems to be expensive for heavy use. It would be nice if a heap type could own the module state's address at its creation time to recover more than 10%.
> PyModule_GetState() via PyType_GetModuleByDef() seems to be expensive for heavy use. It would be nice if a heap type could own the module state's address at its creation time to recover more than 10%.

_PyModule_GetState() is used, not PyModule_GetState(); I'd be surprised if this is the culprit. OTOH, PyType_GetModuleByDef() can be more expensive. Storing the state pointer in the type struct is a possibility.
@ericsnowcurrently Can a module state be accessed correctly with the following?

```c
_Py_thread_local module_state *thread_local_state;  // set at the exec phase
```

If not invalid, it would be useful for functions that currently have no way to get the module state. A _Py_thread_local pointer may not be so fast to access, but it can also bypass a state struct?
Multiple interpreters can run in the same thread, so a thread-local may be ambiguous in that case.
As to storing a reference to the module state on the type object, that shouldn't be necessary. Every heap type has an ht_module field that points to the module, accessible via PyType_GetModule(). From there you call PyModule_GetState(). In fact, we already have PyType_GetModuleState(), which combines those two calls. That's fairly efficient, though obviously not as efficient as a static global.
The only caveat is that PyType_GetModule() (and thus PyType_GetModuleState()) operates relative to the given type. Thus subclasses make things trickier. IIRC, when they're involved, you end up having to walk the MRO one way or another, which is what PyType_GetModuleByDef() does.
FWIW, the common case where PyType_GetModuleByDef() is used is either in "tp slot" functions or in intermediate static functions that aren't doing the right thing. In both cases the problem lies in throwing away the expensive information we already have. For tp slot functions the API doesn't pass the module in, even though it could; we're stuck with backward compatibility issues there. For intermediate static functions, I expect it's all too common to figure out the module (or module state) at some "leaf node" function but not pass it through the intermediate functions. Thus the relatively expensive PyType_GetModuleByDef() gets called more often than it needs to be.
I suppose another problem case is with callbacks that don't provide a way to pass the module through. I'm not sure that's a common issue though.
Back to the _decimal module, I haven't looked at what usage patterns are resulting in such a significant slowdown.
> Thus subclasses make things trickier. IIRC, when they're involved, you end up having to walk the MRO one way or another, which is what PyType_GetModuleByDef() does.

Yes; heap types that are not base classes can go the fast route via PyType_GetModuleState() (and _PyType_GetModuleState()).
On February 2, 2024 11:10:49 PM GMT+01:00, Eric Snow @.***> wrote:
> [quoted text of the previous comment, repeated verbatim]
-- Reply to this email directly or view it on GitHub: https://github.com/python/cpython/issues/114682#issuecomment-1924776487
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
@encukou I think your comment might have gotten lost? All I see is quoted text.
The float object gets the luxury of its own freelist and avoids the slow PyObject_GC_* functions altogether. The Decimal constructor is a pure function (the context is only used for recording errors) and Decimals are immutable, just like floats.

It is likely that 15 percentage points can be shaved off just by doing the same as in floatobject.c and then focusing on the thread local context (indeed, PyType_GetModuleByDef() is called at dozens of places).
AFAICS, we can trim away a lot of the PyType_GetModuleByDef calls. For example, a lot of them are happening in methods; we can use the METH_METHOD calling convention on practically all of these, so we can access the module state using _PyType_GetModuleState:

Also (not performance relevant), the find_state_left_or_right() calls in the Py_tp_richcompare slots can be replaced with get_module_state_by_def() calls.
FTR: I did a quick proof-of-concept for some low-hanging fruit, and I already got a ~~4%~~ ~~10%~~ 12% speedup.
So, here's my suggested approach. Let's move the relevant parts of bench.py over to pyperformance. That way, it is easier to keep an eye on performance regressions like this; the faster-cpython team is alert to sudden changes in a particular benchmark.
I'll try to tear out as many PyType_GetModuleByDef calls as possible and create a PR, hopefully within this week. The diff is going to be huge, but OTOH it is a mechanical change.
Thank you for getting to this so fast! Let me know if you need any help, and feel free to assign me to reviews.
> Let me know if you need any help, and feel free to assign me to reviews.
Perhaps you could migrate bench.py over to pyperformance?
For now, I've used this patch for local benchmarking with pyperf[^1]:
[^1]: for the pi benchmark only!
@skrah Do you have a significant regression test case where subclasses are frequently used? 3.12 (single-phase init/static types) was almost as slow as main in the given test for me once D = C.Decimal was replaced with class D(C.Decimal).
> @skrah Do you have a significant regression test case where subclasses are frequently used? 3.12 (single-phase init/static types) was almost as slow as main in the given test for me once D = C.Decimal was replaced with class D(C.Decimal).
PyType_GetModuleByDef is slower for subclasses. Reducing the number of PyType_GetModuleByDef calls will have an impact.
@neonene Your test is reasonable! Subclasses are so slow that within 3.9 float is slowed down by 3.3 times and _decimal by 2.2 times.

The slowdown for subclasses within main is something like 4.5 times for float and 1.9 times for decimal. That seems to suggest, as I mentioned earlier, that certain things are better optimized for base class floats than for base class Decimals.
But in general, compared to 3.9, subclasses themselves seem to have gotten faster in main, so all comparisons are a bit messy.
Regarding the module-state access after METH_METHOD (PEP 573) is applied to _decimal's methods, the concerns will move to hot tp_* functions, if any. A possible optimization would be:

```c
PyObject *module = PyType_GetModule(type);
if (!module || _PyModule_GetDef(module) != module_def) {
    PyErr_Clear();
    // Subclasses can ignore the overhead for now? Otherwise,
    // the type should cache the module state (refcount-free) with the def.
    module = PyType_GetModuleByDef(type, module_def);
}
module_state *state = _PyModule_GetState(module);
```
Currently, a faster version of PyType_GetModule() is not offered, right?
> Regarding the module-state access after METH_METHOD (PEP 573) is applied to _decimal's methods, the concerns will move to hot tp_* functions, if any.
In 2018, I wrote a change to use FASTCALL calling convention which made the telco benchmark 22% faster. But Stefan Krah was the module maintainer and was against the idea, so I gave up. See issue gh-73487.
> PyType_GetModuleByDef is slower for subclasses. Reducing the number of PyType_GetModuleByDef calls will have an impact.
If the PyType_GetModuleByDef() overhead (compared to the previous version, which didn't use it) is too high, we can directly cache important data (such as decimal_state *state) in instances (ex: PyDecObject), rather than getting it from the instance type.
> [...] we can cache directly important data (such as decimal_state *state) in instances (ex: PyDecObject), rather than getting it from the instance type.
Yes, I already suggested it; we do this in the sqlite3 extension module with great success.
> In 2018, I wrote a change to use FASTCALL calling convention which made the telco benchmark 22% faster. But Stefan Krah was the module maintainer and was against the idea, so I gave up. See issue gh-73487.
I see Serhiy's post points to a bpo post where Stefan disapproved of applying Argument Clinic only. Well, I used Argument Clinic extensively in my patch for speeding up _decimal; it did not break a single test ;)
> Currently, a faster version of PyType_GetModule() is not offered, right?

That would be using PyHeapTypeObject.ht_module directly. If we know it's a heap type, the check for Py_TPFLAGS_HEAPTYPE can become an assert.
That is an extraordinary comment from Mr. Stinner. It is entirely irrelevant to this issue:

- This issue is about the slowdown introduced by the module state changes.
- If AC speeds up _decimal in the fix, that speedup just camouflages the module state slowdown and makes honest benchmarks for this issue impossible!

In the linked FASTCALL issue, both @rhettinger and ultimately Mr. Stinner himself concurred with me or at least acquiesced.

- At that time the new FASTCALL convention was already on the horizon and would have required another change!
- At that time I planned to use the new FASTCALL convention directly.
- I was not the "module maintainer", but the author. Big difference.
In short, bringing up a heavily editorialized version of the actual events just distracts from the problems at hand.
I withdraw myself from this discussion; I am no longer working on this issue.
> If PyType_GetModuleByDef() overhead (compared to the previous version which didn't use it) is too high, we can cache directly important data (such as decimal_state *state) in instances (ex: PyDecObject), rather than getting it from the instance type.
The cache in a known object is mainly used to reduce function arguments in the call chains, right? Indeed, many PyType_GetModuleByDef() calls can be removed there.

I think a cache in the type instance can be used in the tp slot functions where PyType_GetModuleByDef() needs to be placed to exact-check the given object's type through a module state, which seems not to be beneficial for the current very slow subclasses.

I think we can measure/consider the boost from FASTCALL separately.
> That is an extraordinary comment from Mr. Stinner. It is entirely irrelevant to this issue
I reacted to a comment about METH_METHOD. For me, a common way to use METH_METHOD is to use Argument Clinic with cls: defining_class.
@skrah METH_METHOD|METH_FASTCALL might be (worse than) neutral on the pi test. I'm leaving an experimental PR #115196, which was my first AC experience.
@neonene Thanks for doing this! AC does not seem to affect the number methods (nb_add etc.) used in the pi benchmark. So it is indeed entirely irrelevant to this issue (as you hinted at).
I cannot reproduce the stated 22% improvement for the telco benchmark (telco.py full) either. I get something around 10%.
If you want to work on a patch without AC, I can review it. The strategy suggested earlier of caching the state in the decimal object seems to be the most fruitful (intuitively, the actual patch will require some experimentation).
Generally, it would be good to have a reference module in Python core that does not use AC. AC is unused/not available in the vast majority of C-extensions. Many modules in the scientific ecosystem may want to use module state and vectorcall but not AC.
~~I hit a crash at the METH_METHOD method whose defining_class argument was NULL: https://github.com/python/cpython/pull/115196#discussion_r1484911702. Does PEP 573 require an argument check before PyType_GetModuleState(cls)?~~

> I hit a crash at the METH_METHOD method whose defining_class argument was NULL

I take that back. *_impl(self, cls) has several callers in the module which previously passed NULL as a dummy.
I got no performance change by switching to _PyType_GetModuleState() in the *_impl functions: 1eeb531. As to PEP 573, applying METH_METHOD only to hot methods would be enough, if any.
```python
def pi():
    import _decimal
    D = _decimal.Decimal
    for i in range(10000):
        lasts, t, s, n, na, d, da = D(0), D(3), D(3), D(1), D(0), D(0), D(24)
        while s != lasts:
            lasts = s
            n, na = n+na, na+8
            d, da = d+da, da+32
            t = (t * n) / d
            s += t
```
Module state access related? functions:
I'm getting a smaller speedup by using PyType_GetModuleState() in the number methods and a larger one by using PyObject_New for the decimal constructor.
The _threadmodule itself uses PyObject_New in new_thread_handle.
@ericsnowcurrently, is there a reason why PyObject_New cannot be used for the immutable Decimal type?
@erlend-aasland:

> I withdraw myself from this discussion; I am no longer working on this issue.

@skrah:

> That is an extraordinary comment from Mr. Stinner. It is entirely irrelevant to this issue

Well, same as Erlend for me: I unsubscribe. I'm not interested in working on an issue with such a tone in the discussion.
> is there a reason why PyObject_New cannot be used for the immutable Decimal type?
I'm not aware of any reason. It isn't a GC type, right?
> I'm not aware of any reason. It isn't a GC type, right?
Yes, up to 3.12 it wasn't a GC type. It was made a GC type in main during the module state changes.
I now found some explanations in PEP-630, but I still do not understand it:
> Instances of heap types hold a reference to their type. This ensures that the type isn't destroyed before all its instances are, but may result in reference cycles that need to be broken by the garbage collector.
How would one create a cycle?
```pycon
>>> x = Decimal("3.222")
>>> x.y = x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    x.y = x
    ^^^
AttributeError: 'decimal.Decimal' object has no attribute 'y' and no __dict__ for setting new attributes
>>> x = Decimal("3.222")
>>> t = type(x)
>>> t.y = x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    t.y = x
    ^^^
TypeError: cannot set 'y' attribute of immutable type 'decimal.Decimal'
>>> x.adjusted = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    x.adjusted = 10
    ^^^^^^^^^^
AttributeError: 'decimal.Decimal' object attribute 'adjusted' is read-only
```
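Decimal itself blocks all of these paths, but for a heap type that does allow attribute setting, the cycle the PEP warns about is easy to produce. A sketch (the Node name is purely illustrative):

```python
import gc
import weakref

def make_cycle():
    # A heap type created at runtime, deliberately not stored anywhere.
    Node = type("Node", (), {})
    n = Node()
    Node.sample = n       # type -> instance (via the type's __dict__)
    # Every instance of a heap type also holds a reference to its type,
    # so n -> Node -> n is now a reference cycle.
    return weakref.ref(n)

ref = make_cycle()
gc.collect()              # reference counting alone cannot reclaim the pair
assert ref() is None      # the cycle collector had to break it
```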
> The strategy suggested earlier of caching the state in the decimal object seems to be the most fruitful (intuitively, the actual patch will require some experimentation).
@skrah I have posted a draft: gh-115401
> Yes, up to 3.12 it wasn't a GC type. It has been made a GC type in main during the module state changes.
Ah, there are definitely some subtleties involved with GC types, especially when switching from a non-GC type. I don't recall the specifics, but I'm sure @encukou does.
It might be more interesting to start with static (non-GC) types: these are “immortal” (statically allocated), but they do have a bunch of dynamically allocated, interpreter-specific data (e.g. subclasses, weakrefs). Getting this data properly cleaned up requires rather special hacks, appropriate for core types that really are created/destroyed with the interpreter itself, rather than extensions loaded on demand (see the code around static_types[] in Objects/object.c for details).
On the other hand, GC types (heap types) use reference counting like any other Python object. Not saying that getting rid of all the assumptions of static-ness/immortality is/was smooth, but at least the model is much clearer: we have a good idea of what the details should be, even though getting there is a bumpy road full of backcompat issues.
> Instances of heap types hold a reference to their type. This ensures that the type isn't destroyed before all its instances are, but may result in reference cycles that need to be broken by the garbage collector.
This is a generic statement. A class with no __dict__ nor user-settable attributes probably can't create a reference cycle. While we could omit the instance→type reference only in cases where one can prove this, I don't think such a special case would be worth it.
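That instance→type reference is easy to observe from Python, even for a class with no __dict__. A sketch using the stdlib gc module (the Point name is illustrative):

```python
import gc

# A heap type whose instances have no __dict__ and no settable attributes.
Point = type("Point", (), {"__slots__": ()})
p = Point()

# Instances of heap types visit their type during GC traversal ...
assert Point in gc.get_referents(p)
# ... whereas instances of a non-GC type like float traverse nothing.
assert gc.get_referents(1.0) == []
```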
> How would one create a cycle?

PEP 573 and type_traverse() say:

> Usually, creating a class with ht_module set will create a reference cycle involving the class and the module.
To make the Decimal type a non-GC heap type, at least the module arg needs to be NULL in PyType_FromMetaclass(), like thread_handle_type in _threadmodule?
I agree that PyObject_GC_*() has some overhead, but I'm not sure how to hold a module safely outside the Decimal type.
FYI - skrah has been banned from python spaces, quoting details from my message posted on a duplicate-ish issue: https://github.com/python/cpython/issues/114922#issuecomment-1989726811
""" Procedural comment: For all involved on this issue, @skrah was banned from Python spaces per a Conduct Working Group recommendation to the Steering Council in 2020. A configuration oversight on our part in 2020 made that not so on Github. The intended ban of @skrah was reinstated on 2024-02-22. Our apologies to all for the consequences of this mixup (it was the first core dev ban done on our modern systems, we now understand how and why mistakes were made).
Stefan was informed of this privately via email and re-offered the opportunity to apologize for past actions and to accept our code of conduct going forwards per the original 2020 ban's terms and notification. We have not received a reply.
I'm posting this here as we view replying in the same forum where issues recently surfaced a reasonable way to notify past, present, and potential future participants of an incident response.
This says nothing for or against the validity of this issue. A new issue should be opened if people wish to proceed. I'm locking the comments on this one. For information on what the Github concept of a ban means, read the Github org ban docs.
-- The Python Steering Council, 2024 edition """
Bug report
Bug description:
The pi benchmark in Modules/_decimal/tests/bench.py is up to 31% slower with --enable-gil and up to 45% slower with --disable-gil compared to Python 3.9 as the baseline. Change the number of iterations to 100000 for more stable benchmarks. This is heavily OS and hardware dependent. The worst case cited above was observed on a six-core i5-8600K.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs