pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

[FEAT] [RFC] Class builder API for high performance bindings #2810

Open lqf96 opened 3 years ago

lqf96 commented 3 years ago

Motivation

Pybind11 provides easy-to-use APIs for writing Python bindings. However, these bindings are often implemented with multiple levels of indirection, which can have a negative impact on performance. For example, class_::def_property and class_::def_property_readonly currently work by wrapping the C++ getter and setter in cpp_function objects and installing a Python-level property in the class dictionary.

Compared to a getset_descriptor implemented directly with PyGetSetDef, this can be 5x-20x slower in terms of both execution time and instruction count. Similarly, constructors implemented through pybind11 can be slower than a low-level constructor function bound directly to the type's tp_init slot. Both problems demonstrate the need for a set of high-performance binding APIs in pybind11.
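The cost of an extra descriptor layer can be seen even in pure Python. The following micro-benchmark (my own construction, not pybind11 code) compares reading a slot attribute directly with reading it through a `property` that wraps a getter, loosely mimicking the indirection described above:

```python
import timeit

class Point:
    """Pure-Python stand-in: a property wrapping a getter adds one more
    descriptor hop on every attribute read."""
    __slots__ = ("_x",)

    def __init__(self):
        self._x = 1.0

    def _get_x(self):  # plays the role of the bound getter
        return self._x

    x = property(_get_x)  # extra layer, like def_property_readonly

p = Point()
direct = timeit.timeit(lambda: p._x, number=100_000)   # slot access, no wrapper
wrapped = timeit.timeit(lambda: p.x, number=100_000)   # property.__get__ then _get_x
print(f"direct={direct:.4f}s wrapped={wrapped:.4f}s ratio={wrapped / direct:.1f}x")
```

The absolute numbers are machine-dependent; the point is only that each additional descriptor hop shows up on the hot path of every access.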

Design

The new API is called "class builder" because it employs the builder pattern:

```cpp
pybind11::class_builder<Example>(m, "Example")
    .def("method", [](Example& e) {})
    .def_class("class_method", [](Example& e) {})
    .def_static("static_method", []() { return 1; })
    .def_attr_readonly("readonly_attr", []() { return 1; })
    .build();
```

To create high-performance bindings for a class, we call pybind11::class_builder instead of pybind11::class_ with almost exactly the same parameters. This gives us a class builder, which we use to define a series of methods and attributes (note that these become getset_descriptors, not property objects). After defining all the members of the class, we call the build method, which creates the Python class and returns a pybind11::class_<...> instance.
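The getset_descriptor vs. property distinction is easy to observe from Python. CPython's own C-level attributes appear as getset_descriptors, while anything built with `property` (which is what pybind11's def_property produces) is a Python-level descriptor:

```python
# A real getset_descriptor from CPython's C level vs. a Python-level property.
gs = type.__dict__["__dict__"]
print(type(gs).__name__)  # getset_descriptor

class PyLevel:
    @property
    def value(self):
        return 42

print(type(PyLevel.__dict__["value"]).__name__)  # property
```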

Internally, def_attr and def_attr_readonly are implemented by adding a new entry to the tp_getset slot, while def, def_class and def_static are implemented by adding a new entry to the tp_methods slot. The build method calls PyType_Ready on the type being built and wraps the finalized PyTypeObject in a pybind11::class_<...> instance.
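The "collect members first, create the type once in build()" flow can be sketched in pure Python (names invented here; the real proposal fills tp_methods/tp_getset and calls PyType_Ready, whereas this analogue just defers the single `type(...)` call):

```python
# Hypothetical pure-Python analogue of the proposed builder: members are
# accumulated in a namespace dict, and the type object is created exactly
# once when build() is called.
class ClassBuilder:
    def __init__(self, name):
        self._name = name
        self._namespace = {}

    def def_(self, name, fn):                  # analogue of .def(...)
        self._namespace[name] = fn
        return self

    def def_attr_readonly(self, name, fget):   # C version would fill tp_getset
        self._namespace[name] = property(fget)
        return self

    def build(self):                           # analogue of PyType_Ready + wrap
        return type(self._name, (), self._namespace)

Example = (ClassBuilder("Example")
           .def_("method", lambda self: "called")
           .def_attr_readonly("readonly_attr", lambda self: 1)
           .build())

e = Example()
print(e.method(), e.readonly_attr)  # called 1
```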

Misc

bstaletic commented 3 years ago

If we can really get this kind of performance improvement, I am very much interested. How confident are you that this considerable effort is going in the right direction?

I must say that I'm not a fan of a "duplicate" API. Reworking the internals of py::class_ and changing its API, on the other hand, would be a breaking change. Maintaining duplicate APIs is always a maintenance burden as well. Pybind11 still supports PYBIND11_PLUGIN...

I was also thinking about moving away from heap types and creating static ones. That would require type objects with static lifetime and I don't think pybind can do that without leaking the type objects.

lqf96 commented 3 years ago

@bstaletic Actually, I'm not sure... I'm relatively confident that avoiding PyCFunction and instancemethod can speed up method and attribute invocations, but even then they wouldn't be as fast as direct Python bindings because of the overhead of overloads and conversions. As for setting the slots vs. setting the type dictionary for magic methods, I'm not certain. Perhaps I can come up with a prototype and benchmark it against direct bindings to see how much improvement we can get in the best case?

bstaletic commented 3 years ago

they wouldn't be as fast as direct Python bindings because of the overhead of overloads and conversions.

That's a given. Conversions can be tamed with extensive usage of noconvert, even if a little annoying. Let's leave overloads aside, because the proposed refactoring is already pretty big.

As for setting the slots vs setting the type dictionary for magic methods, I'm not certain.

This is the one I was most interested in. I'm sure I don't need to tell you that actual numbers showing where pybind11 spends its time are going to help, either in convincing others that this is a good idea or in saving time by not going all in for tiny gains.

lqf96 commented 3 years ago

As for setting the slots vs setting the type dictionary for magic methods, I'm not certain.

This is the one I was most interested in. I'm sure I don't need to tell you that actual numbers showing where pybind11 spends its time are going to help, either in convincing others that this is a good idea or in saving time by not going all in for tiny gains.

Actually, supporting slots for magic methods complicates things, so I'm thinking about just ignoring them for the time being. Let me create a prototype that handles method and attribute definition but does not support overloading. I'll then compare it to direct bindings to see how it performs.

lqf96 commented 3 years ago

Update: I created a very rough prototype and compared the performance of pybind11 bindings, the proposed API bindings and direct bindings (Python C APIs only) in four situations: type initialization, method calls, property reads and property writes.

The benchmark results are as follows:

| Binding Type | type-init | call-method | get-prop | set-prop |
|--------------|-----------|-------------|----------|----------|
| Pybind11     | 918ns     | 438ns       | 962ns    | 1.45us   |
| Proposed     | 453ns     | 453ns       | 906ns    | 1.57us   |
| Direct       | 149ns     | 74ns        | 561ns    | 1.14us   |

Note that the prototype for the proposed class_builder API is rough and reuses cpp_function, so there may be room for performance improvement. Also, class_builder and direct bindings currently convert arguments and the return value manually, while pybind11 bindings do this automatically. It turns out that setting the tp_new slot for initialization is actually faster than calling __init__ (which is what pybind11 does), but on the other hand there is little performance difference in method invocation and property/attribute access. Given that pybind11 property access and method invocation go through multiple levels of indirection, this result is certainly surprising.
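The tp_new vs. __init__ gap has a rough pure-Python analogue (my own illustration, not the prototype's benchmark): constructing a builtin like `object` stays entirely on the C allocation path, while a class with a Python-level `__init__` pays for an extra interpreted call on every construction:

```python
import timeit

class WithInit:
    """Construction runs a Python-level __init__, loosely analogous to
    routing initialization through __init__ instead of tp_new."""
    def __init__(self):
        self.ready = True

t_no_init = timeit.timeit(object, number=100_000)   # pure C allocation path
t_init = timeit.timeit(WithInit, number=100_000)    # allocation plus __init__ call
print(f"object(): {t_no_init:.4f}s  WithInit(): {t_init:.4f}s")
```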

Another problem with the proposed API is that it uses the closure feature of libffi, because PyMethodDef does not have a field for storing the closure pointer of the member function. This means an extra dependency has to be added to pybind11 (although libffi is already required if you want to build the ctypes module of the Python standard library), which definitely requires more discussion.

bstaletic commented 3 years ago

Given that in pybind11 property access and method invocation goes through multiple levels of indirection, the result is surely surprising.

You can say that again. I was really expecting some more encouraging numbers. The "direct" row in the table shows that there is quite an overhead to what pybind is doing...

For the record, I'm not making any calls right now. I'd like to hear from others first.

YannickJadoul commented 3 years ago

Thanks for the overview of your proposal, @lqf96 (and apologies for my delayed feedback) !

The proof-of-concept route indeed seems like the right one to take, and should clarify some things. Given the most recent results, some profiling might also help.

Apart from that, it will also really be up to @wjakob to judge whether such an interface is in the scope of pybind11. The kind of abstractions that pybind11 provides will of course always have some overhead, but I agree that 5-20x is a lot.

lqf96 commented 3 years ago

You can say that again. I was really expecting some more encouraging numbers. The "direct" row in the table shows that there is quite an overhead to what pybind is doing...

For the record, I'm not making any calls right now. I'd like to hear from others first.

Indirection here means that, as I said above, every pybind11 method is implemented as a cpp_function (exposed through PyCFunction) wrapped in an instancemethod. Additionally, each property has this whole chain wrapped again in a property stored in the class dictionary, whereas if you implement a method or attribute with the pure Python C API, the implementation is provided directly in PyMethodDef or PyGetSetDef without any wrapping.

The benchmark result is surprising to me because I initially thought these multiple levels of wrapping were the culprit behind pybind11's call overhead; however, replacing them with direct PyMethodDef and PyGetSetDef definitions did not improve performance. This suggests that the major overhead lies within cpp_function, and that we probably want to optimize the function-call dispatch mechanism instead.
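To make "overhead within cpp_function" concrete, here is a toy model (my own construction, with invented names) of what a dispatch layer like cpp_function's does on every call: run per-argument converters, then forward to the target. The conversion machinery itself sits on the hot path regardless of how the callable is wrapped:

```python
# Toy model of cpp_function-style dispatch: every call runs per-argument
# converters before forwarding to the target function.
def make_dispatcher(fn, converters):
    def dispatcher(*args):
        converted = [conv(arg) for conv, arg in zip(converters, args)]
        return fn(*converted)
    return dispatcher

add = make_dispatcher(lambda a, b: a + b, (float, float))
print(add("1", 2))  # 3.0
```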

lqf96 commented 3 years ago

@YannickJadoul Regardless of the benchmark results I got, I still want to move cpp_function out of pybind11.h, because pybind11.h is too long and this makes it very hard to do any potential refactoring. I can revive #2807 by simply moving the code of cpp_function to a separate header file and avoiding any further refactoring in that PR.

YannickJadoul commented 3 years ago

I currently don't really see the advantage of that outweighing all the potential merge conflicts and confusion, to be fair.

lqf96 commented 3 years ago

@YannickJadoul From the experimental results, it seems that improving the efficiency of cpp_function could help with performance... Here are a few potential ideas I've come up with:

I'm fine with not moving the code for cpp_function, but these refactors can significantly expand the cpp_function code, which will also have a negative impact on code readability and editor responsiveness (ugh, not everyone is using vim or emacs)...

bstaletic commented 3 years ago

For Python 3.7+, use METH_FASTCALL calling convention.

Doesn't that disable foo(x = 3)-style arguments? I was thinking about making the METH_* flags configurable, with the default remaining METH_VARARGS | METH_KEYWORDS, so as not to break backwards compatibility.

lqf96 commented 3 years ago

For Python 3.7+, use METH_FASTCALL calling convention.

Doesn't that disable foo(x = 3)-style arguments? I was thinking about making the METH_* flags configurable, with the default remaining METH_VARARGS | METH_KEYWORDS, so as not to break backwards compatibility.

Why? Keyword arguments can be supported with METH_FASTCALL | METH_KEYWORDS, for which you provide a _PyCFunctionFastWithKeywords. There really shouldn't be any backwards incompatibility here...
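As a pure-Python sanity check (my addition, not from the thread): CPython's own `sorted()` builtin is defined in C with the fastcall-plus-keywords convention in recent versions, yet keyword arguments work as usual:

```python
# sorted() is a C-level builtin using the fastcall calling convention in
# modern CPython, and it still accepts keyword arguments normally:
print(sorted([3, 1, 2], reverse=True))  # [3, 2, 1]
print(type(sorted).__name__)            # builtin_function_or_method
```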

bstaletic commented 3 years ago

Oh, I wasn't aware you can mix METH_KEYWORDS and METH_FASTCALL. Thanks for pointing that out.

YannickJadoul commented 3 years ago

I'm fine with not moving the code for cpp_function, but these refactors can significantly expand the cpp_function code, which will also have a negative impact on code readability and editor responsiveness (ugh, not everyone is using vim or emacs)...

No, if that's part of a rewrite, then refactoring can make perfect sense (maybe not even to detail, but just to cpp_function.h). It's just the moving-cpp_function-for-the-sake-of-refactoring that I'm not too convinced about.

For the record, I've sent @wjakob a message, since he, as the original author and project leader, is still the one to call the shots on what is in or out of scope for pybind11. Any demonstration of improvements in efficiency at a reasonable cost in maintenance/code size/binary size can only help convince him, though :-)

  • Rework argument type conversions so that for each pair of Python argument and corresponding C++ type, it is converted only once.

This would disable (or massively complicate) overload resolution, though? But I'm curious to see if that's not the case.
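To make the question concrete, here is a toy sketch (my own, with invented names) of how per-pair conversion caching could coexist with overload resolution: each (argument, target type) conversion is attempted lazily and memoized, so trying several overloads never repeats a successful conversion:

```python
# Toy sketch: cache each (argument, target type) conversion during overload
# resolution so repeated attempts across overloads reuse the result.
def resolve(overloads, args):
    cache = {}

    def convert(arg, typ):
        key = (id(arg), typ)
        if key not in cache:
            cache[key] = typ(arg)      # raises if the conversion fails
        return cache[key]

    for signature, fn in overloads:
        if len(signature) != len(args):
            continue
        try:
            converted = [convert(a, t) for a, t in zip(args, signature)]
        except (TypeError, ValueError):
            continue                   # conversion failed; try next overload
        return fn(*converted)
    raise TypeError("no matching overload")

overloads = [((int, int), lambda a, b: a + b),
             ((str, str), lambda a, b: a + b)]
print(resolve(overloads, ("1", "2")))  # 3 (the int overload matches first)
```

This only sketches the caching idea; a real implementation would also need to handle implicit-conversion ranking and noconvert, which is where the "massively complicate" concern comes in.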