Closed fangerer closed 1 year ago
@hodgestar left some comments in the IRC channel. I'm posting them here for documentation:
Every time I look at the old C API for it, I go "arg" a lot. It feels more like a perfomance hack that got exposed than an API. However, I'm also not sure what to do about it.
@fangerer Do you have an important / good example use case for the per-instance vector call? What would prevent people who want per-instance calls from just adding their own C function pointer to their struct and doing it themselves?
It feels like we know that a better way is to have our "argument clinic-esque" API for JITs and similar, but that is a lot of work. :/
Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.
Would it be possible to remove HPy_VECTORCALL_ARGUMENTS_OFFSET
from our API and, for example, make a new rule that one can always overwrite args[0]
(i.e. pass the actual array instead of a pointer to the second element)?
@hodgestar: Here are my answers:
Do you have an important / good example use case for the per-instance vector call?
I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: _"Another source of inefficiency in the tpcall convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created." So, the real world example is "calls to classes"
Would it be possible to remove
HPy_VECTORCALL_ARGUMENTS_OFFSET
from our API and, for example, make a new rule that one can always overwriteargs[0]
(i.e. pass the actual array instead of a pointer to the second element)?
Sure and sounds good to me since it makes it very clear.
Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.
I'm not sure if we even need to take caution concerning compatibility with arg clinic. I think there are two aspects:
Or did I misunderstand your comment?
Do you have an important / good example use case for the per-instance vector call?
I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created." So, the real world example is "calls to classes"
The author of nanobind asks for this in CPython stable ABI in here: https://discuss.python.org/t/ideas-for-forward-compatible-and-fast-extension-libraries-in-python-3-12. IIRC he mentioned somewhere that nanobind uses/can use this for all functions (my possibly wrong understanding: every function is a separate object with vectorcall).
Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.
I think there is one more thing to vectorcall (and again, maybe I just misunderstand it :-)). Citing from PEP-590:
_Another source of inefficiency in the tpcall convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class cls, at least one intermediate object is created for each call in the sequence type.call, cls.new, cls.init.
I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data.
I can give one example from nanobind: its function object dispatches calls to C++ using either a simple implementation (only positional arguments supported) or a complex implementation (with handling of default values, keyword arguments, variable argument count, etc.) that is significantly slower. When a new function object is created, it sets the appropriate vector call dispatcher based on the properties of the function.
This has conflicts. Does it replace #251?
Does it replace #251?
@mattip: No, this PR is basically about supporting something like Py_tp_vectorcall_offset
in the type spec (in other words: this is about how to implement the callee). PR #251 is about how to call something (like HPy_CallTupleDict
).
As far as I got from the discussions here and in the dev calls: we are still not sure if this is the way to go. @hodgestar argued that the vectorcall protocol mostly looks like an exposed implementation detail but with a little bit of extra functionality (in particular, the fact that you can have different call function implementations per object).
I still think the changes in this PR make sense because of following reasons:
tp_vectorcall_offset
that points into the C struct of the type and then the constructor needs to set the function. This is way simpler here in HPy since we introduced a default function.Anyway, before merging this, I would like to have more feedback. In particular from @antocuni .
This has conflicts.
They would be easy to resolve if we decide to merge this.
Not done yet but pushed if people are interested on the progress and to trigger the tests.
Big update on the PR. I've addressed most of @antocuni 's points. Here is a summary:
HPy_tp_call
) and this is mapped to CPython's vectorcall protocol. It is not possible to define the legacy _tpcall protocol in an HPyType_Spec
.HPyFunc_keywords
now is:
typedef HPy (*HPyFunc_keywords)(HPyContext *ctx, HPy self, const HPy *args, size_t nargs, HPy kwnames);
Therefore, we are now also implementing HPyFunc_KEYWORDS
with METH_FASTCALL | METH_KEYWORDS
to avoid unnecessary and slow argument conversion (see C API function signature _PyCFunctionFastWithKeywords). We already use METH_FASTCALL
for HPyFunc_VARARGS
.
HPyHelpers_PackArgsAndKeywords
to convert from fastcall/vectorcall calling convention to the legacy convention (with args tuple and keywords dict).Some other remarks:
HPy_ssize_t nargs
by size_t nargs
mostly because (a) I think a negative count doesn't make sense, and (b) to be compatible to CPython's vectorcall signature.PY_VECTORCALL_ARGUMENTS_OFFSET
flag. nargs
will always just be the positional argument count.HPy_tp_new
and HPy_tp_init
with HPyFunc_keywords
because I feared that this could have a significant performance impact since CPython is calling tp_new
directly and unpacking the keywords dict is an expensive operation. I plan to do some measurements. Anyway, I would like to do this in a separate PR.__vectorcalloffset__
. Should we also intercept that and use a different name?
Resolves #389 .
This PR enables HPy extensions to implement the vectorcall protocol for HPy types. To be clear, this is about providing the possibility to implement the vectorcall protocol on the receiver side (i.e. on the type) as in PEP 590.
In contrast to the C API, I tried to make the common case as easy as possible. Implementing the vectorcall protocol in HPy is now (in the common case) very simple and looks like this:
That's it.
As you might notice, this means that each instance of the type will automatically use the same vectorcall function implementation but one important improvement of PEP 590 is that you can have different (maybe specialized) vectorcall function implementations per object. In order to provide this flexibility, I've introduced API function
HPyVectorcall_Set
that allows to set an arbitrary vectorcall function on an object. For example:As indicated in the above example,
HPyVectorcall_Set
is meant to be used in the object constructor but there is no restriction when it can be used. MacroHPyVectorcall_FUNCTION
is encouraged to be used since it generates the appropriate CPython trampoline and fills the requiredHPyVectorcall
struct.Some more explanation
HPyDef_VECTORCALL(SYM)
is an alias forHPyDef_SLOT(SYM, HPy_tp_vectorcall_default)
. So, this just defines an HPy-specific slotHPy_tp_vectorcall_default
which is the default vectorcall function that will be used for all objects. Ifctx_Type_FromSpec
recognized this slot, following happens behind the scenes:vectorcallfunc
) will be added (at the end) to the CPython object. This increases the basic size (bysizeof(vectorcallfunc)
). It is appended to the object because otherwise the*_AsStruct
calls would return an incorrect pointer.Py_TPFLAGS_HAVE_VECTORCALL
is set automatically__vectorcalloffset__
will be added to the C API slots automatically (using the offset of the hidden field).HPy_tp_new
, we assume thatHPy_New
will be used for allocation which will take care of writing the default vectorcall function pointer to the object (see ctx_type.c:1408).HPy_tp_new
is not provided, we wrap the inheritedtp_new
function withhpyobject_new
(see ctx_type.c:265) which takes care of that.Restrictions
tp_call
function and delegate to the vectorcall impl.HPyDef_VECTORCALL
with var objects (i.e. whereitemsize > 0
). Right now, this is not a big restriction since HPy does not really support them. Howver, it is still possible to do the manual way similar to how it is done in the C API: add a fieldHPyVectorcall vectorcall
to the type's struct, define member__vectorcalloffset__
, set flagHPy_TPFLAGS_HAVE_VECTORCALL
, ...). This is also covered by a test.Misc
From a performance point of view, object creation should not be significantly slower (compared to CPython's vectorcall API) because if (1) the vectorcall protocol is not implemented, we just do an additional type flag check, and if (2) the protocol is implemented, we might do an additional write to the hidden field in case the user overwrites the default function.
I still did not write documentation about that. I will do in a follow-up PR.