hpyproject / hpy

HPy: a better API for Python
https://hpyproject.org
MIT License

Support vectorcall protocol #390

Closed fangerer closed 1 year ago

fangerer commented 1 year ago

Resolves #389 .

This PR enables HPy extensions to implement the vectorcall protocol for HPy types. To be clear, this is about providing the possibility to implement the vectorcall protocol on the receiver side (i.e. on the type) as in PEP 590.

In contrast to the C API, I tried to make the common case as easy as possible. Implementing the vectorcall protocol in HPy is now (in the common case) very simple and looks like this:

HPyDef_VECTORCALL(SomeObject_vectorcall)
static HPy
SomeObject_vectorcall_impl(HPyContext *ctx, HPy callable, HPy *args, HPy_ssize_t nargsf, HPy kwnames)
{
    // ...
}

static HPyDef *SomeObject_defines[] = { &SomeObject_vectorcall, NULL };

That's it.

As you might notice, this means that every instance of the type automatically uses the same vectorcall function implementation. However, one important improvement of PEP 590 is that you can have different (possibly specialized) vectorcall function implementations per object. To provide this flexibility, I've introduced the API function HPyVectorcall_Set, which sets an arbitrary vectorcall function on an object. For example:

HPyVectorcall_FUNCTION(Point_special_vectorcall)
static HPy
Point_special_vectorcall_impl(HPyContext *ctx, HPy callable, HPy *args, HPy_ssize_t nargsf, HPy kwnames)
{
    // ...
}

HPyDef_SLOT(Point_new, HPy_tp_new)
static HPy Point_new_impl(HPyContext *ctx, HPy cls, HPy *args, HPy_ssize_t nargs, HPy kw)
{
    // ...
    HPyVectorcall_Set(ctx, h_point, &Point_special_vectorcall);
    // ...
}

As indicated in the above example, HPyVectorcall_Set is meant to be used in the object constructor, but there is no restriction on when it can be used. Using the macro HPyVectorcall_FUNCTION is encouraged since it generates the appropriate CPython trampoline and fills the required HPyVectorcall struct.

Some more explanation

HPyDef_VECTORCALL(SYM) is an alias for HPyDef_SLOT(SYM, HPy_tp_vectorcall_default). So, this just defines an HPy-specific slot HPy_tp_vectorcall_default, which is the default vectorcall function that will be used for all objects. If ctx_Type_FromSpec recognizes this slot, the following happens behind the scenes:

  1. An additional field (of type vectorcallfunc) will be added (at the end) to the CPython object. This increases the basic size (by sizeof(vectorcallfunc)). It is appended to the object because otherwise the *_AsStruct calls would return an incorrect pointer.
  2. Flag Py_TPFLAGS_HAVE_VECTORCALL is set automatically.
  3. Member __vectorcalloffset__ will be added to the C API slots automatically (using the offset of the hidden field).
  4. In case the type also has a custom slot HPy_tp_new, we assume that HPy_New will be used for allocation, which will take care of writing the default vectorcall function pointer to the object (see ctx_type.c:1408).
  5. In case HPy_tp_new is not provided, we wrap the inherited tp_new function with hpyobject_new (see ctx_type.c:265) which takes care of that.
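To make the hidden-field mechanism in the list above more concrete, here is a minimal, self-contained sketch in plain C. It does not use the HPy or CPython headers, and every name in it (sketch_type, sketch_new, sketch_set, sketch_call, and the two sample functions) is hypothetical. It only illustrates the idea of appending a function pointer after the user's data, recording its offset, writing the default on allocation, and optionally overwriting it per object. It assumes the user's basic size is already suitably aligned for a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CPython's vectorcallfunc (the real type takes
 * PyObject* arguments). */
typedef int (*sketch_vectorcallfunc)(void *callable);

/* Hypothetical per-type record. */
typedef struct {
    size_t vectorcall_offset;   /* where the hidden field lives (step 3) */
    size_t basicsize;           /* user size + hidden field (step 1) */
    sketch_vectorcallfunc default_vectorcall;
} sketch_type;

/* Steps 1 and 3: append the hidden field after the user's data so that
 * pointers to the user's struct stay valid. */
static void sketch_type_init(sketch_type *t, size_t user_basicsize,
                             sketch_vectorcallfunc default_fn)
{
    /* Assumption of this sketch: no padding is needed before the field. */
    assert(user_basicsize % _Alignof(sketch_vectorcallfunc) == 0);
    t->vectorcall_offset = user_basicsize;
    t->basicsize = user_basicsize + sizeof(sketch_vectorcallfunc);
    t->default_vectorcall = default_fn;
}

/* Steps 4 and 5: allocation writes the default function pointer. */
static void *sketch_new(const sketch_type *t)
{
    unsigned char *obj = calloc(1, t->basicsize);
    if (obj == NULL)
        return NULL;
    memcpy(obj + t->vectorcall_offset, &t->default_vectorcall,
           sizeof t->default_vectorcall);
    return obj;
}

/* Per-object override, in the spirit of HPyVectorcall_Set. */
static void sketch_set(const sketch_type *t, void *obj,
                       sketch_vectorcallfunc fn)
{
    memcpy((unsigned char *)obj + t->vectorcall_offset, &fn, sizeof fn);
}

/* A caller reads the pointer at the recorded offset and invokes it. */
static int sketch_call(const sketch_type *t, void *obj)
{
    sketch_vectorcallfunc fn;
    memcpy(&fn, (unsigned char *)obj + t->vectorcall_offset, sizeof fn);
    return fn(obj);
}

/* Two sample call implementations, distinguishable by return value. */
static int default_impl(void *callable) { (void)callable; return 1; }
static int special_impl(void *callable) { (void)callable; return 2; }
```

An object created with sketch_new dispatches to default_impl until sketch_set installs special_impl on that one object; other objects of the same type are unaffected.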

Restrictions

Misc

From a performance point of view, object creation should not be significantly slower (compared to CPython's vectorcall API): (1) if the vectorcall protocol is not implemented, we just do an additional type flag check, and (2) if the protocol is implemented, we might do an additional write to the hidden field in case the user overwrites the default function.

I have not written documentation for this yet. I will do so in a follow-up PR.

fangerer commented 1 year ago

@hodgestar left some comments in the IRC channel. I'm posting them here for documentation:

Every time I look at the old C API for it, I go "arg" a lot. It feels more like a performance hack that got exposed than an API. However, I'm also not sure what to do about it.

@fangerer Do you have an important / good example use case for the per-instance vector call? What would prevent people who want per-instance calls from just adding their own C function pointer to their struct and doing it themselves?

It feels like we know that a better way is to have our "argument clinic-esque" API for JITs and similar, but that is a lot of work. :/

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

Would it be possible to remove HPy_VECTORCALL_ARGUMENTS_OFFSET from our API and, for example, make a new rule that one can always overwrite args[0] (i.e. pass the actual array instead of a pointer to the second element)?

fangerer commented 1 year ago

@hodgestar: Here are my answers:

Do you have an important / good example use case for the per-instance vector call?

I don't have a real-world example. I think the idea would be that you can have specialized call function implementations depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created." So, the real-world example is "calls to classes".

Would it be possible to remove HPy_VECTORCALL_ARGUMENTS_OFFSET from our API and, for example, make a new rule that one can always overwrite args[0] (i.e. pass the actual array instead of a pointer to the second element)?

Sure and sounds good to me since it makes it very clear.
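For readers unfamiliar with the flag being discussed, the following is a rough, self-contained sketch of the PEP 590 convention that HPy_VECTORCALL_ARGUMENTS_OFFSET mirrors. It uses plain long values instead of handles, and all names (SKETCH_ARGUMENTS_OFFSET, sketch_nargs, sum_args, call_with_self) are hypothetical. Under PEP 590 the flag is the top bit of the argument count, and when it is set the callee may temporarily overwrite args[-1], which lets a bound-method-style wrapper prepend self without copying.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same bit position as PY_VECTORCALL_ARGUMENTS_OFFSET in PEP 590. */
#define SKETCH_ARGUMENTS_OFFSET ((size_t)1 << (8 * sizeof(size_t) - 1))

/* Counterpart of PyVectorcall_NARGS: strip the flag to get the count. */
static size_t sketch_nargs(size_t nargsf)
{
    return nargsf & ~SKETCH_ARGUMENTS_OFFSET;
}

/* Hypothetical callee: sums its positional arguments. */
static long sum_args(const long *args, size_t nargsf)
{
    long total = 0;
    size_t nargs = sketch_nargs(nargsf);
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* Bound-method-style forwarding: behaves like sum_args(self, *args). */
static long call_with_self(long self, long *args, size_t nargsf)
{
    size_t nargs = sketch_nargs(nargsf);
    if (nargsf & SKETCH_ARGUMENTS_OFFSET) {
        /* Fast path: the caller reserved args[-1]; borrow and restore it. */
        long saved = args[-1];
        args[-1] = self;
        long result = sum_args(args - 1, nargs + 1);
        args[-1] = saved;
        return result;
    }
    /* Slow path: copy into a fresh array that has room for self. */
    long *copy = malloc((nargs + 1) * sizeof *copy);
    if (copy == NULL)
        return -1;  /* sketch only: real code would report the error */
    copy[0] = self;
    if (nargs > 0)
        memcpy(copy + 1, args, nargs * sizeof *copy);
    long result = sum_args(copy, nargs + 1);
    free(copy);
    return result;
}
```

If the rule proposed above were adopted (the caller always passes the real array and slot 0 is always writable), the flag test and the copying branch in a wrapper like call_with_self would presumably no longer be needed.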

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

I'm not sure if we even need to take caution concerning compatibility with arg clinic. I think there are two aspects:

  1. Assume an extension author already implements the vectorcall protocol using this PR and then we introduce arg clinic. Compatibility would mean that the extension doesn't need to be migrated. I think that's easily possible.
  2. We want to (internally) use the arg clinic calling machinery to call vectorcall functions. IMO, that is also possible since we just need to implement arg clinic in a way that it can call the vectorcall signature.

Or did I misunderstand your comment?

steve-s commented 1 year ago
Do you have an important / good example use case for the per-instance vector call?

I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created." So, the real world example is "calls to classes"

The author of nanobind asks for this in CPython stable ABI in here: https://discuss.python.org/t/ideas-for-forward-compatible-and-fast-extension-libraries-in-python-3-12. IIRC he mentioned somewhere that nanobind uses/can use this for all functions (my possibly wrong understanding: every function is a separate object with vectorcall).

Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility.

I think there is one more thing to vectorcall (and again, maybe I just misunderstand it :-)). Citing from PEP-590:

"Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class cls, at least one intermediate object is created for each call in the sequence type.__call__, cls.__new__, cls.__init__."

wjakob commented 1 year ago

I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data.

I can give one example from nanobind: its function object dispatches calls to C++ using either a simple implementation (only positional arguments supported) or a complex implementation (with handling of default values, keyword arguments, variable argument count, etc.) that is significantly slower. When a new function object is created, it sets the appropriate vector call dispatcher based on the properties of the function.
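The pattern described here can be sketched in a few lines of plain C. This is not nanobind's actual code. The struct, the two dispatchers, and the rule for picking between them are all hypothetical, and long values stand in for Python objects. The point is only that the call entry point is chosen once, when the function object is created, and stored on the object itself.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_func;

/* Per-object call entry point, analogous to a vectorcall function. */
typedef long (*sketch_dispatcher)(const struct sketch_func *f,
                                  const long *args, size_t nargs);

typedef struct sketch_func {
    sketch_dispatcher vectorcall;  /* chosen once at creation time */
    size_t arity;                  /* number of parameters */
    bool has_defaults;             /* may trailing arguments be omitted? */
    long default_value;            /* value used for omitted arguments */
} sketch_func;

/* Simple path: positional arguments only, exact count required. */
static long dispatch_simple(const sketch_func *f, const long *args,
                            size_t nargs)
{
    if (nargs != f->arity)
        return -1;  /* sketch only: stands in for raising TypeError */
    long total = 0;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* Complex path: extra work to fill in defaults for missing arguments. */
static long dispatch_complex(const sketch_func *f, const long *args,
                             size_t nargs)
{
    if (nargs > f->arity)
        return -1;
    long total = 0;
    for (size_t i = 0; i < f->arity; i++)
        total += (i < nargs) ? args[i] : f->default_value;
    return total;
}

/* Creation picks the cheapest dispatcher that can handle this function. */
static void sketch_func_init(sketch_func *f, size_t arity,
                             bool has_defaults, long default_value)
{
    f->arity = arity;
    f->has_defaults = has_defaults;
    f->default_value = default_value;
    f->vectorcall = has_defaults ? dispatch_complex : dispatch_simple;
}
```

With a single function pointer per type, every call would have to go through the complex path or branch on the function's properties at call time; a per-object pointer moves that decision to construction.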

mattip commented 1 year ago

This has conflicts. Does it replace #251?

fangerer commented 1 year ago

Does it replace #251?

@mattip: No, this PR is basically about supporting something like Py_tp_vectorcall_offset in the type spec (in other words: this is about how to implement the callee). PR #251 is about how to call something (like HPy_CallTupleDict).

As far as I got from the discussions here and in the dev calls: we are still not sure if this is the way to go. @hodgestar argued that the vectorcall protocol mostly looks like an exposed implementation detail but with a little bit of extra functionality (in particular, the fact that you can have different call function implementations per object).

I still think the changes in this PR make sense for the following reasons:

Anyway, before merging this, I would like to have more feedback. In particular from @antocuni .

This has conflicts.

They would be easy to resolve if we decide to merge this.

fangerer commented 1 year ago

Not done yet, but I pushed in case people are interested in the progress and to trigger the tests.

fangerer commented 1 year ago

Big update on the PR. I've addressed most of @antocuni 's points. Here is a summary:

Some other remarks: