Optimise calling machinery

ronaldoussoren commented 3 years ago

The machinery for calling methods can be optimised. See also #350.

Python -> Objective-C:

the descriptor lookup currently peeks at the class on every attribute lookup, that is fairly expensive
- add a lookup cache to leaf classes: this changes the semantics slightly and it should be possible to disable, but it avoids the whole MRO walk (see also PyObjC_FAST_BUT_INEXACT which speeds up resolving inherited methods)
- [NOPE] avoid calling PyObjCClassCheckMethodList when not necessary, this is an expensive operation and is not necessary when the attribute is already in the class dict (PyObjCClass... is no longer expensive, that was in an earlier version of PyObjC, avoiding that call doesn't change performance and complicates the code)
The libffi_caller function handles all kinds of method calls. It should be possible to add variants for common types of methods (either a simpler variant of libffi_caller that handles a limited subset of method types, or even some specialised method callers). [I have implemented a simpler variant of objc.function that mostly does this, still using the generic mechanism, and that is significantly faster; the full suggestion in this bullet is even faster but bloats the core bridge]
[DONE] libffi_caller creates and destroys a libffi context on every call, cache those in the callable object (this is a layering violation, but might help performance) [for objc.function the ffi_cif is already cached, for objc.selector this might be harder due to bound selectors; vectorcall might help there]
Stare at CPython's attribute caching code and try to work with that for further performance gains (that should fix a lot of the performance difference in successful attribute lookup in steady state, there's currently a significant difference between the two).
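A minimal pure-Python sketch of the leaf-class lookup cache idea from the list above. This is not PyObjC's implementation: `CachedLookup`, `find_in_mro` and `invalidate` are hypothetical names, and the real bridge does this in C and also has to consult the ObjC runtime. It only illustrates why the cache avoids the MRO walk and why it changes semantics slightly (a stale entry is returned until the cache is invalidated).

```python
class CachedLookup:
    """Resolve attribute names along a class's MRO, caching the result per class."""

    def __init__(self):
        self._cache = {}     # (cls, name) -> resolved attribute
        self.mro_walks = 0   # counts slow-path walks, for illustration only

    def find_in_mro(self, cls, name):
        key = (cls, name)
        try:
            return self._cache[key]       # fast path: no MRO walk
        except KeyError:
            pass
        self.mro_walks += 1
        for klass in cls.__mro__:         # slow path: walk the whole MRO
            if name in vars(klass):
                value = vars(klass)[name]
                self._cache[key] = value
                return value
        raise AttributeError(name)

    def invalidate(self, name=None):
        """Drop cached entries, e.g. after a method is added at runtime."""
        if name is None:
            self._cache.clear()
        else:
            for key in [k for k in self._cache if k[1] == name]:
                del self._cache[key]


class Base:
    def description(self):
        return "base"


class Leaf(Base):
    pass


lookup = CachedLookup()
first = lookup.find_in_mro(Leaf, "description")
second = lookup.find_in_mro(Leaf, "description")   # served from the cache
```

In this sketch, replacing `Base.description` after the first lookup is not noticed until `invalidate()` is called, which is the kind of semantic change that makes the option worth being able to disable.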

Objective-C to Python:

Create simpler stubs for common method signatures (started work on this, not happy yet about the difference)
[DONE] ~~method_stub should use vectorcall on Python 3.9:~~
- ~~Don't use PyList_* APIs to build argument vector, but use a C array instead (stack allocated)~~
- ~~Use PyObject_Vectorcall instead of PyObject_Call~~
[DONE (but differently)] ~~For 3.8 and earlier: Allocate a correctly sized PyTuple directly instead of first building a list and then converting to tuple~~
EDIT: this isn't correct. The closure/stub in ObjC classes currently dynamically looks up the Python method on every call, that's not really necessary. This might need some code to change the Python object stored in the closure when the class is updated at runtime (which shouldn't happen a lot)

Generic

[DONE] ~~Enable LTO (couple of percent faster)~~
Look into PGO as well (using test suite to collect profiling information?)
Move more information into PyObjCMethodSignature objects, in particular
- sizeof, alignof information
- to_py, from_py functions
This could remove most uses of the helper functions for this (except for structs, arrays and the like), and hopefully that helps to remove some overhead and hence increase performance. But that needs to be tested!

ronaldoussoren commented 3 years ago

On my machine with default settings:

object description lookup     : 0.026
NSObject description lookup   : 0.127
NSArray description lookup    : 0.615

object description call       : 0.187
NSObject description call     : 1.069
NSArray description call      : 0.835

With PyObjC_FAST_BUT_INEXACT:

object description lookup     : 0.026
NSObject description lookup   : 0.127
NSArray description lookup    : 0.137   <-- FASTER

object description call       : 0.179
NSObject description call     : 1.051
NSArray description call      : 0.822

With LTO and -O3:

object description lookup     : 0.026
NSObject description lookup   : 0.121
NSArray description lookup    : 0.613

object description call       : 0.178
NSObject description call     : 1.044
NSArray description call      : 0.818

With experimental lookup cache:

object description lookup     : 0.026
NSObject description lookup   : 0.119
NSArray description lookup    : 0.129

object description call       : 0.184
NSObject description call     : 1.052
NSArray description call      : 0.823

With experimental lookup cache and PyObjC_FAST_BUT_INEXACT :

object description lookup     : 0.026
NSObject description lookup   : 0.117
NSArray description lookup    : 0.127

object description call       : 0.179
NSObject description call     : 1.054
NSArray description call      : 0.828
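For readers who want to reproduce this kind of measurement, here is a small sketch of how "lookup" and "call" timings could be collected with `timeit`. It is not the benchmark script used for the numbers above; the `Plain` class, the `bench` helper and the iteration count are assumptions made for illustration, and only the pure-Python "object" rows are covered.

```python
import timeit


class Plain:
    def description(self):
        return "plain"


def bench(label, stmt, namespace, number=100_000):
    """Time `stmt` and return (label, seconds) for `number` iterations."""
    seconds = timeit.timeit(stmt, globals=namespace, number=number)
    return label, seconds


obj = Plain()
namespace = {"obj": obj}
results = [
    # Attribute lookup only: creates the bound method, does not call it.
    bench("object description lookup", "obj.description", namespace),
    # Lookup plus call.
    bench("object description call", "obj.description()", namespace),
]
for label, seconds in results:
    print(f"{label:<30}: {seconds:.3f}")
```

With PyObjC installed, the same two statements could be timed against an `NSObject` or `NSArray` instance placed in `namespace` to get the other rows.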

ronaldoussoren commented 3 years ago

A large part of the libffi support code is shared between objc.function and objc.selector, and the former is easier to isolate. My current plan is therefore to add some benchmarks for objc.function, then optimise the libffi support bits and then finally port that to objc.selector (and objc.IMP).

And finally look into selector specific optimisations as well (such as caching IMPs, although that requires some research to ensure that the cache is invalidated as needed). At first glance this may not be needed, the assembly code for a function calling an ObjC method just calls objc_msgSend and doesn't seem to contain any kind of cache.

ronaldoussoren commented 3 years ago

As a quick experiment I've implemented vectorcall support for objc.function and that appears to improve performance by a couple of percent (for a single argument function).

With this change (not yet committed):

python function call          : 0.026
objc function call            : 0.167

Without vectorcall the objc function call was about 0.175, the new version is about 5% faster (and basically for free). Next up is a stripped down version of PyObjCFFI_ParseArguments (for simple enough functions) and a variant of the vectorcall support function that uses it.

ronaldoussoren commented 3 years ago

The absolute best we can reach:

python3.9 Tools/pyobjcbench.py
python function call          : 0.027
objc function call            : 0.022

This cuts out all overhead by adding a vectorcall implementation variant for functions with signature "double(*)(double)" and selecting that for functions with that signature in their typestring. Then hardcode argument conversion, and call the function without using libffi.

This makes it clear that we have significant overhead in argument parsing and/or the libffi code.

This won't end up as such in the final implementation for objc.function, but may get used for objc.selector (which is used a lot more and which would make the additional code overhead more acceptable with common enough signatures).
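The idea of selecting a specialised caller by signature can be sketched in pure Python as follows. Everything here is illustrative: `SPECIALISED`, `select_caller` and the two caller functions are hypothetical names, the type string `b"dd"` is assumed to mean "returns double, takes one double", and the real experiment does the equivalent in C with a vectorcall variant.

```python
def generic_call(func, args):
    """Stand-in for the generic path: convert every argument in a loop."""
    converted = [float(a) for a in args]
    return func(*converted)


def double_to_double_call(func, args):
    """Specialised path for one double argument and a double result."""
    (value,) = args
    return func(float(value))


# Hypothetical table mapping a type string to a specialised caller.
SPECIALISED = {b"dd": double_to_double_call}


def select_caller(typestr):
    """Pick the caller once, when the callable is created, not on every call."""
    return SPECIALISED.get(typestr, generic_call)


def halve(x):
    return x / 2.0


caller = select_caller(b"dd")
result = caller(halve, (3,))
```

The point of the design is that the signature check happens once per callable, so the per-call cost is only the hardcoded conversion and the call itself.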

ronaldoussoren commented 3 years ago

This is more realistic:

python function call          : 0.026
objc function call            : 0.076

This is with a change that recognises simple enough functions (limited number of arguments, limited total size of arguments, no pass by reference, no blocks, no functions) and replaces the default vectorcall implementation by a simpler one. That implementation is both simpler and avoids a number of calls to PyMem_Malloc.

There's still room for further improvements, but this reduces the overhead of objc.function w.r.t. a builtin function by over 50%.
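The "over 50%" figure can be checked with a little arithmetic on the timings posted in this thread. Which earlier measurement serves as the baseline is an assumption here: the sketch uses the 0.167 result with vectorcall only; using the 0.175 result without vectorcall gives a slightly larger reduction.

```python
python_call = 0.026   # plain Python function call, from the results above
objc_before = 0.167   # objc.function with vectorcall only (assumed baseline)
objc_after = 0.076    # objc.function with the simple fast path

# Overhead is what objc.function costs on top of a plain function call.
overhead_before = objc_before - python_call
overhead_after = objc_after - python_call
reduction = 1 - overhead_after / overhead_before

print(f"overhead: {overhead_before:.3f} -> {overhead_after:.3f} ({reduction:.0%} less)")
```

Under that assumption the overhead drops from roughly 0.141 to 0.050, a reduction of about 65%.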

I'm still not committing, this patch needs further testing (and double checking that the bookkeeping is correct). I also have to think about restructuring some of PyObjC's internal testing because a lot of the tests for function calling will now exclusively use the shortcut path and not the full implementation [either add a build or runtime option to avoid using the optimised versions, or duplicate tests with a variant that uses the slow path due to a pass-by-reference input argument].

Most of the improvement is from the simpler implementation, restoring the PyMem_Malloc call for the argument buffer results in only slightly worse performance (but that's just one of several calls to PyMem_Malloc that are removed in the fast path):

python function call          : 0.027
objc function call            : 0.080

I need to check this, but expect that the simpler implementation does not use more stack than the full implementation, even with the argument buffer on the stack.

ronaldoussoren commented 3 years ago

See also #362

ronaldoussoren commented 3 years ago

"-flto" gives a tiny improvement:

python function call          : 0.027
objc function call            : 0.074

"-fvisibility=hidden" instead of "-fvisibility=protected" doesn't help, but would avoid exposing internals.

ronaldoussoren commented 3 years ago

Dropping use of libffi (just for a specific test case):

python function call          : 0.026
objc function call            : 0.064

The difference is probably not large enough to bother investigating further at this time.

ronaldoussoren commented 3 years ago

It might be interesting to play around with a variant that supports even fewer features (for example only a limited subset of types), which might allow inlining parts of objc_support.m.

ronaldoussoren commented 3 years ago

> It might be interesting to play around with a variant that supports even fewer features (for example only a limited subset of types), which might allow inlining parts of objc_support.m.

Current plan is to experiment with stuffing often used bits from objc_support.m into method-signature.m, that way the big case statements don't have to be used (at a cost of slightly higher memory usage). It is far from clear at this point that this will actually help though.

ronaldoussoren commented 3 years ago

With vectorcall for native selectors and an implementation for "simple" signatures:

object description call       : 0.184
NSObject description call     : 0.975
NSArray description call      : 0.791

The improvement is less than I'd hoped for, I guess I need to add some testing code to determine "native" calling speed.

Interestingly enough I get slightly better performance by not dropping the GIL:

object description call       : 0.181
NSObject description call     : 0.936
NSArray description call      : 0.758

And @try { ... } around the call has some overhead as well:

object description call       : 0.181
NSObject description call     : 0.930
NSArray description call      : 0.757

The improved performance from not dropping the GIL might be interesting enough to introduce metadata for it (not dropping the GIL is not safe in general, some calls will effectively block for a long time).

schriftgestalt commented 3 years ago

Thanks for working on this. I wonder if it would be possible to see in a profile, what code path is used? So that I might be able to adjust the signature of those methods to be sure to use the fast path?

And would it make sense to add a flag that allows switching to a very fast but not as safe code path? In my case, those methods are called (a lot) in the drawRect: of the main view, so every millisecond counts.

ronaldoussoren commented 3 years ago

> Thanks for working on this. I wonder if it would be possible to see in a profile, what code path is used? So that I might be able to adjust the signature of those methods to be sure to use the fast path?

Currently the fast path is for methods and functions with a limited number of arguments (max 8) and where none of them (or the result) requires special handling (no blocks, no pass-by-reference arguments, ...).
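As a rough illustration of that rule, the check could look like the sketch below. It is not the bridge's actual test: `is_simple_signature` and `needs_special_handling` are hypothetical names, the check works on ObjC type encodings (`@?` for a block, `^` for a pointer), and the real code has further conditions such as the total size of the arguments.

```python
MAX_SIMPLE_ARGS = 8  # limit mentioned above


def needs_special_handling(encoding):
    """True for blocks (@?) and anything passed through a pointer (^...)."""
    return encoding == b"@?" or b"^" in encoding


def is_simple_signature(result_type, arg_types):
    """Return True when a signature would qualify for the fast path in this sketch."""
    if len(arg_types) > MAX_SIMPLE_ARGS:
        return False
    return not any(needs_special_handling(t) for t in (result_type, *arg_types))


# -description: returns an object, takes only self and _cmd.
description_ok = is_simple_signature(b"@", [b"@", b":"])

# A method with an (NSError**) out parameter is passed by reference.
with_error_ok = is_simple_signature(b"Z", [b"@", b":", b"@", b"o^@"])
```

Here `description_ok` is true and `with_error_ok` is false, so a callback with an error out-parameter would take the full path under these assumptions.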

One of the things I want to add later on is an option to record statistics about signatures to help me pick a set of signatures that are worthwhile to further optimise.

> And would it make sense to add a flag that allows switching to a very fast but not as safe code path? In my case, those methods are called (a lot) in the drawRect: of the main view, so every millisecond counts.

I might do that, but preferably in a way that allows me to specify in metadata which methods are safe w.r.t. such shortcuts and apply them automatically. But at this time I'm still hoping that I can avoid that.

Note that all statistics in this issue are for calling ObjC from Python. Once I've merged a first set of optimisations for that I'll do something similar for calling from ObjC to Python, which would help your drawRect: use case.

ronaldoussoren commented 3 years ago

Function indirection through pointers is clearly somewhat expensive:

object description call       : 0.183
NSObject description call     : 0.838
NSArray description call      : 0.661

This is clearly faster and the change w.r.t. previous attempts is to introduce a second vectorcall implementation for selectors that is hardcoded to call the "simple" libffi caller instead of calling it through a function pointer.

I might end up with 3 vectorcall variants:

Hardcoded "simple" libffi caller
Hardcoded "full" libffi caller
Indirection through a function pointer for special cases (mostly in framework bindings)

schriftgestalt commented 3 years ago

I think my case can benefit from improving both call directions. To give a real world example: I have classes defined in python like this:

class MyClassName(NSObject):
    def drawSomethingWithOptions_(self, optionsDict):
        scale = _controller.scale()
        rect = NSMakeRect(optionsDict["X"] * scale, optionsDict["Y"] * scale, optionsDict["Width"] * scale, optionsDict["Height"] * scale)
        bezierPath = NSBezierPath.bezierPathWithRect_(rect)
        bezierPath.fill()

And I call this from objC:

- (void) drawRect:(NSRect)dirtyRect {
    [_drawDelegate drawSomethingWithOptions:_drawOptions];
}

This is a very simplified example. And some of the callbacks have an (NSError**)error parameter.

ronaldoussoren commented 3 years ago

NSError** arguments are something I want to look into (or rather the generic variant: pass-by-reference of single values). I think it is possible to fold support for that into the simple variant, but I haven't tried this yet. A major difference between the regular and simple variants is that the latter does fewer memory allocations and that results in some restrictions on what I can do in the simple variant. But "pointer to a single value" should fit into the pattern.

schriftgestalt commented 3 years ago

One thing I notice is that my problems are much worse on python3. Are there changes in python3 that would suggest a behavior like this or might it be caused by the way I init the runtime?

ronaldoussoren commented 3 years ago

Changing the argument list builder in the method stub for calling from ObjC to Python to preallocate a tuple of the right size, instead of incrementally growing a list and converting that to a tuple, is marginally faster (about 4% for that micro benchmark).

Not committing this right now because I get a crash when testing blocks.

ronaldoussoren commented 3 years ago

> One thing I notice is that my problems are much worse on python3. Are there changes in python3 that would suggest a behavior like this or might it be caused by the way I init the runtime?

I don't know, the code for Python 2 and 3 was pretty much the same for PyObjC. The major difference is slightly stricter type checking in some places, none of which should be on the fast path. That said, I've never paid much attention to speed because the bridge was fast enough for what I do with it.

ronaldoussoren commented 3 years ago

The current simple changes to the method stub increase performance in the micro benchmark for calling methods from ObjC by about 8% compared to 7.3 (on my M1 laptop with Python 3.9).

ronaldoussoren commented 3 years ago

I've started merging work into the repository, the current improvement is shown below (python 3.10, x86_64 VM running Big Sur). The VM is a fairly noisy environment, but this looks promising.

Merged are:

vectorcall support in various callables
using vectorcall to call from C to Python
"simple" vectorcall variant
cache the ffi_cif structure between calls for "simple" calls
use -flto when building

test name                                | 7.3              | 8.0b1           
-----------------------------------------+------------------+------------------
object description lookup                | 0.040            | 0.038 (-5.0%)   
NSObject description lookup              | 0.198            | 0.185 (-6.6%)   
NSArray description lookup               | 0.873            | 0.879 (+0.7%)   
object description bound call            | 0.302            | 0.304 (+0.7%)   
NSObject description bound call          | 2.097            | 1.676 (-20.1%)  
NSArray description bound call           | 1.504            | 1.223 (-18.7%)  
object description unbound call          | 0.326            | 0.332 (+1.8%)   
NSObject description unbound call        | 2.523            | 2.160 (-14.4%)  
NSArray description unbound call         | 2.546            | 2.436 (-4.3%)   
object description IMP call              | 0.322            | 0.313 (-2.8%)   
NSObject description IMP call            | 1.660            | 1.581 (-4.8%)   
NSArray description call                 | 1.070            | 1.051 (-1.8%)   
python function call                     | 0.052            | 0.053 (+1.9%)   
objc function call                       | 0.289            | 0.155 (-46.4%)  
call no-args from objc                   | 9.275            | 8.048 (-13.2%)
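The percentages in the table are the relative change from the 7.3 column to the 8.0b1 column. A short sketch that recomputes a few rows (the `percent_change` helper is just for illustration; the timings are copied from the table):

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, in percent (negative means faster)."""
    return (new - old) / old * 100.0


rows = {
    "NSObject description bound call": (2.097, 1.676),
    "objc function call": (0.289, 0.155),
    "call no-args from objc": (9.275, 8.048),
}
for name, (old, new) in rows.items():
    print(f"{name:<40} | {percent_change(old, new):+.1f}%")
```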

ronaldoussoren commented 3 years ago

I've looked into PGO for a couple of minutes, but need to find a tutorial for that. My first attempt resulted in a failed build using the profile data because the profile data was claimed to be out-of-date.

ronaldoussoren commented 3 years ago

One thing I want to look into soonish is the code that walks the MRO looking for methods. I have two, currently disabled, options in the PyObjC code that should speed things up significantly, but at the cost that the __dict__ of the Python proxy classes no longer matches the equivalent structure in the ObjC runtime.

I need to determine if that affects correctness of Python code that doesn't introspect __dict__ before I enable these options, in particular when ObjC categories are involved and/or class swizzling (as used by observations).

The goal is to minimise the amount of times that the code has to look at the objc runtime after an initial lookup.

The results below show that enabling these options is worth it performance wise, although there's still a pretty large difference with looking up names in regular python classes.

FAST_BUT_INEXACT:

test name                                | 7.3              | 8.0b1           
-----------------------------------------+------------------+------------------
object description lookup                | 0.039            | 0.038 (-2.6%)   
NSObject description lookup              | 0.194            | 0.187 (-3.6%)   
NSArray description lookup               | 0.867            | 0.197 (-77.3%)  
object description unbound call          | 0.323            | 0.326 (+0.9%)   
NSObject description unbound call        | 2.525            | 2.139 (-15.3%)  
NSArray description unbound call         | 2.675            | 1.531 (-42.8%)

LOOKUP_CACHE:

test name                                | 7.3              | 8.0b1           
-----------------------------------------+------------------+------------------
object description lookup                | 0.039            | 0.041 (+5.1%)   
NSObject description lookup              | 0.194            | 0.188 (-3.1%)   
NSArray description lookup               | 0.867            | 0.180 (-79.2%)  
object description unbound call          | 0.323            | 0.329 (+1.9%)   
NSObject description unbound call        | 2.525            | 2.101 (-16.8%)  
NSArray description unbound call         | 2.675            | 1.574 (-41.2%)

Both options:

test name                                | 7.3              | 8.0b1           
-----------------------------------------+------------------+------------------
object description lookup                | 0.039            | 0.039           
NSObject description lookup              | 0.194            | 0.181 (-6.7%)   
NSArray description lookup               | 0.867            | 0.176 (-79.7%)  
object description unbound call          | 0.323            | 0.324 (+0.3%)   
NSObject description unbound call        | 2.525            | 2.287 (-9.4%)   
NSArray description unbound call         | 2.675            | 1.679 (-37.2%)

ronaldoussoren commented 3 years ago

One thing to look into for speeding up calling from ObjC to Python: imp_implementationWithBlock. This function is available from macOS 10.7, which means it can be used without compile-time or runtime guards.

This could be used to create IMP's bound to a Python function without going through libffi. This can be used for a number of common method signatures to remove the overhead of libffi (assuming this function is more efficient).

ronaldoussoren commented 3 years ago

The performance difference between the "classic" call_from_objc API (using libFFI) and the alternative using imp_implementationWithBlock is not yet clear: the latter appears to be slightly faster, but the two are very close.

The new API has the advantage of not requiring changes to the framework bindings. In the old API the method implementation has no access to the PyObjCMethodSignature for the selector, and that needs to change to be able to handle APIs returning an "id" correctly (due to "already_retained" and "already_cfretained"). Fixing that is possible, but so far it doesn't seem worthwhile to make that change.

ronaldoussoren / pyobjc

Optimise calling machinery #359