Open EricCousineau-TRI opened 3 years ago
This code was used for Google-internal micro-benchmarking: https://drive.google.com/file/d/1EOGU_A28oBvzoLwdmo2RpImjyL6bv2rl/view?usp=sharing The link is accessible only to select people. pybind11 maintainers have access already. The original author is @kkimdev. He generously gave us permission to reuse what's useful for our purposes (under the OSS pybind11 org). I think we will need major adaptations for the code to work outside the Google environment.
Here is a doc with background and results of the micro-benchmarking: https://docs.google.com/document/d/1UieJ9WZ9YVJLt_EsIYj4Ahw8OIb3fH6mOIAPQRzj46k/ The link is accessible only to the same group of people. The doc is meant to inform our work on benchmarks. Please do not make copies. As before, the original author is @kkimdev.
A benchmarking tip: I think combining a C++ sampling profiler with the Python benchmark is extremely useful. My latest internal benchmark script (not the above one) supports a --run_pprof=True flag that runs the pprof sampling profiler and reports the results (though I'm not sure the open source version is as good as the internal one, and there are probably other good open source C++ sampling profilers as well). The flame graphs in my doc come from that.
This is a bit annoying if external contributors would feel a calling to help out.
Future people without access, contributions and ideas are still welcome; sharing that code is up to the authors, but ideas can be discussed :-)
Yes, sorry, but getting clearance to make this info fully public is likely quite a bit of trouble. But we can add interested people, and we can summarize or report what we learn here.
No worries, @rwgk, I completely understand! Just trying to make sure that interested users/contributors (and maybe future maintainers ;-) ) aren't scared off :-)
Also looking at the following to gauge how to possibly benchmark the CPython / PyPy low-level bits, for stuff like #2050, in addition to Kibeom's suggestion of pprof:
Since cppyy was mentioned: I use pytest-benchmark as well (see: https://bitbucket.org/wlav/cppyy/src/master/bench/). It's hard to write good benchmarks, though, as features and defaults differ. For example, releasing the GIL by default is costly for micro-benches (and for large, costly, C++ functions, the bindings overhead doesn't matter). Another big expense is object tracking for identity matching on returns, which not all binders do (and is useless for micro-benches).
For real high performance, the processor matters as well. For example, PyPy has guards on object types in its traces, based on which a specific C++ overload selected by cppyy will be compiled in. On a processor with good branch prediction and a deep out-of-order execution queue, that overhead will not show up in wall clock time (assuming no hyper-threading, of course), but it will be measurable on a processor with simpler cores.
When sticking to CPython only, consider also that CFunction objects have seen a massive amount of support in the form of specialized tracks through the CPython interpreter since release 3. (This is what makes SWIG in "builtin" mode, not the default, absolutely smoke everything else.) Only since 3.8 have closures seen some love, with the API stabilized in 3.9. There's a 30% or so reduction in call overhead in there somehow (for cppyy), but it's proving to be quite a lot of work to implement.
That last point, CPython internal developments and the need to track/make use of them, is also why I'd be interested if the proposed benchmarks end up being made public. The only way of measuring the usefulness of such changes is by having a historic record to compare against and setting that up is quite some effort (esp. when switching development machines regularly).
See also: Benchmarking by @sloretz for usability, basic RAM analysis, etc., comparing popular binding options (boost.python, cffi, cppyy, raw CPython, pybind11, SIP, SWIG) for the rclpy library: https://github.com/ros2/rclpy/issues/665 https://docs.google.com/document/d/1BsTlxgzAkpX-5Q7FW6_Hyw1fszDEQef_OEvYmdnrBrE/edit#
Wim, not sure if this is what you meant by improvements in 3.8, but Shane also pointed to possible optimizations via PEP 590: Vectorcall (I dunno if pybind11 already leverages this directly but I see it crop up in this search: https://github.com/pybind/pybind11/issues?q=vectorcall+)
Yes, I was referring to vectorcall. The problem with its implementation is that it creates several distinct paths, but then squeezes them through a single call interface. The upshot is very branchy code, more memory allocations to track, etc. Not the end of the world, but in the particular case of cppyy, certain code paths such as the handling of self were done rather differently from what the PEP 590 folks had in mind, so that bolting on vectorcalls as-is leads to unmaintainable code. Hence, I'm rewriting the calling code wholesale to fit vectorcalls better (simplifying memory tracking etc.), and this is proving quite a bit of work, but well worth it (I already have a 20% reduction in call overhead, so I'm confident I'll hit that 30% when all is said and done).
Not sure about pybind11, but with its greater call overhead, the benefits will be far less, and there is the additional problem of requiring an extra pointer data member per bound object, so vectorcalls do come with a clear memory cost that can only be worth it if the CPU gains are big enough. But then, if the C++ side does substantial work in each call, the call overhead matters little to nothing (see e.g. the benchmarking text you quote; myself, I benchmark mostly with empty functions), thus leaving that extra pointer as pure waste.
(Aside, the way I currently see it, storing the dispatch pointer as data member in the proxy is something that's only useful for C, not for C++.)
From @wjakob: use perf to count instructions.
I'd be inclined to disagree (we were working with Levinthal back in the day when perf was being developed). The problem with Python code is that it is very branchy and consists mostly of pointer-chasing, so instructions (even when restricted to retired ones) give you a very muddy picture, as there is lots of out-of-order execution and (micro-)benchmarks will have branching predictability that can differ significantly from real applications.
FYI @EricCousineau-TRI. I've spent some time pondering on this, assuming that the answer to
Should (pybind11) include more comprehensive benchmarking (primarily performance)
is yes. There are many tools out there to write benchmarks, to profile builds, to profile execution, and to visualize results. What I think we are lacking is a definition of performance impact, though I don't know what conclusions the TensorFlow team has arrived at.
However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution
I'll focus on evaluating the impact of a code change, where impact is defined as a sequence of predefined metrics. To judge that impact or weigh it against the (sometimes subjective) value of a feature / issue resolution is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making (barring scalarization i.e. sum-of-products with suitable weights).
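For concreteness, here's a toy sketch of that scalarization; the metric names and weights below are made up purely for illustration:

# Toy example of scalarization: collapse dissimilar metrics into a single
# score via a weighted sum. Metric names and weights are arbitrary.
weights = {"call_overhead_ns": 1.0, "peak_rss_kb": 0.01, "build_time_s": 100.0}

def scalarize(metrics, weights=weights):
    # metrics: dict mapping metric name -> measured value
    return sum(weights[name] * value for name, value in metrics.items())

print(scalarize({"call_overhead_ns": 120.0, "peak_rss_kb": 2048.0, "build_time_s": 3.5}))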
I'll further assume that new dimensions or axes such as:
CPython versions, pybind11 versions / PR branches / forks
as well as different operating systems and processor architectures can always be added by repeatedly evaluating the same metrics for different build-time and/or run-time configurations.
OS: For now, just Linux
That's fair, but I'll keep some (prudential?) bias towards cross-platform solutions.
To the best of my knowledge, it is generally not possible to fully characterize a piece of software in isolation (e.g. to have a performance baseline that's decoupled from downstream use cases and the infrastructure it runs on). @wlav brought up several good points regarding the difficulties of writing generally applicable benchmarks. Thus, it is perhaps easier to define metrics as a measurement and an associated experiment to perform that measurement.
A few good measurements were suggested above. Improving on that list, I think that the following measurements paint a comprehensive picture (though I'm sure I'm missing a few):
- Build time and artifact size, as benchmark.py already does, or as can be provided by the compiler (see the gcc -ftime-report flag, or the clang 12 -fproc-stat-report flag).
- Peak build RSS (see the gcc -fmem-report flag).
- Optimizations applied (see the gcc -fopt-info flag).

To perform those measurements, I think that, to begin with, we can re-purpose a subset of pybind11's test suite as micro-benchmarks (or write new ones) that exercise the features that we believe are in the "hot path", but eventually we'll need real world examples as well (e.g. a multi-body simulation using pydrake, multi-process high-bandwidth communication using rclpy, a tensorflow classifier, etc.). This is in line with pyperformance, which includes benchmarks for non-core libraries. Results will be scattered but (hopefully more) comprehensive.
As for the tooling, I think pytest-benchmark should be the benchmark runner, hands down. It makes performance CI checks straightforward. But it'll need extension to perform all of the aforementioned measurements. For that, we can couple it with the timemory toolkit by re-purposing benchmarks' timer functions and using their extra_info attribute.
This toolkit is fairly new, but it offers cross-platform CMake macros as well as C++ and Python APIs to build-time and run-time instrumentation, hardware counters, vendor specific profilers, and more. It has two downsides though: it uses pybind11 itself, which means we have to build it from source and/or mask the pybind11 symbols for non-ABI-compliant versions, and it lacks a cohesive cross-language call graph (see py-spy, though arguably this can be contributed to timemory, and it is relevant for manual performance analysis only).
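As a rough sketch of how such a benchmark could attach extra measurements through extra_info (example_module and make_widget below are hypothetical stand-ins for a pybind11 extension under test, and the recorded value is just a placeholder):

# Minimal pytest-benchmark sketch; requires the pytest-benchmark plugin, which
# provides the "benchmark" fixture. example_module.make_widget is hypothetical.
import example_module

def test_make_widget(benchmark):
    # benchmark() calls the function repeatedly and records timing statistics.
    result = benchmark(example_module.make_widget, 42)
    # Arbitrary extra measurements (e.g. gathered via timemory) can be stored
    # alongside the timings in the benchmark's extra_info dict.
    benchmark.extra_info["peak_rss_kb"] = 0  # placeholder for a real measurement
    assert result is not None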
Thoughts? I haven't had time to check https://bitbucket.org/wlav/cppyy/src/master/bench/, but I will. I'd also appreciate @kkimdev's insights.
Sounds great! Some minor questions:
Peak build RSS
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have good ref. for me to understand this?
As for the tooling, [...] It makes performance CI checks straightforward. [...]
How would you imagine CI checks being performed? If they are costly (e.g. >=15min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and fine-grained inspection can then be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data.
Thoughts? (Is that already a feature of pytest-benchmark or timemory?)
This toolkit is fairly new, [...]
I briefly perused docs, but not sure if I saw concrete recommendation on data aggregation / metrics visualization. I saw roofline and "data tracking" components, but not sure if it's for weeks-long data aggregation.
Do you have thoughts on how to review these changes? (one option is to use external free service; heck, even something like TensorBoard or https://wandb.ai/ could be used to track metrics flexibly w/ sufficient traceability)
[From the timemory page] Adheres to C++ standard template library paradigm of "you don't pay for what you don't use"
Hehe, mayhaps "you shouldn't pay for what you don't use" is a more apt description in most cases :P
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have good ref. for me to understand this?
In all cases, RSS stands for resident set size. During builds, perhaps inspecting the total virtual memory requirements may be revealing too.
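If it helps, here's a minimal sketch of one way to capture peak build RSS on Linux, by running the build step as a child process; the compile command is only an example:

# Runs a build command as a child process and reports the peak RSS of
# terminated, waited-for children. On Linux, ru_maxrss is in kilobytes.
import resource
import subprocess

def peak_child_rss_kb(cmd):
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

if __name__ == "__main__":
    # Example compile invocation; substitute the actual build command.
    print(peak_child_rss_kb(["c++", "-O2", "-c", "example.cpp", "-o", "example.o"]))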
How would you imagine CI checks being performed? If they are costly (e.g. >=15min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and fine-grained inspection can then be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data.
Agreed. I did a poor job above separating metrics by granularity (and focused on their feasibility instead). I definitely expect most automated benchmarks to perform coarse measurements and be run as such. A CI check for a PR'd patch simply amounts to a faster regression check, and for that we need simple yet representative measurements for each metric (e.g. benchmarks for the hottest or least cache friendly execution paths). It will require empirical evidence, and a process to update these measurements as the system evolves, so not the first thing in the queue.
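As a rough illustration of such a check (this assumes results are exported with pytest-benchmark's --benchmark-json option, and that the export contains a top-level "benchmarks" list with per-benchmark "stats"; the tolerance is arbitrary):

# Coarse regression gate: compare mean times between two pytest-benchmark
# JSON exports and fail if any benchmark slowed down beyond the tolerance.
import json
import sys

def load_means(path):
    with open(path) as f:
        data = json.load(f)
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}

def check(baseline_path, current_path, tolerance=1.10):
    baseline = load_means(baseline_path)
    current = load_means(current_path)
    ok = True
    for name, mean in current.items():
        if name in baseline and mean > baseline[name] * tolerance:
            print(f"regression in {name}: {baseline[name]:.6f}s -> {mean:.6f}s")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)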
Thoughts? (Is that already a feature of pytest-benchmark or timemory?)
No, and no. timemory provides the tooling, pytest-benchmark allows benchmarks to join the test suite (by comparing against given previous benchmark results for a single metric), but neither addresses nor advises on metric granularity or traceability.
I briefly perused docs, but not sure if I saw concrete recommendation on data aggregation / metrics visualization. I saw roofline and "data tracking" components, but not sure if it's for weeks-long data aggregation.
timemory outputs hierarchical data structures that can be pulled up using pandas dataframes and analyzed using hatchet. There's support for basic visual representations, like flamegraphs, but otherwise it's on the user to operate on data and draw conclusions.
Do you have thoughts on how to review these changes? (one option is to use external free service; heck, even something like TensorBoard or https://wandb.ai/ could be used to track metrics flexibly w/ sufficient traceability)
+1 to an external service. Pandas is quite popular; it shouldn't be hard to find a good, cheap service that can process its dataframes. Commit hashes can be used for tagging. Still, that doesn't address how to review changes in performance. And neither do I :sweat_smile:. Quoting myself:
I'll focus on evaluating the impact of a code change, [...]. To judge that impact or weigh it against the [...] value of a feature / issue resolution is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making [...].
I don't know, TBH. We need data. I guess it's likely we'll end up focusing on trends in global metrics (like total execution time per benchmark and per suite) as opposed to fluctuations in local metrics (like cache misses in a function). Or maybe not. Maybe a few local metrics for a handful of functions shape everything else and we ought to be paying attention to them instead.
timemory and roofline are actually local products. :) And roofline isn't a useful metric for binders: they explicitly sit in between two separate memory allocator runtimes, so most of their work consists of pointer chasing, and even though roofline will chastise them for that, it can't be helped.
As to @hidmic's last point, yes, there are a series of "obvious" functions that matter to almost everyone (resolve an overload, access an int, etc.), but then there is a range of specific use cases that can make a big performance difference for specific projects, but fail to run optimally b/c either these haven't been identified yet as candidates for an optimized code path, or some confluence of features happens to interact in a sub-optimal way.
Identifying those cases is the hard bit, not writing the optimized code paths, which is why a benchmark suite that is run over multiple binders would be useful for me, even as the original post has "other binders" only as a stretch goal: if one binder is faster than another on some benchmark, it is an obvious indicator that some optimization is missed in the slower binder and such information is a lot easier to act on than trying to optimize a given benchmark where it's not known a priori whether (further) optimization is even possible.
Aside, PyPy development often progresses like that, too: someone compares pypy-c performance to python performance for their specific code and finds it lacking. They report it (hopefully!), code gets analyzed and the JIT grows a new optimization path. Here's a recent example: https://www.nature.com/articles/s41550-021-01342-y
Just an update on my comment from Jan 22 above: cppyy now supports vectorcall (in the repo; not yet released). Yes, it can achieve 35% performance improvements in some cases, but that isn't true across the board, with SWIG with -builtin still outperforming in several, in particular simpler, cases (default SWIG is way slower; -builtin creates less functional bindings, meaning it is not for everyone), so there is still more detailing (and refactoring) to do. Hopefully that will go faster than getting the initial vectorcall support in; the changes have been very invasive, and especially memory management of call arguments became a lot more difficult, but I now have several helpers that work across Python versions to make it manageable.
Better late than never ... here are results from some benchmarking I did back in February 2021:
I was comparing performance of different holder types at the time (not too much to report), but was surprised by this finding (copied from the slide):
a significant amount of time is spent in get_type_info.

I investigated why pybind11 was slow to e.g. create objects, or cast objects to the C++ type. Long story short: by skipping complex logic and doing just what's needed, we can get significantly faster.
To know whether it was the detail::instance memory layout or the actual code, I implemented templated functions that do just what is needed. For example, to get back the PyTypeObject that has been associated with a type T, there is no need to do a runtime lookup every time; we can do it once and store it in a static value:
template <typename T>
bool fast_type_check(handle h) {
  static PyTypeObject* expected_type = []() {
    const std::type_info& ti = typeid(T);
    detail::type_info* internal_ti =
        detail::get_global_type_info(std::type_index(ti));
    return internal_ti->type;
  }();
  // Exact type match only; subclasses (e.g. Python-derived types) would need
  // PyObject_TypeCheck instead.
  return Py_TYPE(h.ptr()) == expected_type;
}
The cast is then:
template <typename T>
T* reinterpret_cast_unchecked(handle h) {
return reinterpret_cast<T*&>(
reinterpret_cast<detail::instance*>(h.ptr())->simple_value_holder[0]);
}
NOTE: Of course, this only works for a C++ type wrapped, without multiple inheritance, or inheritance of a C++ type from Python, etc. But it's reasonable to expect this use-case to be fast, and more complex features to not always be executed when not needed.
Generally speaking, whether we can perform many of these lookups only once depends on how much dynamic stuff is done by the user. I expect types to be created at start-up and, if they are created later, not to be used until they are created, and not to be destroyed. In that case, doing these retrievals only once makes sense. I think it should be the default, and, if one really wants fully dynamic type casters that always perform a lookup, they should build their own (or we template it with a true/false flag).
We can get significantly faster, close to the raw C API performance, by only doing what is needed and nothing more.
Creating 1000000 objects from Python
  Raw C API took 0.25s
  Pybind11 FAST took 0.24s
  Pybind11 took 0.40s

Accessing the C++ object from the Python wrapped object
  Raw C API cast took 0.14s
  Pybind11 FAST took 0.14s
  Pybind11 took 0.19s

Accessing many C++ objects from the Python wrapped objects
  Raw C API cast took 1.14s
  Pybind11 FAST took 1.31s
  Pybind11 took 6.37s

Creating many C++ objects and returning them to Python
  Raw C API cast took 1.55s
  Pybind11 FAST took 1.27s
  Pybind11 took 4.38s
I am comparing directly storing a struct with storing a unique_ptr of the struct, but it's easy to adapt.
So I guess we could rewrite, or add new, type casters / templated functions which are simpler.
As that document points to the dispatcher loop as the problem (which is also one of the slowest parts in SWIG, so it's not necessarily completely solvable with code generation): that's a perfect place for memoization based on the argument types. In CPython, cppyy does this explicitly, in PyPy, this is implicitly done by the JIT (types are guarded at the beginning of a JITed trace). If the type conversion of arguments of the memoized call fails, the run-time falls back on the normal, slower, overload resolution.
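To sketch the idea only (a small Python illustration, not cppyy's actual implementation): overload selection is cached on the tuple of argument types, with a fallback to the full resolution when the cached overload rejects the arguments.

# Illustrative sketch of memoized overload dispatch, keyed on argument types.
def make_memoized_dispatcher(overloads):
    cache = {}  # tuple of argument types -> overload that matched last time

    def dispatch(*args):
        key = tuple(type(a) for a in args)
        cached = cache.get(key)
        if cached is not None:
            try:
                return cached(*args)
            except TypeError:
                pass  # argument conversion failed; fall back to full resolution
        for overload in overloads:  # slow path: try each overload in order
            try:
                result = overload(*args)
            except TypeError:
                continue
            cache[key] = overload  # memoize the overload that worked
            return result
        raise TypeError("no matching overload")

    return dispatch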
NOTE: Of course, this only works for a C++ type wrapped, without multiple inheritance, or inheritance of a C++ type from Python, etc.
Yes; and these are still static properties. In cppyy, a flag is set for these cases and an optimized path is chosen, where possible, at run-time based on that flag. (The equivalent of the code above in cppyy is skipping the offset calculation.)
Just wondering what it would take to push some of these performance improvements. There are some changes which I think are beneficial, but they break the API (so either we add new methods and deprecate the current ones, which is ugly, or we increment the major version).
For example, making pybind11::isinstance not fetch the type on every call speeds up this method a lot.
template <typename T, detail::enable_if_t<!std::is_base_of<object, T>::value, int> = 0>
bool fast_isinstance(handle obj) {
static handle type = []() {
handle type = detail::get_type_handle(typeid(T), false);
return type;
}();
return isinstance(obj, type);
}
instead of the current
template <typename T, detail::enable_if_t<!std::is_base_of<object, T>::value, int> = 0>
bool isinstance(handle obj) { return detail::isinstance_generic(obj, typeid(T)); }
However, it assumes that the user won't dynamically add the type after this is called, as it would invalidate the cache.
So how does the governance work for such breaking changes? I am wondering if there is a way to get them in, e.g. writing an email with an exact list of proposals and benchmarks showing the improvements, but I am not sure there is a formal process for getting the owners to decide on these breaking changes. Can someone share how pybind11 deals with such breaking-change proposals?
Motivation
At present (b7dfe5cc), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments, simple inheritance, but that's about it, I think); the results can be seen here: https://pybind11.readthedocs.io/en/stable/benchmark.html and https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst

However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:

- #1333

Secondary:

- #2322
- #1227 / #1825
- #376
Fuzzy Scoping + Steps
- Benchmark pybind11 itself (out of scope: other binding approaches)
- Cover start-up costs (dlopen-ish stuff, pybind11 internals start-up, binding registration, ...)
- Decide which metrics pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)

Given that performance benchmarks can be a P.I.T.A. (e.g. how to handle OS interrupts, hardware capacity / abstractions, blah blah), ideally decisions should be made about relative performance on the same machine. Ideally, we should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
Suggested Solution Artifacts
- github.com/pybind/pybind-benchmarks?
- pytest-benchmark
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l Can I ask what y'all think? Is this redundant w.r.t. what we already have?