Open EricCousineau-TRI opened 3 years ago
This code was used for Google-internal micro-benchmarking: https://drive.google.com/file/d/1EOGU_A28oBvzoLwdmo2RpImjyL6bv2rl/view?usp=sharing The link is accessible only to select people. pybind11 maintainers have access already. The original author is @kkimdev. He generously gave us permission to reuse what's useful for our purposes (under the OSS pybind11 org). I think we will need major adaptations for the code to work outside the Google environment.
Here is a doc with background and results of the micro-benchmarking: https://docs.google.com/document/d/1UieJ9WZ9YVJLt_EsIYj4Ahw8OIb3fH6mOIAPQRzj46k/ The link is accessible only to the same group of people. The doc is meant to inform our work on benchmarks. Please do not make copies. As before, the original author is @kkimdev.
A benchmarking tip: I think combining a C++ sampling profiler with the Python benchmark is extremely useful. My latest internal benchmark script (not the above one) supports a --run_pprof=True flag that runs the pprof sampling profiler and reports the results (though I'm not sure the open source version is as good as the internal one, and there are probably other good open source C++ sampling profilers as well). The flame graphs in my doc come from that.
This is a bit annoying if external contributors would feel a calling to help out.
Future people without access, contributions and ideas are still welcome; sharing that code is up to the authors, but ideas can be discussed :-)
Yes, sorry, but getting clearance to make this info fully public is likely quite a bit of trouble. But we can add interested people, and we can summarize or report what we learn here.
No worries, @rwgk, I completely understand! Just trying to make sure that interested users/contributors (and maybe future maintainers ;-) ) aren't scared off :-)
Also looking at the following to gauge how to possibly benchmark the CPython / PyPy low-level bits, for stuff like #2050, in addition to Kibeom's suggestion of pprof:
Since cppyy was mentioned: I use pytest-benchmark as well (see: https://bitbucket.org/wlav/cppyy/src/master/bench/). It's hard to write good benchmarks, though, as features and defaults differ. For example, releasing the GIL by default is costly for micro-benches (and for large, costly, C++ functions, the bindings overhead doesn't matter). Another big expense is object tracking for identity matching on returns, which not all binders do (and is useless for micro-benches).
For real high performance, the processor matters as well. For example, PyPy has guards on object types in its traces, based on which a specific C++ overload selected by cppyy will be compiled in. On a processor with good branch prediction and a deep out-of-order execution queue, that overhead will not show up in wall clock time (assuming no hyper-threading, of course), but it will be measurable on a processor with simpler cores.
When sticking to CPython only, consider also that CFunction objects have seen a massive amount of support in the form of specialized tracks through the CPython interpreter since release 3. (This is what makes SWIG in "builtin" mode, not the default, absolutely smoke everything else.) Only since 3.8 have closures seen some love, with the API stabilized in 3.9. There's a 30% or so reduction in call overhead in there somehow (for cppyy), but it's proving to be quite a lot of work to implement.
That last point, CPython internal developments and the need to track/make use of them, is also why I'd be interested if the proposed benchmarks end up being made public. The only way of measuring the usefulness of such changes is by having a historic record to compare against and setting that up is quite some effort (esp. when switching development machines regularly).
See also: Benchmarking by @sloretz for usability, basic RAM analysis, etc., comparing popular binding options (boost.python, cffi, cppyy, raw CPython, pybind11, SIP, SWIG) for the rclpy library: https://github.com/ros2/rclpy/issues/665 https://docs.google.com/document/d/1BsTlxgzAkpX-5Q7FW6_Hyw1fszDEQef_OEvYmdnrBrE/edit#
Wim, not sure if this is what you meant by improvements in 3.8, but Shane also pointed to possible optimizations via PEP 590: Vectorcall (I dunno if pybind11 already leverages this directly but I see it crop up in this search: https://github.com/pybind/pybind11/issues?q=vectorcall+)
Yes, I was referring to vectorcall. The problem with its implementation is that it creates several distinct paths, but then squeezes them through a single call interface. The upshot is very branchy code, more memory allocations to track, etc. Not the end of the world, but in the particular case of cppyy, certain code paths such as the handling of self were done rather differently from what the PEP 590 folks had in mind, so that bolting on vectorcalls as-is leads to unmaintainable code. Hence, I'm rewriting the calling code wholesale to fit vectorcalls better (simplifying memory tracking etc.), and this is proving quite a bit of work, but well worth it (I already have a 20% reduction in call overhead, so I'm confident I'll hit that 30% when all is said and done).
Not sure about pybind11, but with its greater call overhead, the benefits will be far less, and there is the additional problem of requiring an extra pointer data member per bound object, so vectorcalls do come with a clear memory cost that can only be worth it if the CPU gains are big enough. But then, if the C++ side does substantial work in each call, the call overhead matters little to nothing (see e.g. the benchmarking text you quote; myself, I benchmark mostly with empty functions), thus leaving that extra pointer as pure waste.
(Aside, the way I currently see it, storing the dispatch pointer as data member in the proxy is something that's only useful for C, not for C++.)
From @wjakob: use perf to count instructions.
I'd be inclined to disagree (we were working with Levinthal back in the day when perf was being developed). The problem with Python code is that it is very branchy and consists mostly of pointer-chasing, so instructions (even when restricted to retired ones) give you a very muddy picture, as there is lots of out-of-order execution and (micro-)benchmarks will have branching predictability that can differ significantly from real applications.
FYI @EricCousineau-TRI. I've spent some time pondering on this, assuming that the answer to
Should (pybind11) include more comprehensive benchmarking (primarily performance)
is yes. There are many tools out there to write benchmarks, to profile builds, to profile execution, and to visualize results. What I think we are lacking is a definition of performance impact, though I don't know what conclusions the TensorFlow team has arrived at.
However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution
I'll focus on evaluating the impact of a code change, where impact is defined as a sequence of predefined metrics. To judge that impact or weigh it against the (sometimes subjective) value of a feature / issue resolution is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making (barring scalarization i.e. sum-of-products with suitable weights).
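For concreteness, here's a toy sketch of that scalarization; the metric names and weights below are made up purely for illustration:

# Toy example of scalarization: collapse dissimilar metrics into a single
# score via a weighted sum. Metric names and weights are arbitrary.
weights = {"call_overhead_ns": 1.0, "peak_rss_kb": 0.01, "build_time_s": 100.0}

def scalarize(metrics, weights=weights):
    # metrics: dict mapping metric name -> measured value
    return sum(weights[name] * value for name, value in metrics.items())

print(scalarize({"call_overhead_ns": 120.0, "peak_rss_kb": 2048.0, "build_time_s": 3.5}))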
I'll further assume that new dimensions or axes such as:
CPython versions, pybind11 versions / PR branches / forks
as well as different operating systems and processor architectures can always be added by repeatedly evaluating the same metrics for different build-time and/or run-time configurations.
OS: For now, just Linux
That's fair, but I'll keep some (prudential?) bias towards cross-platform solutions.
To the best of my knowledge, it is generally not possible to fully characterize a piece of software in isolation (e.g. to have a performance baseline that's decoupled from downstream use cases and the infrastructure it runs on). @wlav brought up several good points regarding the difficulties of writing generally applicable benchmarks. Thus, it is perhaps easier to define metrics as a measurement and an associated experiment to perform that measurement.
A few good measurements were suggested above. Improving on that list, I think that the following measurements paint a comprehensive picture (though I'm sure I'm missing a few):
- Build time and artifact size, as benchmark.py already does, or as can be provided by the compiler (see the gcc -ftime-report flag, or the clang 12 -fproc-stat-report flag).
- Peak build RSS (see the gcc -fmem-report flag).
- Optimizations applied (see the gcc -fopt-info flag).

To perform those measurements, I think that, to begin with, we can re-purpose a subset of pybind11's test suite as micro-benchmarks (or write new ones) that exercise the features that we believe are in the "hot path", but eventually we'll need real world examples as well (e.g. a multi-body simulation using pydrake, multi-process high-bandwidth communication using rclpy, a tensorflow classifier, etc.). This is in line with pyperformance, which includes benchmarks for non-core libraries. Results will be scattered but (hopefully more) comprehensive.
As for the tooling, I think pytest-benchmark should be the benchmark runner, hands down. It makes performance CI checks straightforward. But it'll need extension to perform all of the aforementioned measurements. For that, we can couple it with the timemory toolkit by re-purposing benchmarks' timer functions and using their extra_info attribute.
This toolkit is fairly new, but it offers cross-platform CMake macros as well as C++ and Python APIs to build-time and run-time instrumentation, hardware counters, vendor specific profilers, and more. It has two downsides though: it uses pybind11 itself, which means we have to build it from source and/or mask the pybind11 symbols for non-ABI-compliant versions, and it lacks a cohesive cross-language call graph (see py-spy, though arguably this can be contributed to timemory, and it is relevant for manual performance analysis only).
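As a rough sketch of how such a benchmark could attach extra measurements through extra_info (example_module and make_widget below are hypothetical stand-ins for a pybind11 extension under test, and the recorded value is just a placeholder):

# Minimal pytest-benchmark sketch; requires the pytest-benchmark plugin, which
# provides the "benchmark" fixture. example_module.make_widget is hypothetical.
import example_module

def test_make_widget(benchmark):
    # benchmark() calls the function repeatedly and records timing statistics.
    result = benchmark(example_module.make_widget, 42)
    # Arbitrary extra measurements (e.g. gathered via timemory) can be stored
    # alongside the timings in the benchmark's extra_info dict.
    benchmark.extra_info["peak_rss_kb"] = 0  # placeholder for a real measurement
    assert result is not None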
Thoughts? I haven't had time to check https://bitbucket.org/wlav/cppyy/src/master/bench/, but I will. I'd also appreciate @kkimdev's insights.
Sounds great! Some minor questions:
Peak build RSS
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have good ref. for me to understand this?
As for the tooling, [...] It makes performance CI checks straightforward. [...]
How would you imagine CI checks being performed? If they are costly (e.g. >=15min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and fine-grained inspection can then be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data.
Thoughts? (Is that already a feature of pytest-benchmark or timemory?)
This toolkit is fairly new, [...]
I briefly perused docs, but not sure if I saw concrete recommendation on data aggregation / metrics visualization. I saw roofline and "data tracking" components, but not sure if it's for weeks-long data aggregation.
Do you have thoughts on how to review these changes? (one option is to use external free service; heck, even something like TensorBoard or https://wandb.ai/ could be used to track metrics flexibly w/ sufficient traceability)
[From the timemory page] Adheres to C++ standard template library paradigm of "you don't pay for what you don't use"
Hehe, mayhaps "you shouldn't pay for what you don't use" is a more apt description in most cases :P
My naive Google/DDG-fu couldn't resolve my understanding of this term. Do you have good ref. for me to understand this?
In all cases, RSS stands for resident set size. During builds, perhaps inspecting the total virtual memory requirements may be revealing too.
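If it helps, here's a minimal sketch of one way to capture peak build RSS on Linux, by running the build step as a child process; the compile command is only an example:

# Runs a build command as a child process and reports the peak RSS of
# terminated, waited-for children. On Linux, ru_maxrss is in kilobytes.
import resource
import subprocess

def peak_child_rss_kb(cmd):
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

if __name__ == "__main__":
    # Example compile invocation; substitute the actual build command.
    print(peak_child_rss_kb(["c++", "-O2", "-c", "example.cpp", "-o", "example.o"]))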
How would you imagine CI checks being performed? If they are costly (e.g. >=15min?), my suggestion is that they be relegated to "on-demand" builds and semi-infrequent measures (weekly). From there, metrics at a coarse level can be tracked, and fine-grained inspection can then be done by bisecting across a specific increase, rather than indexing a "huge" corpus of data.
Agreed. I did a poor job above separating metrics by granularity (and focused on their feasibility instead). I definitely expect most automated benchmarks to perform coarse measurements and be run as such. A CI check for a PR'd patch simply amounts to a faster regression check, and for that we need simple yet representative measurements for each metric (e.g. benchmarks for the hottest or least cache friendly execution paths). It will require empirical evidence, and a process to update these measurements as the system evolves, so not the first thing in the queue.
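As a rough illustration of such a check (this assumes results are exported with pytest-benchmark's --benchmark-json option, and that the export contains a top-level "benchmarks" list with per-benchmark "stats"; the tolerance is arbitrary):

# Coarse regression gate: compare mean times between two pytest-benchmark
# JSON exports and fail if any benchmark slowed down beyond the tolerance.
import json
import sys

def load_means(path):
    with open(path) as f:
        data = json.load(f)
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}

def check(baseline_path, current_path, tolerance=1.10):
    baseline = load_means(baseline_path)
    current = load_means(current_path)
    ok = True
    for name, mean in current.items():
        if name in baseline and mean > baseline[name] * tolerance:
            print(f"regression in {name}: {baseline[name]:.6f}s -> {mean:.6f}s")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)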
Thoughts? (Is that already a feature of pytest-benchmark or timemory?)
No, and no. timemory provides the tooling, pytest-benchmark allows benchmarks to join the test suite (by comparing against given previous benchmark results for a single metric), but neither addresses nor advises on metric granularity or traceability.
I briefly perused docs, but not sure if I saw concrete recommendation on data aggregation / metrics visualization. I saw roofline and "data tracking" components, but not sure if it's for weeks-long data aggregation.
timemory outputs hierarchical data structures that can be pulled up using pandas dataframes and analyzed using hatchet. There's support for basic visual representations, like flamegraphs, but otherwise it's on the user to operate on data and draw conclusions.
Do you have thoughts on how to review these changes? (one option is to use external free service; heck, even something like TensorBoard or https://wandb.ai/ could be used to track metrics flexibly w/ sufficient traceability)
+1 to an external service. Pandas is quite popular; it shouldn't be hard to find a good, cheap service that can process its dataframes. Commit hashes can be used for tagging. Still, that doesn't address how to review changes in performance. And neither do I :sweat_smile:. Quoting myself:
I'll focus on evaluating the impact of a code change, [...]. To judge that impact or weigh it against the [...] value of a feature / issue resolution is not straightforward in the general case. Even if the net effect of a change is observable, comparing dissimilar metrics requires extra decision making [...].
I don't know, TBH. We need data. I guess it's likely we'll end up focusing on trends in global metrics (like total execution time per benchmark and per suite) as opposed to fluctuations in local metrics (like cache misses in a function). Or maybe not. Maybe a few local metrics for a handful of functions shape everything else and we ought to be paying attention to them instead.
timemory and roofline are actually local products. :) And roofline isn't a useful metric for binders: they explicitly sit in between two separate memory allocator runtimes, so most of their work consists of pointer chasing, and even though roofline will chastise them for that, it can't be helped.
As to @hidmic's last point, yes, there are a series of "obvious" functions that matter to almost everyone (resolve an overload, access an int, etc.), but then there is a range of specific use cases that can make a big performance difference for specific projects, but fail to run optimally b/c either these haven't been identified yet as candidates for an optimized code path, or some confluence of features happens to interact in a sub-optimal way.
Identifying those cases is the hard bit, not writing the optimized code paths, which is why a benchmark suite that is run over multiple binders would be useful for me, even as the original post has "other binders" only as a stretch goal: if one binder is faster than another on some benchmark, it is an obvious indicator that some optimization is missed in the slower binder and such information is a lot easier to act on than trying to optimize a given benchmark where it's not known a priori whether (further) optimization is even possible.
Aside, PyPy development often progresses like that, too: someone compares pypy-c performance to python performance for their specific code and finds it lacking. They report it (hopefully!), code gets analyzed and the JIT grows a new optimization path. Here's a recent example: https://www.nature.com/articles/s41550-021-01342-y
Just an update on my comment from Jan 22 above: cppyy now supports vectorcall (in the repo; not yet released). Yes, it can achieve 35% performance improvements in some cases, but that isn't true across the board, with SWIG with -builtin still outperforming in several, in particular simpler, cases (default SWIG is way slower; -builtin creates less functional bindings, meaning it is not for everyone), so there is still more detailing (and refactoring) to do. Hopefully that will go faster than getting the initial vectorcall support in; the changes have been very invasive, and especially memory management of call arguments became a lot more difficult, but I now have several helpers that work across Python versions to make it manageable.
Better late than never ... here are results from some benchmarking I did back in February 2021:
I was comparing performance of different holder types at the time (not too much to report), but was surprised by this finding (copied from the slide):
a significant amount of time is spent in get_type_info.

I investigated why pybind11 was slow to e.g. create objects, or cast objects to the C++ type. Long story short: by skipping complex logic and doing just what's needed, we can get significantly faster.
To know whether it was the detail::instance memory layout or the actual code, I implemented templated functions that do just what is needed. For example, to get back the PyTypeObject that has been associated with a type T, there is no need to do a runtime lookup every time; we can do it once and store it in a static value:
template <typename T>
bool fast_type_check(handle h) {
  static PyTypeObject* expected_type = []() {
    const std::type_info& ti = typeid(T);
    detail::type_info* internal_ti =
        detail::get_global_type_info(std::type_index(ti));
    return internal_ti->type;
  }();
  // Exact type match only; subclasses (e.g. Python-derived types) would need
  // PyObject_TypeCheck instead.
  return Py_TYPE(h.ptr()) == expected_type;
}
The cast is then:
template <typename T>
T* reinterpret_cast_unchecked(handle h) {
return reinterpret_cast<T*&>(
reinterpret_cast<detail::instance*>(h.ptr())->simple_value_holder[0]);
}
NOTE: Of course, this only works for a C++ type wrapped, without multiple inheritance, or inheritance of a C++ type from Python, etc. But it's reasonable to expect this use-case to be fast, and more complex features to not always be executed when not needed.
Generally speaking, whether we can perform many of these lookups only once depends on how much dynamic stuff is done by the user. I expect types to be created at start-up and, if they are created later, not to be used until they are created, and not to be destroyed. In that case, doing these retrievals only once makes sense. I think it should be the default, and, if one really wants fully dynamic type casters that always perform a lookup, they should build their own (or we template it with a true/false flag).
We can get significantly faster, close to the raw C API performance, by only doing what is needed and nothing more.
Creating 1000000 objects from Python
  Raw C API took 0.25s
  Pybind11 FAST took 0.24s
  Pybind11 took 0.40s

Accessing the C++ object from the Python wrapped object
  Raw C API cast took 0.14s
  Pybind11 FAST took 0.14s
  Pybind11 took 0.19s

Accessing many C++ objects from the Python wrapped objects
  Raw C API cast took 1.14s
  Pybind11 FAST took 1.31s
  Pybind11 took 6.37s

Creating many C++ objects and returning them to Python
  Raw C API cast took 1.55s
  Pybind11 FAST took 1.27s
  Pybind11 took 4.38s
I am comparing directly storing a struct with storing a unique_ptr of the struct, but it's easy to adapt.
So I guess we could rewrite, or add new, type casters / templated functions which are simpler.
As that document points to the dispatcher loop as the problem (which is also one of the slowest parts in SWIG, so it's not necessarily completely solvable with code generation): that's a perfect place for memoization based on the argument types. In CPython, cppyy does this explicitly, in PyPy, this is implicitly done by the JIT (types are guarded at the beginning of a JITed trace). If the type conversion of arguments of the memoized call fails, the run-time falls back on the normal, slower, overload resolution.
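To sketch the idea only (a small Python illustration, not cppyy's actual implementation): overload selection is cached on the tuple of argument types, with a fallback to the full resolution when the cached overload rejects the arguments.

# Illustrative sketch of memoized overload dispatch, keyed on argument types.
def make_memoized_dispatcher(overloads):
    cache = {}  # tuple of argument types -> overload that matched last time

    def dispatch(*args):
        key = tuple(type(a) for a in args)
        cached = cache.get(key)
        if cached is not None:
            try:
                return cached(*args)
            except TypeError:
                pass  # argument conversion failed; fall back to full resolution
        for overload in overloads:  # slow path: try each overload in order
            try:
                result = overload(*args)
            except TypeError:
                continue
            cache[key] = overload  # memoize the overload that worked
            return result
        raise TypeError("no matching overload")

    return dispatch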
NOTE: Of course, this only works for a C++ type wrapped, without multiple inheritance, or inheritance of a C++ type from Python, etc.
Yes; and these are still static properties. In cppyy, a flag is set for these cases and an optimized path is chosen, where possible, at run-time based on that flag. (The equivalent of the code above in cppyy is skipping the offset calculation.)
Just wondering what it would take to push some of these performance improvements. There are some changes which I think are beneficial, but they break the API (so either we add new methods and deprecate the current ones, which is ugly, or we increment the major version).
For example, making pybind11::isinstance not fetch the type on every call speeds up this method a lot.
template <typename T, detail::enable_if_t<!std::is_base_of<object, T>::value, int> = 0>
bool fast_isinstance(handle obj) {
static handle type = []() {
handle type = detail::get_type_handle(typeid(T), false);
return type;
}();
return isinstance(obj, type);
}
instead of the current
template <typename T, detail::enable_if_t<!std::is_base_of<object, T>::value, int> = 0>
bool isinstance(handle obj) { return detail::isinstance_generic(obj, typeid(T)); }
However, it assumes that the user won't dynamically add the type after this is called, as it would invalidate the cache.
So how does the governance work for such breaking changes? I am wondering if there is a way to get them in, e.g. writing an email with an exact list of proposals and benchmarks showing the improvements, but I am not sure there is a formal process for getting the owners to decide on these breaking changes. Can someone share how pybind11 deals with such breaking-change proposals?
Motivation
At present (b7dfe5cc), pybind11 proper only benchmarks compile time and artifact size for one given test setup (which tests arguments, simple inheritance, but that's about it, I think); the results can be seen here: https://pybind11.readthedocs.io/en/stable/benchmark.html and https://github.com/pybind/pybind11/blob/v2.6.1/docs/benchmark.rst

However, it may be difficult to objectively and concretely judge the performance impact of a PR, and weigh that against the value of the feature / issue resolution. Generally, benchmarking is done on an ad-hoc basis (totes works, but may make it difficult for less creative people like myself ;)
Primary motivating issues / PRs:

- #1333

Secondary:

- #2322
- #1227 / #1825
- #376
Fuzzy Scoping + Steps
- Benchmark pybind11 itself (out of scope: other binding approaches)
- Cover start-up costs (dlopen-ish stuff, pybind11 internals start-up, binding registration, ...)
- Decide which metrics pybind11 finds the most important (how to weigh compile-time, size, speed, memory, etc.)

Given that performance benchmarks can be a P.I.T.A. (e.g. how to handle OS interrupts, hardware capacity / abstractions, blah blah), ideally decisions should be made about relative performance on the same machine. Ideally, we should also publish some metrics for a given config to give people a "feel" for the performance, as was done for compile time.
Suggested Solution Artifacts
- github.com/pybind/pybind-benchmarks?
- pytest-benchmark
@wjakob @rwgk @rhaschke @YannickJadoul @bstaletic @henryiii @ax3l Can I ask what y'all think? Is this redundant w.r.t. what we already have?