luigibrancati opened this issue 1 year ago
A huge difference between the C++ code and the Python code can be found in your function definition:
```cpp
void profiling(array<array<int, 2>, 7> cards_init, int N){
```
In C++, the `std::array` is going to be handled efficiently by the compiler. In Python, the `std::array` type caster needs to make a new copy of the Python list and convert each internal array, which is a ton of extra work done for each function call. Even if you got rid of that somehow, Python function calls are fairly expensive, and pybind11 calls have even more overhead than plain Python calls, so I would expect 100,000,000 calls to add quite a bit of overhead.
FWIW, this isn't the way to get performance out of pybind11. Anytime you cross the C++/Python boundary, you're going to get hit with some overhead. If you need performance for this example, you would call a single C++ function once that would then call another C++ function 100000000 times.
I'm at a loss for why there's a difference between the old and new versions of the algorithm when executed by python, might be some weird CPU cache thing or some other subtle thing.
> If you need performance for this example, you would call a single C++ function once that would then call another C++ function 100000000 times.
Maybe I'm missing something, since you already mentioned this in the gitter discussion, but isn't this exactly what I'm doing? In Python I'm calling the function `profiling` only once, inside the `timeit.timeit` call:

```python
t1 = timeit.timeit(lambda: Test.profiling_old(test_cards_besthand, profiling_iterations), number=1)
```

Note the `number=1` in the call. The `profiling_iterations` variable is only passed to this function; it's not used in the Python code, and the Python function is bound directly to the C++ function `algo_old::profiling`, as can be seen in `main.cpp`:
```cpp
#include "global.h"
#include "algo_new.h"
#include "algo_old.h"
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

using namespace std;
namespace py = pybind11;
using namespace pybind11::literals;

PYBIND11_MODULE(Test, m) {
    m.def("profiling_old", &algo_old::profiling); // HERE
    m.def("profiling_new", &algo_new::profiling);
}
```
The C++ function `algo_old::profiling` (or the new one, for that matter) then calls `get_best_hand_not_sorted` 100,000,000 times, but this already happens on the C++ side. So my guess is that the C++/Python boundary is crossed only once, and the rest happens exclusively on the C++ side. Am I wrong?
Oh, I misread that, you're totally right.
Wellll... then we're back to cache things? That's weird.
Do you have any suggestions on how to better investigate this?
@luigibrancati You could perhaps try adding logging to `py::cast` operations and see if things are being cast back and forth more than expected.
Required prerequisites
What version (or hash if on master) of pybind11 are you using?
2.10.3
Problem description
Hello,
I'm quite new to pybind11; I started using it recently to provide an easy-to-use Python interface to a C++ code base. I already asked about my problem here and on gitter, where it was suggested that I open a bug report.

In this project (hosted here), I have a monolithic function which I tried to refactor by splitting it into multiple helper functions and one main function, all of them written in C++ with no interaction with Python objects. Python only binds the main function. If I profile the old and the refactored function using only C++, both take almost the same time to run, with a few ms difference; if I bind each to a Python function and run them in a Python script, the refactored function takes twice as long as the old function.

The closest I could find to this issue is this issue. However, I don't think the performance degradation I experienced is due to too many objects crossing the C++ <--> Python boundary, since in my code all calculations happen exclusively on the C++ side while Python works only as an interface (as far as I can tell, at least).
Code
I'm sorry the code is quite long, but I wasn't able to reduce it more than what follows. First, I have a header `global.h` with 2 classes not bound to Python and 2 variables needed later for profiling:

global.h
In the header `algo_old.h` I have the old monolithic function `algo_old::get_best_hand`, which I tried refactoring. Note that at the end of this header I added a `profiling` function in order to ease profiling of this C++ code later on:

algo_old.h
Next I have the header `algo_new.h` with the refactored function `algo_new::get_best_hand` (I added the `profiling` function here too):

algo_new.h
Last, I have the file `main.cpp` where the Python binding happens. Notice that here I actually bind the `profiling` functions, but as you can see above these functions just call `get_best_hand_not_sorted` and thus `get_best_hand` multiple times:

main.cpp
Below I also provide the `pyproject.toml` and the `setup.py` files used to build this code:

pyproject.toml

setup.py
Profiling
In order to make sure that both versions of `get_best_hand` take the same time, I profile them using `gprof`. For profiling I use the following cpp files (note that `test_cards` and `profiling_iterations` are defined inside `global.h`):

profiling_old.cpp

profiling_new.cpp
I compile both files using the `-O2` optimization flag (the `-pg` flag is needed to run `gprof`) and use `gprof` to profile them. The output from `gprof` is as follows:

prof_cpp_old.txt

prof_cpp_new.txt
The running time differs by 60 ms for a total of 100,000,000 runs. If I try to run these functions in Python, though, the outcome is quite different:

performances.py
As you can see, the functions take much longer than their C++ versions, but most importantly the refactored function takes 3 times as long as the old one, despite both taking the same time when using only C++.
Reproducible example code
Provided in problem description
Is this a regression? Put the last known working version here if it is.
Not a regression