pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/
Other
15.73k stars 2.11k forks source link

[BUG]: Performance gap between two functions when running code in Python, almost no gap in C++ #4533

Open luigibrancati opened 1 year ago

luigibrancati commented 1 year ago

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

2.10.3

Problem description

Hello,

I'm quite new to pybind11, I started using it recently to provide an easy to use Python interface to a C++ code. I already tried asking about my problem here and on gitter, where I was suggested to open a bug report.

In this project (hosted here), I have a monolithic function which I tried to refactor by splitting it into multiple helper functions and one main function, all of them written in C++ with no interactions with Python objects. Python only binds the main function. If I profile the old and refactored function using only C++ both take almost the same time to run, with a few ms difference; if I bind each to a Python function and run them in a python script the refactored function takes twice as long as the old function.

The closest I could find to this issue is this issue. However, I don't think that the performance degradation I experienced is due to too many objects crossing the C++ <--> Python boundary since in my code all calculations happen exclusively on the C++ side while Python works only as an interface (as far as I can tell, at least).

Code

I'm sorry if the code is quite long, but I wasn't able to reduce it more than what follows. First, I have an header global.h with 2 classes not bound to Python and 2 variables needed later for profiling

global.h

#ifndef _GLOBAL_H
#define _GLOBAL_H

#include <string>
#include <array>

enum Suit {Hearts = 1, Diamonds, Clubs, Spades = 4};
enum HandType {
    HighCard = 1,
    Pairs,
    DoublePairs,
    Triples,
    Straight,
    Flush,
    FullHouse,
    Poker,
    StraightFlush,
    RoyalFlush = 10
};

struct Card{
    // 1 to 13 (2 to A)
    short value;
    // 0 to 3 (H to D)
    Suit suit;

    Card():
        value(0),
        suit((Suit) 0)
    {}

    Card(short value, short suit):
        value(value),
        suit((Suit) suit)
    {}

    bool operator==(const Card& rhs) const {
        return (this->value == rhs.value) && (this->suit == rhs.suit);
    }

    bool operator>(const Card& rhs) const {
        if(this->value > rhs.value){
            return true;
        } else if(this->value == rhs.value) {
            return this->suit > rhs.suit;
        }
        return false;
    }

    bool operator>=(const Card& rhs) const {
        return (*this > rhs) || (*this == rhs);
    }
};

struct Hand{
    HandType hand_type;
    array<Card,5> Cards;

    Hand():
        hand_type(HighCard),
        Cards()
    {}

    Hand(short hand_type, array<Card,5> cards):
        hand_type((HandType) hand_type),
        Cards(cards)
    {}
};

// Variables used for profiling
std::array<std::array<int, 2>, 7> test_cards = {{{12, 2},{9, 1},{12, 3},{13, 3},{4, 3},{3, 2},{2, 2}}};
const int profiling_iterations = 100000000;

#endif

In the header algo_old.h I have the old monolithic function algo_old::get_best_hand which I tried refactoring. Note that at the end of this header I added profiling function in order to ease profiling of this C++ code later on

algo_old.h

#ifndef _ALGORITHM_OLD_H
#define _ALGORITHM_OLD_H

#include "global.h"
#include <string>
#include <vector>
#include <array>
#include <algorithm>

using namespace std;

namespace algo_old {

    Hand get_best_hand(array<Card,7> cards){
        // Assumes the cards are ordered in decreasing order
        // Divide the cards by suit
        array<array<Card,7>,4> color_cards;
        array<int,4> num_color_cards = {0,0,0,0};
        color_cards[cards[0].suit - 1][0] = cards[0];
        num_color_cards[cards[0].suit - 1]++;
        // To get the straights
        array<Card,5> current_straight;
        current_straight[0] = cards[0];
        int current_straight_size = 1;
        array<Card,5> straight;
        bool found_straight = false;
        // To get straight flushes
        array<array<Card,5>,4> straight_flushes;
        straight_flushes[cards[0].suit - 1][0] = cards[0];
        array<int,4> straight_flushes_sizes = {0,0,0,0};
        straight_flushes_sizes[cards[0].suit - 1] += 1;
        array<Card,5> straight_flush;
        bool found_straight_flush = false;
        // To get rest of groupings
        array<array<Card,2>,3> pairs;
        int num_pairs = 0;
        array<array<Card,3>,2> triples;
        int num_triples = 0;
        array<Card,4> pokers;
        bool found_poker = false;
        array<Card,7> individuals;
        int num_individuals = 0;
        // Temporary variables;
        array<Card,4> group;
        int group_size = 1;
        group[0] = cards[0];
        Card last_card = cards[0];
        // Iterate through the sorted cards to extract the possible hands
        for(int i=1; i<7; i++)
        {
            // Search for pairs,triples,etc
            if (cards[i].value == last_card.value){
                group[group_size] = cards[i];
                group_size++;
            }else{
                switch (group_size)
                {
                case 1:
                    individuals[num_individuals] = group[0];
                    num_individuals++;
                    break;
                case 2:
                    pairs[num_pairs][0] = group[0];
                    pairs[num_pairs][1] = group[1];
                    num_pairs++;
                    break;
                case 3:
                    copy(group.begin(),group.begin()+3,triples[num_triples].begin());
                    num_triples++;
                    break;
                case 4:
                    pokers = group;
                    found_poker = true;
                    break;
                }
                group_size = 1;
                group[0] = cards[i];
            }
            // Search for straights
            if (!found_straight){
                if (last_card.value - cards[i].value == 1){
                    current_straight[current_straight_size] = cards[i];
                    current_straight_size++;
                }else if (cards[i].value != last_card.value){
                    current_straight_size = 1;
                    current_straight[0] = cards[i];
                }
                if (current_straight_size == 5){
                    straight = current_straight;
                    //copy(current_straight,current_straight+5,straight);
                    current_straight_size = 0;
                    found_straight = true;
                }
            }
            // Search for flushes
            color_cards[cards[i].suit - 1][num_color_cards[cards[i].suit - 1]] = cards[i];
            num_color_cards[cards[i].suit - 1]++;
            // Search for straight flushes
            if (straight_flushes_sizes[cards[i].suit - 1] != 0){
                if (straight_flushes[cards[i].suit - 1][straight_flushes_sizes[cards[i].suit - 1]-1].value - cards[i].value == 1){
                    straight_flushes[cards[i].suit - 1][straight_flushes_sizes[cards[i].suit - 1]] = cards[i];
                    straight_flushes_sizes[cards[i].suit - 1]++;
                    if (straight_flushes_sizes[cards[i].suit - 1] == 5){
                        straight_flush = straight_flushes[cards[i].suit - 1];
                        //copy(straight_flushes[cards[i].suit - 1],straight_flushes[cards[i].suit - 1]+5,straight_flush);
                        found_straight_flush = true;
                        straight_flushes_sizes[cards[i].suit - 1] = 0;
                    }
                }else{
                    straight_flushes_sizes[cards[i].suit - 1] = 1;
                    straight_flushes[cards[i].suit - 1][0] = cards[i];
                }
            }else{
                straight_flushes[cards[i].suit - 1][0] = cards[i];
                straight_flushes_sizes[cards[i].suit - 1]++;
            }
            last_card = cards[i];
        }
        // Add the last group
        switch (group_size)
        {
        case 1:
            individuals[num_individuals] = group[0];
            num_individuals++;
            break;
        case 2:
            pairs[num_pairs][0] = group[0];
            pairs[num_pairs][1] = group[1];
            num_pairs++;
            break;
        case 3:
            copy(group.begin(),group.begin()+3,triples[num_triples].begin());
            num_triples++;
            break;
        case 4:
            pokers = group;
            found_poker = true;
            break;
        }
        // Check for straight flushes
        if (!found_straight_flush){
            // Iterate through the cards to see if ACES of some suit makes a straight flush
            for(int i=0; i<7; i++)
            {
                if (cards[i].value != 13){
                    break;
                }else if (straight_flushes_sizes[cards[i].suit - 1] == 4 && straight_flushes[cards[i].suit - 1][0].value == 4){
                    copy(straight_flushes[cards[i].suit - 1].begin(),straight_flushes[cards[i].suit - 1].begin()+4,straight_flush.begin());
                    straight_flush[4] = cards[i];
                    found_straight_flush = true;
                    break;
                }

            }
        }
        // Create the result hand
        Hand result;
        if (found_straight_flush){
            if (straight_flush[0].value == 13){
                result.hand_type = RoyalFlush;
            }else{
                result.hand_type = StraightFlush;
            }
            result.Cards = straight_flush;
            // copy(straight_flush,straight_flush+5,result.Cards);
            return result;
        }

        if (found_poker){
            result.hand_type = Poker;
            copy(pokers.begin(),pokers.end(),result.Cards.begin());
            for(int i=0; i<7; i++){
                if (cards[i].value != pokers[0].value){
                    result.Cards[4] = cards[i];
                    break;
                }
            }
            return result;
        }

        if (num_triples > 0 && num_pairs > 0){
            result.hand_type = FullHouse;
            copy(triples[0].begin(),triples[0].end(),result.Cards.begin());
            result.Cards[3] = pairs[0][0];
            result.Cards[4] = pairs[0][1];
            return result;
        }else if(num_triples == 2){
            result.hand_type = FullHouse;
            copy(triples[0].begin(),triples[0].end(),result.Cards.begin());
            result.Cards[3] = triples[1][0];
            result.Cards[4] = triples[1][1];
            return result;
        }
        // Check flushes
        for (int i = 0; i < 4; i++)
        {
            if (num_color_cards[i] >= 5){
                result.hand_type = Flush;
                copy(color_cards[i].begin(),color_cards[i].begin()+5,result.Cards.begin());
                return result;
            }
        }

        // Check if the straight form with an ACE
        if (!found_straight && current_straight_size == 4 && current_straight[3].value == 1 && cards[0].value == 13){
            copy(current_straight.begin(),current_straight.begin()+4,straight.begin());
            found_straight = true;
            straight[4] = cards[0];
        }

        if (found_straight){
            result.hand_type = Straight;
            result.Cards = straight;
            return result;
        }

        if (num_triples == 1){
            result.hand_type = Triples;
            copy(triples[0].begin(),triples[0].begin()+3,result.Cards.begin());
            result.Cards[3] = individuals[0];
            result.Cards[4] = individuals[1];
            return result;
        }

        if (num_pairs >= 2){
            result.hand_type = DoublePairs;
            result.Cards[0] = pairs[0][0];
            result.Cards[1] = pairs[0][1];
            result.Cards[2] = pairs[1][0];
            result.Cards[3] = pairs[1][1];
            result.Cards[4] = individuals[0];
            return result;
        }

        if (num_pairs == 1){
            result.hand_type = Pairs;
            result.Cards[0] = pairs[0][0];
            result.Cards[1] = pairs[0][1];
            result.Cards[2] = individuals[0];
            result.Cards[3] = individuals[1];
            result.Cards[4] = individuals[2];
            return result;
        }
        result.hand_type = HighCard;
        copy(individuals.begin(),individuals.begin()+5,result.Cards.begin());
        return result;
    }

    Hand get_best_hand_not_sorted(array<Card,7> cards){
        sort(cards.begin(),cards.end(),[](Card &a,Card &b){return a.value > b.value;});
        return get_best_hand(cards);
    }

    void profiling(array<array<int, 2>, 7> cards_init, int N){
        array<Card, 7> cards;
        for(int i = 0; i < 7; ++i) cards[i] = Card(cards_init[i][0], cards_init[i][1]);
        for (int i = 0; i < N; ++i) {
            get_best_hand_not_sorted(cards);
        }
    }
}

#endif

Next I have the header algo_new.h with the refactored function algo_new::get_best_hand (I added the profiling function here too)

algo_new.h

#ifndef _ALGORITHM_NEW_H
#define _ALGORITHM_NEW_H

#include "global.h"
#include <string>
#include <vector>
#include <array>
#include <algorithm>

using namespace std;

namespace algo_new {

    using card_range = pair<int8_t, int8_t>;
    using card_vec = vector<int8_t>;
    const int8_t NULL_PTR = -1;
    const card_range NULL_RANGE = {NULL_PTR, NULL_PTR};
    const uint8_t TOTAL_CARDS = 52;

    template<class T>
    void sort_cards(T& cards) {
        std::sort(cards.begin(), cards.end(), [](Card& a, Card& b){return (a>=b);});
    }

    card_vec expand_range(card_range range) {
        // Converts two pointers with a range into vector of pointers
        if (range != NULL_RANGE) {
            card_vec card_vec;
            card_vec.reserve(range.second-range.first+1);
            for(int8_t i = range.first; i <= range.second; i++){
                card_vec.push_back(i);
            }
            return card_vec;
        } else {
            return {};
        }
    }

    void copy_vec(array<Card,7>& cards, array<Card, 5>& array, card_vec& vec) {
        for (int8_t i = 0; i < 5; i++) array[i] = cards[vec[i]];
    }

    void sort_vec(card_vec& vec){
        sort(vec.begin(), vec.end(), [](int8_t& a, int8_t& b){return a<=b;});
    }

    card_vec pad_card_vec(card_vec vec, array<Card,7>& cards, uint8_t N) {
        // Expand a vector of cards by including cards in value order
        // Assumes the cards are ordered in decreasing order
        vec.reserve(N);
        for(int8_t c = 0; (c < cards.size()) && (vec.size() < N); c++) {
            if(find(vec.begin(), vec.end(), c) == vec.end()) {
                vec.push_back(c);
            }
        }
        sort_vec(vec);
        return vec;
    }

    uint8_t count_aces(array<Card, 7>& array){
        uint8_t count = 0;
        for(Card card:array){
            if(card.value == 13) ++count;
        }
        return count;
    }

    card_vec find_flush(array<Card,7>& cards) {
        // Returns a vector, either empty or with the 5 cards making a flush
        // Assumes the cards are ordered in decreasing order
        card_vec flush_cards;
        flush_cards.reserve(5);
        for (int8_t start = 0; start < 3; start++){
            Suit s = cards[start].suit;
            uint8_t num_same_suit = 0;
            for (int8_t j = start; j < cards.size(); j++){
                if((num_same_suit < 5) && (cards[j].suit == s)){
                    ++num_same_suit;
                    flush_cards.push_back(j);
                }
            }
            if(num_same_suit>=5){
                return flush_cards;
            } else {
                flush_cards.clear();
            }
        }
        return flush_cards;
    }

    bool is_straight(array<Card,7>& cards, card_range range) {
        // Assumes the cards are ordered in decreasing order
        for(int8_t i = range.first+1; i <= range.second; i++) {
            if(cards[i].value != (cards[i-1].value - 1)){
                return false;
            }
        }
        return true;
    }

    vector<card_vec> find_all_straights(array<Card,7>& cards) {
        // Returns a vector of vectors, either empty or with all straights found
        // Assumes the cards are ordered in decreasing order
        vector<card_vec> all_straight_cards;
        all_straight_cards.reserve(3); // max 3 straights
        // If cards are already ordered one of first, second or third card must be part of a straight (if there's one)
        // Start from high card, so straights are ordered based on high card
        for(uint8_t start = 0; start < 3; start++){
            card_range range = {start, start + 4};
            if(is_straight(cards, range)) {
                all_straight_cards.push_back(expand_range(range));
            }
        }
        // Consider ACE = 1
        uint8_t num_aces = count_aces(cards);
        if(num_aces > 0){
            // In order to convert ACE to 1 take the modulus 12 and reorder the cards
            array<Card,7> cards_module = cards;
            for(uint8_t i = 0; i<num_aces; i++){
                cards_module[i].value = 1;
            }
            sort_cards<array<Card,7>>(cards_module);
            for(uint8_t start = 0; start < 3; start++){
                card_range range = {start, start + 4};
                if(is_straight(cards_module, range)) {
                    // Convert range to original cards
                    card_vec vec = expand_range(range);
                    for(auto& v: vec) v = (v + num_aces)%7;
                    sort_vec(vec);
                    all_straight_cards.push_back(vec);
                }
            }
        }
        return all_straight_cards;
    }

    card_range find_repetition(array<Card,7>& cards, card_range range, uint8_t N) {
        // Returns a vector, either {null, null} or with the FIRST (i.e. with the highest cards) repetition of N cards found
        // Assumes the cards are ordered in decreasing order
        for(int8_t s = range.first; s <= (range.second - N + 1); s++){
            int8_t e = (s + N - 1);
            if(cards[s].value == cards[e].value){
                return {s, e};
            }
        }
        return NULL_RANGE;
    }

    vector<card_range> find_all_repetitions(array<Card,7>& cards, uint8_t N) {
        // Returns a vector, either empty or with the ALL repetition of N cards found
        // Assumes the cards are ordered in decreasing order
        vector<card_range> all_reps;
        int8_t start = 0;
        while((cards.size() - start) > 1) {
            card_range rep = find_repetition(cards, card_range({start, cards.size() - 1}), N);
            if(rep != NULL_RANGE){
                all_reps.push_back(rep);
                start = rep.second + 1;
            } else {
                ++start;
            }
        }
        return all_reps;
    }

    Hand get_best_hand(array<Card,7>& cards){
        // Assumes the cards are ordered in decreasing order
        Hand best_hand;
        card_vec flush_vec = find_flush(cards);
        vector<card_vec> all_straights = find_all_straights(cards);
        bool has_flush = !flush_vec.empty();
        bool has_straight = !all_straights.empty();
        // Look for royal/straight flush
        if(has_flush && has_straight) {
            // Look among straights which one is also a flush
            for(card_vec& straight_vec: all_straights){
                if(flush_vec == straight_vec) {
                    best_hand.hand_type = (cards[flush_vec[0]].value == 13)? RoyalFlush : StraightFlush;
                    copy_vec(cards, best_hand.Cards, flush_vec);
                    return best_hand;
                }
            }
        }
        // Look for poker
        card_range poker = find_repetition(cards, card_range({0, cards.size() - 1}), 4);
        if((poker.second + 1 - poker.first) == 4){
            best_hand.hand_type = Poker;
            card_vec poker_vec = expand_range(poker);
            poker_vec = pad_card_vec(poker_vec, cards, 5);
            copy_vec(cards, best_hand.Cards, poker_vec);
            return best_hand;
        }
        // Look for full house
        card_range triple = find_repetition(cards, card_range({0, cards.size() - 1}), 3);
        if (triple != NULL_RANGE) {
            card_range pair_1 = find_repetition(cards, card_range({0, triple.first - 1}), 2); // Pairs can happen before or after the triple
            card_range pair_2 = find_repetition(cards, card_range({triple.second + 1, cards.size() - 1}), 2);
            card_range pair = (pair_1 != NULL_RANGE ? pair_1 : pair_2);
            if(pair != NULL_RANGE){
                best_hand.hand_type = FullHouse;
                card_vec full_vec = expand_range(triple);
                card_vec pair_vec = expand_range(pair);
                full_vec.reserve(5);
                full_vec.insert(full_vec.end(), pair_vec.begin(), pair_vec.end());
                sort_vec(full_vec);
                copy_vec(cards, best_hand.Cards, full_vec);
                return best_hand;
            }
        }
        // Look for flush
        if(has_flush){
            best_hand.hand_type = Flush;
            copy_vec(cards, best_hand.Cards, flush_vec);
            return best_hand;
        }
        // Look for straight
        if(has_straight){
            best_hand.hand_type = Straight;
            // Straight are already ordered, best is first
            copy_vec(cards, best_hand.Cards, all_straights[0]);
            return best_hand;
        }
        // Look for triples
        // use triple from Full House
        if(triple != NULL_RANGE){
            best_hand.hand_type = Triples;
            card_vec triple_vec = expand_range(triple);
            triple_vec = pad_card_vec(triple_vec, cards, 5);
            copy_vec(cards, best_hand.Cards, triple_vec);
            return best_hand;
        }
        // Look for pairs or double pairs
        vector<card_range> all_pairs = find_all_repetitions(cards, 2);
        if(all_pairs.size() >= 2) {
            best_hand.hand_type = DoublePairs;
            card_vec dp_vec = expand_range(all_pairs[0]);
            dp_vec.reserve(5);
            card_vec temp = expand_range(all_pairs[1]);
            dp_vec.insert(dp_vec.end(), temp.begin(), temp.end());
            dp_vec = pad_card_vec(dp_vec, cards, 5);
            copy_vec(cards, best_hand.Cards, dp_vec);
            return best_hand;
        } else if (all_pairs.size() == 1) {
            best_hand.hand_type = Pairs;
            card_vec pair_vec = expand_range(all_pairs[0]);
            pair_vec = pad_card_vec(pair_vec, cards, 5);
            copy_vec(cards, best_hand.Cards, pair_vec);
            return best_hand;
        }
        // High card
        best_hand.hand_type = HighCard;
        copy(cards.begin(), cards.begin() + 5, best_hand.Cards.begin());
        return best_hand;
    }

    Hand get_best_hand_not_sorted(array<Card,7> cards){
        sort_cards<array<Card,7>>(cards);
        return get_best_hand(cards);
    }

    void profiling(array<array<int, 2>, 7> cards_init, int N){
        array<Card, 7> cards;
        for(int i = 0; i < 7; ++i) cards[i] = Card(cards_init[i][0], cards_init[i][1]);
        for (int i = 0; i < N; ++i) {
            get_best_hand_not_sorted(cards);
        }
    }
}

#endif

Last I have the file main.cpp where the python binding happens: notice that here I actually bind the profiling functions, but as you can see above these functions just call get_best_hand_not_sorted and thus get_best_hand multiple times.

main.cpp

#include "global.h"
#include "algo_new.h"
#include "algo_old.h"
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

using namespace std;
namespace py = pybind11;
using namespace pybind11::literals;

PYBIND11_MODULE(Test, m) {
    m.def("profiling_old", &algo_old::profiling);
    m.def("profiling_new", &algo_new::profiling);
}

Below I also provide the pyproject.toml and the setup.py files used to build this code

pyproject.toml

[build-system]
requires = ["setuptools>=64.0", "wheel", "pybind11==2.10.3"]
build-backend = "setuptools.build_meta"

setup.py

# Available at setup time due to pyproject.toml
from pybind11.setup_helpers import Pybind11Extension
from setuptools import setup

__version__ = "0.0.1.dev0"

ext_modules = [
    Pybind11Extension(
        "Test",
        ["src/main.cpp"],
        cxx_std=20,
        extra_compile_args = ['-O2']
    ),
]

setup(
    name="Test",
    version=__version__,
    ext_modules=ext_modules,
    install_requires=[
        'pybind11==2.10.3'
    ],
    package_data={
        "Test":["py.typed"],
    },
    packages=["Test"],
    zip_safe=False,
    python_requires=">=3.7",
)

Profiling

In order to make sure that both version of get_best_hand take the same time, I profile them using gprof. For profiling I use the following cpp files (note that test_cards and profiling_iterations are defined inside global.h)

profiling_old.cpp

#include "src/global.h"
#include "src/algo_old.h"

int main(){
    algo_old::profiling(test_cards, profiling_iterations);
    return 0;
}

profiling_new.cpp

#include "src/global.h"
#include "src/algo_new.h"

int main(){
    algo_new::profiling(test_cards, profiling_iterations);
    return 0;
}

I compile both files as follows using the -O2 optimization flag (the -pg flag is needed to run gprof) and use gprof to profile them

$ g++ profiling_old.cpp -pg -o profiling_old.out -O2 --std=c++20 && ./profiling_old.out
$ gprof profiling_old.out gmon.out > prof_cpp_old.txt
$ g++ profiling_new.cpp -pg -o profiling_new.out -O2 --std=c++20 && ./profiling_new.out
$ gprof profiling_new.out gmon.out > prof_cpp_new.txt

The output from gprof is as follows

prof_cpp_old.txt

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 72.83      3.35     3.35 100000000     0.00     0.00  algo_old::get_best_hand(std::array<Card, 7ul>)
 20.43      4.29     0.94 100000000     0.00     0.00  algo_old::get_best_hand_not_sorted(std::array<Card, 7ul>)
  5.00      4.52     0.23        1     0.23     4.58  algo_old::profiling(std::array<std::array<int, 2ul>, 7ul>, int)
  1.30      4.58     0.06 100000000     0.00     0.00  void std::__introsort_loop<Card*, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_old::get_best_hand_not_sorted(std::array<Card, 7ul>)::{lambda(Card&, Card&)#1}> >(Card*, Card*, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_old::get_best_hand_not_sorted(std::array<Card, 7ul>)::{lambda(Card&, Card&)#1}>)
  0.43      4.60     0.02                             _init

prof_cpp_new.txt

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 19.19      1.00     1.00 100000000     0.00     0.00  algo_new::find_all_straights(std::array<Card, 7ul>&)
 14.59      1.76     0.76 100000000     0.00     0.00  algo_new::find_flush(std::array<Card, 7ul>&)
 13.44      2.46     0.70 100000000     0.00     0.00  algo_new::get_best_hand(std::array<Card, 7ul>&)
  8.45      2.90     0.44 100000000     0.00     0.00  algo_new::pad_card_vec(std::vector<signed char, std::allocator<signed char> >, std::array<Card, 7ul>&, unsigned char)
  7.49      3.29     0.39                             _init
  6.33      3.62     0.33 300000000     0.00     0.00  std::vector<signed char, std::allocator<signed char> >::reserve(unsigned long)
  5.18      3.89     0.27        1     0.27     4.82  algo_new::profiling(std::array<std::array<int, 2ul>, 7ul>, int)
  4.99      4.15     0.26 100000000     0.00     0.00  algo_new::find_all_repetitions(std::array<Card, 7ul>&, unsigned char)
  4.41      4.38     0.23 100000000     0.00     0.00  algo_new::sort_vec(std::vector<signed char, std::allocator<signed char> >&)
  3.84      4.58     0.20 100000000     0.00     0.00  algo_new::expand_range(std::pair<signed char, signed char>)
  3.45      4.76     0.18 100000000     0.00     0.00  void std::vector<std::pair<signed char, signed char>, std::allocator<std::pair<signed char, signed char> > >::_M_realloc_insert<std::pair<signed char, signed char> const&>(__gnu_cxx::__normal_iterator<std::pair<signed char, signed char>*, std::vector<std::pair<signed char, signed char>, std::allocator<std::pair<signed char, signed char> > > >, std::pair<signed char, signed char> const&)
  2.88      4.91     0.15 300000000     0.00     0.00  std::_Vector_base<signed char, std::allocator<signed char> >::~_Vector_base()
  2.30      5.03     0.12 200000000     0.00     0.00  void std::__introsort_loop<Card*, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_new::sort_cards<std::array<Card, 7ul> >(std::array<Card, 7ul>&)::{lambda(Card&, Card&)#1}> >(Card*, Card*, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_new::sort_cards<std::array<Card, 7ul> >(std::array<Card, 7ul>&)::{lambda(Card&, Card&)#1}>)
  2.30      5.15     0.12 100000000     0.00     0.00  std::vector<signed char, std::allocator<signed char> >::vector(std::vector<signed char, std::allocator<signed char> > const&)
  1.15      5.21     0.06 100000000     0.00     0.00  void std::__introsort_loop<__gnu_cxx::__normal_iterator<signed char*, std::vector<signed char, std::allocator<signed char> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_new::sort_vec(std::vector<signed char, std::allocator<signed char> >&)::{lambda(signed char&, signed char&)#1}> >(__gnu_cxx::__normal_iterator<signed char*, std::vector<signed char, std::allocator<signed char> > >, __gnu_cxx::__normal_iterator<signed char*, std::vector<signed char, std::allocator<signed char> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<algo_new::sort_vec(std::vector<signed char, std::allocator<signed char> >&)::{lambda(signed char&, signed char&)#1}>)

The running time differs by 60 ms for a total of 100,000,000 runs. If I try to run these functions in Python, though, the outcome is quite different

performances.py

import Test
import timeit

test_cards_besthand = [[12, 2],[9, 1],[12, 3],[13, 3],[4, 3],[3, 2],[2, 2]]
profiling_iterations = 100000000

t1 = timeit.timeit(lambda: Test.profiling_old(test_cards_besthand, profiling_iterations), number=1)
print(f"old: {t1:.3f}")

t2 = timeit.timeit(lambda: Test.profiling_new(test_cards_besthand, profiling_iterations), number=1)
print(f"new: {t2:.3f}")
$ python performances.py
old: 5.705
new: 15.752

As you can see, the functions take much longer than their C++ version but most importantly the refactored function takes 3 times as long as the old function, despite them taking the same time when using only C++.

Reproducible example code

Provided in problem description

Is this a regression? Put the last known working version here if it is.

Not a regression

virtuald commented 1 year ago

A huge difference between the C++ code and the python code can be found in your function definition:

void profiling(array<array<int, 2>, 7> cards_init, int N){

For C++, the std::array is going to be handled efficiently by the compiler. In python, the std::array type caster needs to make a new copy of the python list and convert each internal array, which is going to be a ton of extra work done for each function call. Even if you got rid of that somehow, python function calls are fairly expensive, so I would expect 100000000 calls to have quite a bit of additional overhead since pybind11 calls have more overhead than python.

FWIW, this isn't the way to get performance out of pybind11. Anytime you cross the C++/Python boundary, you're going to get hit with some overhead. If you need performance for this example, you would call a single C++ function once that would then call another C++ function 100000000 times.

I'm at a loss for why there's a difference between the old and new versions of the algorithm when executed by python, might be some weird CPU cache thing or some other subtle thing.

luigibrancati commented 1 year ago

If you need performance for this example, you would call a single C++ function once that would then call another C++ function 100000000 times.

Maybe I'm missing something, since you already mentioned this in the gitter discussion, but isn't this exactly what I'm doing? In Python I'm calling the function profiling only once inside the timeit.timeit call

t1 = timeit.timeit(lambda: Test.profiling_old(test_cards_besthand, profiling_iterations), number=1)

Note the number=1 in the call. The profiling_iterations variable is only passed to this function, it's not used in the Python code, and the Python function is directly bound to the C++ function algo_old::profiling as can be seen in main.cpp

#include "global.h"
#include "algo_new.h"
#include "algo_old.h"
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

using namespace std;
namespace py = pybind11;
using namespace pybind11::literals;

PYBIND11_MODULE(Test, m) {
    m.def("profiling_old", &algo_old::profiling); // HERE
    m.def("profiling_new", &algo_new::profiling);
}

The C++ function algo_old::profiling (or the new one, for what matters) then calls get_best_hand_not_sorted 100 000 000 times, but this happens already on the C++ side.

So my guess is that the C++/Python boundary is crossed only once, the rest happens exclusively on the C++ side. Am I wrong?

virtuald commented 1 year ago

Oh, I misread that, you're totally right.

Wellll... then we're back to cache things? That's weird.

luigibrancati commented 1 year ago

Oh, I misread that, you're totally right.

Wellll... then we're back to cache things? That's weird.

Do you have any suggestion on how to better investigate this?

Skylion007 commented 1 year ago

@luigibrancati You could try to add logging to py::cast operations perhaps and see if things are being casted back and forth more than expected.