py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/pyfixest.html
MIT License
120 stars 28 forks source link

C++ 11 backend via fixest/fixest2 API #66

Open pachadotdev opened 1 year ago

pachadotdev commented 1 year ago

Hi! I have a work in progress that I am writing in a more "pure" C++ grammar: https://github.com/pachadotdev/fixest2/tree/cpp11_wip My idea is to eventually create something like Apache Arrow, where the core functionality is independent from R, and it provides the same set of function on R, Python, JavaScript, etc. In this case, my goal will be covering R and Python. One of the problems that I have now are unassigned memory and memory leaks, and I need to debug those. At the moment, simple OLS works well and it was rewritten in a header-only C++ backend with the idea of also using vendoring at some point.

pachadotdev commented 1 year ago

Hi @lrberge! Did you see this excellent idea in Python? I have nothing to do with it, and I want to see it completed to compare my results in R against a Python version.

s3alfisc commented 1 year ago

Hi @pachadotdev - thanks for pointing me to your fixest2 project - it looks pretty neat! I think we have talked about it at some point in the ropensci slack? I am surprised I did not star it back in the day, to be honest =)

Eventually I would love to move the heavy lifting - mostly the alternating projections algorithm - to a lower lever back end. So please keep me in the loop on your progress!

I have set up some tests against fixest via rpy2 and run them via github CI - you can find the tests here and the CI yaml here. Maybe this will be useful for you?

pachadotdev commented 1 year ago

thanks! https://github.com/s3alfisc/pyfixest/blob/master/tests/test_vs_fixest.py is very useful indeed! I'll keep you posted, and eventually ppl should be able to call fixest2 from Python without calling R

pachadotdev commented 1 year ago

also, I think so about the ropensci slack, I do not connect so often

lrberge commented 1 year ago

Regarding the python port, IMO the backend for the demeaning would be the best way to go and the effort wouldn't be that important since everything to be dealt with lies in a single function, cpp_demean.

That function takes in R objects and converts everything into a C++ structure (in particular via the sMat and sVec classes and their initialization which can easily be adapted). The demeaning algorithm is full C++ and then is portable.

Only the very first input (transforming/adapting python objects into C++ structures) and the output need to be taken care of. It looks doable but requires someone knowledgeable in python data structures.

One word of caution though: the API needs to be constructed carefully (currently the input to cpp_demean may be too flexible / what is the best data structure in output?) and this may need some back and forth.

s3alfisc commented 1 year ago

Hi @lrberge - thanks for chiming in here =) Great that the demeaning actually "lives" in only one function. That should definitely help bring the task within my range of skill. Though I am definitely not an expert on Python /c++ data types and also still have to get up to speed with how to integrate c++ into python (the most popular options seem to be either cython or pybind). I'll ping both of you once I have played around with both frameworks and your c++ codes, I am certain that I will need some help =)

pachadotdev commented 1 year ago

@s3alfisc I think that "copying" from Apache Arrow would be great. They wrote the main library in C++ and then added Python, R, JavaScript etc layers to use Arrow in other languages. The good news is that a big part of fixest is C++ but in Rcpp flavour. Let's say, fixest is vanilla yogurt, but we need unflavored yogurt to call it from Python, if we "cook" it a bit, we can make it "taste" different. What I made with cpp11 is "low calorie vanilla", but still needs additional processing. i shall write on ropensci about this idea.

s3alfisc commented 1 year ago

Great metaphors @pachadotdev! 😄

pachadotdev commented 1 year ago

also @s3alfisc the cpp11 version (https://github.com/pachadotdev/fixest2/blob/cpp11_wip/R/cpp11_default_parameters.R) has functions() and functions_(), which is an initial step to think about separating the C++ backend from the frontend in R/Python/Java/etc. When you call fun_a() then that function calls fun_a_() and passes default params to it.

s3alfisc commented 1 year ago

Hi @pachadotdev and @lrberge - I have just finished implementing a python clone of fixest's demeaning function. I use numba, which is JIT compiled and promises 'c like performance with python syntax'. It's a very naive implementation of the MAP algorithm and shockingly (but maybe unsurprisingly ?) around 10 - 20 times slower than fixest. You can find the implementation here.

I have tried to look at the source code of cpp_demean to understand if fixest uses any tricks to squeeze out better performance, but did not get any smarter (turns out I am not very good at reading c++ code). I have a few questions on the algo, and if you have the time, it would be great if you could answer them!

Thanks! =)

s3alfisc commented 1 year ago

As a sidepoint, my main takeaway is that I should probably just take the bullet and read up on integrating c++ code into a python package =D

pachadotdev commented 1 year ago

This is amazing!

Fixest uses OMP and does what you said.

Regarding the C++ integration, let me get a 100% working cpp11 version (unofficial), and we can connect it directly with Python.

Best,

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Alexander Fischer @.> Sent: Sunday, June 4, 2023 7:56:45 AM To: s3alfisc/pyfixest @.> Cc: Mauricio Andres Vargas Sepulveda @.>; Mention @.> Subject: Re: [s3alfisc/pyfixest] C++ 11 backend via fixest/fixest2 API (Issue #66)

As a sidepoint, my main takeaway is that I should probably just take the bullet and read up on integrating c++ code into a python package =D

— Reply to this email directly, view it on GitHubhttps://github.com/s3alfisc/pyfixest/issues/66#issuecomment-1575539971, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACM7UOP2ISDEZHHRPM4MUCDXJRZX3ANCNFSM6AAAAAAXRPRYDI. You are receiving this because you were mentioned.Message ID: @.***>

aeturrell commented 9 months ago

Just spotted this, and super excited to see the "Arrow for regressions" idea becoming reality @pachadotdev and @lrberge!

pachadotdev commented 9 months ago

@s3alfisc i started here https://github.com/pachadotdev/pyfixest

in the meanwhile, I don't quite understand the get_fe_gnl original function, my version crashes R

original

// [[Rcpp::export]]
List cpp_get_fe_gnl(int Q, int N, NumericVector sumFE, IntegerMatrix dumMat, IntegerVector cluster_sizes, IntegerVector obsCluster){
    // This function returns a list of the cluster coefficients for each cluster
    // dumMat: the matrix of cluster ID for each observation, with cpp index style
    // Q, N: nber of clusters / obs
    // cluster_sizes: vector of the number of cases per cluster
    // obsCluster: the integer vector that is equal to order(dum[[g]])
    // RETURN:
    // a list for each cluster of the cluster coefficient value
    // + the last element is the number of clusters that have been set as references (nb_ref)
    // => much optimized version May 2019

    int iter = 0, iterMax = 10000;
    int iter_loop = 0, iterMax_loop = 10000;

    // Creation of the indices to put all the cluster values into a single vector
    int nb_coef = 0;
    IntegerVector nb_ref(Q); // nb_ref takes the nb of elements set as ref

    for(int q=0 ; q<Q ; q++){
        // the total number of clusters (eg if c1: man/woman and c2: 10 countries: total of 12 cases)
        nb_coef += cluster_sizes(q);
    }

    NumericVector cluster_values(nb_coef);
    IntegerVector cluster_visited(nb_coef); // whether a value has been already assigned

    // index of the cluster
    std::vector<int*> pindex_cluster(Q);
    std::vector<int> index_cluster(nb_coef);

    for(int i=0 ; i<nb_coef ; i++){
        index_cluster[i] = i;
    }

    pindex_cluster[0] = index_cluster.data();
    for(int q=1 ; q<Q ; q++){
        pindex_cluster[q] = pindex_cluster[q - 1] + cluster_sizes(q - 1);
    }

    // Now we create the vector of observations for each cluster
    // we need a strating and an end vector as well
    IntegerVector start_cluster(nb_coef), end_cluster(nb_coef);

    int index;
    int k;

    for(int q=0 ; q<Q ; q++){

        // table cluster: nber of elements for each cluster class
        IntegerVector tableCluster(cluster_sizes(q));
        for(int i=0 ; i<N ; i++){
            k = dumMat(i, q);
            tableCluster(k) += 1; // the number of obs per case
        }

        // now creation of the start/end vectors
        for(int k=0 ; k<cluster_sizes(q) ; k++){
            int *pindex = pindex_cluster[q];
            index = pindex[k];
            // index = start(q) + k;
            if(k == 0){
                start_cluster[index] = 0;
                end_cluster[index] = tableCluster[k];
            } else {
                start_cluster[index] = end_cluster[index - 1];
                end_cluster[index] = end_cluster[index - 1] + tableCluster[k];
            }
        }
    }

    // matrix of the clusters that have been computed
    IntegerMatrix mat_done(N, Q);
    IntegerVector rowsums(N, 0);

    // vector of elements to loop over
    IntegerVector id2do(N);
    IntegerVector id2do_next(N);
    int nb2do = N, nb2do_next = N;
    for(int i=0 ; i<nb2do ; i++){
        id2do(i) = i;
        id2do_next(i) = i;
    }

    // Other indices and variables
    int qui_max, obs;
    int rs, rs_max; // rs: row sum
    int id_cluster;
    double other_value;
    bool first;

    //
    // THE MAIN LOOP
    //

    while(iter < iterMax){
        iter++;

        //
        // Finding the row where to put the 0s
        //

        if(iter == 1){
            // 1st iter, we select the first element
            qui_max = 0;
        } else {
            // we find the row that has the maximum of items done

            qui_max = 0;
            rs_max = 0;
            for(int i=0 ; i<nb2do ; i++){
                obs = id2do[i];

                rs = rowsums[obs];

                if(rs == Q-2){
                    // if rs == Q-2 => its the maximum possible, no need to go further
                    qui_max = obs;
                    break;
                } else if(rs<Q && rs>rs_max){
                    // this is to handle complicated cases with more than 3+ clusters
                    qui_max = obs;
                    rs_max = rs;
                } else if(qui_max == 0 && rs == 0){
                    // used to initialize qui_max
                    qui_max = obs;
                }
            }
        }

        // Rprintf("Obs selected: %i\n", qui_max);

        //
        // Putting the 0s, ie setting the references
        //

        // the first element is spared
        first = true;
        for(int q=0 ; q<Q ; q++){
            if(mat_done(qui_max, q) == 0){
                if(first){
                    // we spare the first element
                    first = false;
                } else {
                    // we set the cluster to 0
                    // 1) we find the cluster
                    id_cluster = dumMat(qui_max, q);
                    // Rprintf("Cluster: %i\n", id_cluster + 1);
                    // 2) we get the index of the cluster vector
                    int *pindex = pindex_cluster[q];
                    index = pindex[id_cluster];
                    // 3) we set the cluster value to 0
                    cluster_values(index) = 0;
                    // 4) we update the mat_done matrix for the elements of this cluster
                    for(int i=start_cluster[index] ; i<end_cluster[index] ; i++){
                        obs = obsCluster(i, q);
                        mat_done(obs, q) = 1;
                        rowsums[obs]++;
                    }
                    // 5) => we save the information on which cluster was set as a reference
                    nb_ref(q)++;
                }
            }
        }

        //
        // LOOP OF ALL OTHER UPDATES (CRITICAL)
        //

        iter_loop = 0;
        while(iter_loop < iterMax_loop){
            iter_loop++;

            // Rprintf("nb2do_next: %i -- nb2do: %i\n", nb2do_next, nb2do);

            R_CheckUserInterrupt();

            //
            // Selection of indexes (new way) to be updated
            //

            // initialisation of the observations to cover (before first loop the two are identical)
            if(iter_loop != 1){
                nb2do = nb2do_next;
                for(int i=0 ; i<nb2do ; i++){
                    id2do[i] = id2do_next[i];
                }
            }

            nb2do_next = 0;

            for(int i=0 ; i<nb2do ; i++){
                // we compute the rowsum of the obs that still needs to be done
                obs = id2do[i];

                rs = rowsums[obs];

                if(rs < Q-1){
                    // you need to check it next time!
                    id2do_next[nb2do_next] = obs;
                    nb2do_next++;
                } else if(rs == Q-1){
                    // means: needs to be updated
                    int q;
                    for(q=0 ; mat_done(obs, q)!=0 ; q++){
                        // selection of the q that is equal to 0
                    }

                    int *pindex = pindex_cluster[q];
                    int index_select = pindex[dumMat(obs, q)];

                    // Finding the value of the cluster coefficient
                    other_value = 0;
                    // Computing the sum of the other cluster values
                    // and finding the cluster to be updated (q_select)
                    for(int l=0 ; l<Q ; l++){
                        // we can loop over all q because cluster_values is initialized to 0
                        index = pindex_cluster[l][dumMat(obs, l)];
                        other_value += cluster_values(index);
                    }

                    // the index to update
                    cluster_values(index_select) = sumFE(obs) - other_value;

                    // Update of the mat_done
                    for(int j=start_cluster[index_select] ; j<end_cluster[index_select] ; j++){
                        obs = obsCluster(j, q);
                        mat_done(obs, q) = 1;
                        rowsums[obs]++;
                    }
                }
            }

            if(nb2do_next == nb2do) break;

        }

        // out of this loop:
        //  - nb2do == nb2do_next
        // - id2do == id2do_next

        // Check that everything is all right
        if(iter_loop == iterMax_loop){
            Rprintf("Problem getting FE, maximum iterations reached (2nd order loop).");
        }

        // if no more obs to be covered
        if(nb2do_next == 0) break;

    }

    if(iter == iterMax){
        Rprintf("Problem getting FE, maximum iterations reached (1st order loop).");
    }

    // final formatting and save
    List res(Q + 1);

    int *pindex;
    for(int q=0 ; q<Q ; q++){
        NumericVector quoi(cluster_sizes(q));
        pindex = pindex_cluster[q];
        for(k=0 ; k<cluster_sizes(q) ; k++){
            index = pindex[k];
            // index = start(q) + k;
            quoi(k) = cluster_values(index);
        }
        res(q) = quoi;
    }

    res(Q) = nb_ref;

    return(res);
}

my version

[[cpp11::register]] list cpp_get_fe_gnl_(
    int Q, int N, writable::doubles sumFE, writable::integers_matrix<> dumMat,
    writable::integers cluster_sizes, writable::integers_matrix<> obsCluster) {
  // This function returns a list of the cluster coefficients for each cluster
  // dumMat: the matrix of cluster ID for each observation, with cpp index style
  // Q, N: nber of clusters / obs
  // cluster_sizes: vector of the number of cases per cluster
  // obsCluster: the integer vector that is equal to order(dum[[g]])
  // RETURN:
  // a list for each cluster of the cluster coefficient value
  // + the last element is the number of clusters that have been set as
  // references (nb_ref)
  // => much optimized version May 2019

  int iter = 0, iterMax = 10000;
  int iter_loop = 0, iterMax_loop = 10000;

  // Creation of the indices to put all the cluster values into a single vector
  int nb_coef = 0;
  std::vector<int> nb_ref(Q);  // nb_ref takes the nb of elements set as ref

  for (int q = 0; q < Q; q++) {
    // the total number of clusters (eg if c1: man/woman and c2: 10 countries:
    // total of 12 cases)
    nb_coef += cluster_sizes(q);
  }

  std::vector<double> cluster_values(nb_coef);
  std::vector<int> cluster_visited(
      nb_coef);  // whether a value has been already assigned

  // index of the cluster
  std::vector<int *> pindex_cluster(Q);
  std::vector<int> index_cluster(nb_coef);

  for (int i = 0; i < nb_coef; i++) {
    index_cluster[i] = i;
  }

  std::vector<int *> pindex_cluster(Q);
  pindex_cluster[0] = index_cluster.data();
  for (int q = 1; q < Q; q++) {
    pindex_cluster[q] = pindex_cluster[q - 1] + cluster_sizes(q - 1);
  }

  // Now we create the vector of observations for each cluster
  // we need a strating and an end vector as well
  std::vector<int> start_cluster(nb_coef), end_cluster(nb_coef);

  int index;
  int k;

  for (int q = 0; q < Q; q++) {
    // table cluster: nber of elements for each cluster class
    // Create a vector of integers called `tableCluster`
    // The size of the vector is determined by the q-th element of
    // the cluster_sizes vector

    std::vector<int> tableCluster(cluster_sizes(q));
    for (int i = 0; i < N; i++) {
      k = dumMat(i, q);
      tableCluster[k] += 1;  // the number of obs per case
    }

    cout << "\n BEGIN tableCluster\n" << endl;
    // print all the elements of the tableCluster
    for (int i = 0; i < cluster_sizes[q]; i++) {
      cout << tableCluster[i] << endl;
    }
    cout << "\n END tableCluster\n" << endl;

    // now creation of the start/end vectors
    for (int k = 0; k < cluster_sizes[q]; k++) {
      int *pindex = pindex_cluster[q];
      index = pindex[k];
      // index = start(q) + k;
      if (k == 0) {
        start_cluster[index] = 0;
        end_cluster[index] = tableCluster[k];
      } else {
        start_cluster[index] = end_cluster[index - 1];
        end_cluster[index] = end_cluster[index - 1] + tableCluster[k];
      }

      cout << start_cluster[index] << "\n"
           << end_cluster[index] << "\n\n"
           << endl;
    }
  }

  // matrix of the clusters that have been computed
  std::vector<std::vector<int>> mat_done(N, std::vector<int>(Q));
  std::vector<int> rowsums(N, 0);

  // vector of elements to loop over
  std::vector<int> id2do(N);
  std::vector<int> id2do_next(N);
  int nb2do = N, nb2do_next = N;
  for (int i = 0; i < nb2do; i++) {
    id2do[i] = i;
    id2do_next[i] = i;
  }

  // Other indices and variables
  int qui_max, obs;
  int rs, rs_max;  // rs: row sum
  int id_cluster;
  double other_value;
  bool first;

  //
  // THE MAIN LOOP
  //

  while (iter < iterMax) {
    iter++;

    // Finding the row where to put the 0s

    if (iter == 1) {
      // 1st iter, we select the first element
      qui_max = 0;
    } else {
      // we find the row that has the maximum of items done

      int qui_max = 0;
      int rs_max = 0;
      for (int i = 0; i < nb2do; i++) {
        int obs = id2do[i];
        int rs = rowsums[obs];

        if (rs == Q - 2) {
          qui_max = obs;
          break;
        } else if (rs < Q && rs > rs_max) {
          qui_max = obs;
          rs_max = rs;
        } else if (qui_max == 0 && rs == 0) {
          qui_max = obs;
        }
      }
    }

    // Putting the 0s, ie setting the references

    // the first element is spared
    bool first = true;
    for (int q = 0; q < Q; q++) {
      if (mat_done[qui_max][q] == 0) {
        if (first) {
          // we spare the first element
          first = false;
        } else {
          // we set the cluster to 0
          // 1) we find the cluster
          id_cluster = dumMat(qui_max, q);
          // Rprintf("Cluster: %i\n", id_cluster + 1);
          // 2) we get the index of the cluster vector
          int *pindex = pindex_cluster[q];
          index = pindex[id_cluster];
          // 3) we set the cluster value to 0
          cluster_values[index] = 0;
          // 4) we update the mat_done matrix for the elements of this cluster
          for (int i = start_cluster[index]; i < end_cluster[index]; i++) {
            obs = obsCluster(i, q);
            mat_done[obs][q] = 1;
            rowsums[obs]++;
          }
          // 5) => we save the information on which cluster was set as a
          // reference
          nb_ref[q]++;
        }
      }
    }

    // LOOP OF ALL OTHER UPDATES (CRITICAL)

    iter_loop = 0;
    while (iter_loop < iterMax_loop) {
      iter_loop++;

      R_CheckUserInterrupt();

      // Selection of indexes (new way) to be updated

      // initialisation of the observations to cover (before first loop the two
      // are identical)
      if (iter_loop != 1) {
        nb2do = nb2do_next;
        for (int i = 0; i < nb2do; i++) {
          id2do[i] = id2do_next[i];
        }
      }

      nb2do_next = 0;

      for (int i = 0; i < nb2do; i++) {
        // we compute the rowsum of the obs that still needs to be done
        obs = id2do[i];

        rs = rowsums[obs];

        if (rs < Q - 1) {
          // you need to check it next time!
          id2do_next[nb2do_next] = obs;
          nb2do_next++;
        } else if (rs == Q - 1) {
          // means: needs to be updated
          int q = 0;
          while (mat_done[obs][q] != 0) {
            q++;
          }

          int *pindex = pindex_cluster[q];
          int index_select = pindex[dumMat[obs][q]];

          // Finding the value of the cluster coefficient
          double other_value = 0;
          // Computing the sum of the other cluster values
          // and finding the cluster to be updated (q_select)
          for (int l = 0; l < Q; l++) {
            // we can loop over all q because cluster_values is initialized to 0
            index = pindex_cluster[l][dumMat[obs][l]];
            other_value += cluster_values[index];
          }

          // the index to update
          cluster_values[index_select] = sumFE[obs] - other_value;

          // Update of the mat_done
          for (int j = start_cluster[index_select];
               j < end_cluster[index_select]; j++) {
            obs = obsCluster(j, q);
            mat_done[obs][q] = 1;
            rowsums[obs]++;
          }
        }
      }

      if (nb2do_next == nb2do) break;
    }

    // out of this loop:
    //  - nb2do == nb2do_next
    // - id2do == id2do_next

    // Check that everything is all right
    if (iter_loop == iterMax_loop) {
      Rprintf(
          "Problem getting FE, maximum iterations reached (2nd order loop).");
    }

    // if no more obs to be covered
    if (nb2do_next == 0) break;
  }

  if (iter == iterMax) {
    Rprintf("Problem getting FE, maximum iterations reached (1st order loop).");
  }

  // final formatting and save
  std::vector<std::vector<double>> res(Q);
  int *pindex;
  for (int q = 0; q < Q; q++) {
    std::vector<double> quoi(cluster_sizes[q]);
    pindex = pindex_cluster[q];
    for (int k = 0; k < cluster_sizes[q]; k++) {
      index = pindex[k];
      quoi[k] = cluster_values[index];
    }
    res[q] = quoi;
  }
  res.push_back(nb_ref);

  return res;
}