Open pachadotdev opened 1 year ago
Hi @lrberge! Did you see this excellent idea in Python? I have nothing to do with it, and I want to see it completed to compare my results in R against a Python version.
Hi @pachadotdev - thanks for pointing me to your fixest2
project - it looks pretty neat! I think we have talked about it at some point in the ropensci slack? I am surprised I did not star it back in the day, to be honest =)
Eventually I would love to move the heavy lifting - mostly the alternating projections algorithm - to a lower lever back end. So please keep me in the loop on your progress!
I have set up some tests against fixest
via rpy2
and run them via github CI - you can find the tests here and the CI yaml here. Maybe this will be useful for you?
thanks! https://github.com/s3alfisc/pyfixest/blob/master/tests/test_vs_fixest.py is very useful indeed! I'll keep you posted, and eventually ppl should be able to call fixest2 from Python without calling R
also, I think so about the ropensci slack, I do not connect so often
Regarding the python port, IMO the backend for the demeaning would be the best way to go and the effort wouldn't be that important since everything to be dealt with lies in a single function, cpp_demean.
That function takes in R objects and converts everything into a C++ structure (in particular via the sMat and sVec classes and their initialization which can easily be adapted). The demeaning algorithm is full C++ and then is portable.
Only the very first input (transforming/adapting python objects into C++ structures) and the output need to be taken care of. It looks doable but requires someone knowledgeable in python data structures.
One word of caution though: the API needs to be constructed carefully (currently the input to cpp_demean may be too flexible / what is the best data structure in output?) and this may need some back and forth.
Hi @lrberge - thanks for chiming in here =) Great that the demeaning actually "lives" in only one function. That should definitely help bring the task within my range of skill. Though I am definitely not an expert on Python /c++ data types and also still have to get up to speed with how to integrate c++ into python (the most popular options seem to be either cython or pybind). I'll ping both of you once I have played around with both frameworks and your c++ codes, I am certain that I will need some help =)
@s3alfisc I think that "copying" from Apache Arrow would be great. They wrote the main library in C++ and then added Python, R, JavaScript etc layers to use Arrow in other languages. The good news is that a big part of fixest is C++ but in Rcpp flavour. Let's say, fixest is vanilla yogurt, but we need unflavored yogurt to call it from Python, if we "cook" it a bit, we can make it "taste" different. What I made with cpp11 is "low calorie vanilla", but still needs additional processing. i shall write on ropensci about this idea.
Great metaphors @pachadotdev! 😄
also @s3alfisc the cpp11 version (https://github.com/pachadotdev/fixest2/blob/cpp11_wip/R/cpp11_default_parameters.R) has functions()
and functions_()
, which is an initial step to think about separating the C++ backend from the frontend in R/Python/Java/etc. When you call fun_a()
then that function calls fun_a_()
and passes default params to it.
Hi @pachadotdev and @lrberge - I have just finished implementing a python clone of fixest's demeaning function. I use numba, which is JIT compiled and promises 'c like performance with python syntax'. It's a very naive implementation of the MAP algorithm and shockingly (but maybe unsurprisingly ?) around 10 - 20 times slower than fixest. You can find the implementation here.
I have tried to look at the source code of cpp_demean
to understand if fixest uses any tricks to squeeze out better performance, but did not get any smarter (turns out I am not very good at reading c++ code). I have a few questions on the algo, and if you have the time, it would be great if you could answer them!
fixest::demean()
has a reorder
argument and I don't really understand the docs - what does the argument do exactly? Thanks! =)
As a sidepoint, my main takeaway is that I should probably just take the bullet and read up on integrating c++ code into a python package =D
This is amazing!
Fixest uses OMP and does what you said.
Regarding the C++ integration, let me get a 100% working cpp11 version (unofficial), and we can connect it directly with Python.
Best,
Get Outlook for Androidhttps://aka.ms/AAb9ysg
From: Alexander Fischer @.> Sent: Sunday, June 4, 2023 7:56:45 AM To: s3alfisc/pyfixest @.> Cc: Mauricio Andres Vargas Sepulveda @.>; Mention @.> Subject: Re: [s3alfisc/pyfixest] C++ 11 backend via fixest/fixest2 API (Issue #66)
As a sidepoint, my main takeaway is that I should probably just take the bullet and read up on integrating c++ code into a python package =D
— Reply to this email directly, view it on GitHubhttps://github.com/s3alfisc/pyfixest/issues/66#issuecomment-1575539971, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACM7UOP2ISDEZHHRPM4MUCDXJRZX3ANCNFSM6AAAAAAXRPRYDI. You are receiving this because you were mentioned.Message ID: @.***>
Just spotted this, and super excited to see the "Arrow for regressions" idea becoming reality @pachadotdev and @lrberge!
@s3alfisc i started here https://github.com/pachadotdev/pyfixest
in the meanwhile, I don't quite understand the get_fe_gnl original function, my version crashes R
original
// [[Rcpp::export]]
List cpp_get_fe_gnl(int Q, int N, NumericVector sumFE, IntegerMatrix dumMat, IntegerVector cluster_sizes, IntegerVector obsCluster){
// This function returns a list of the cluster coefficients for each cluster
// dumMat: the matrix of cluster ID for each observation, with cpp index style
// Q, N: nber of clusters / obs
// cluster_sizes: vector of the number of cases per cluster
// obsCluster: the integer vector that is equal to order(dum[[g]])
// RETURN:
// a list for each cluster of the cluster coefficient value
// + the last element is the number of clusters that have been set as references (nb_ref)
// => much optimized version May 2019
int iter = 0, iterMax = 10000;
int iter_loop = 0, iterMax_loop = 10000;
// Creation of the indices to put all the cluster values into a single vector
int nb_coef = 0;
IntegerVector nb_ref(Q); // nb_ref takes the nb of elements set as ref
for(int q=0 ; q<Q ; q++){
// the total number of clusters (eg if c1: man/woman and c2: 10 countries: total of 12 cases)
nb_coef += cluster_sizes(q);
}
NumericVector cluster_values(nb_coef);
IntegerVector cluster_visited(nb_coef); // whether a value has been already assigned
// index of the cluster
std::vector<int*> pindex_cluster(Q);
std::vector<int> index_cluster(nb_coef);
for(int i=0 ; i<nb_coef ; i++){
index_cluster[i] = i;
}
pindex_cluster[0] = index_cluster.data();
for(int q=1 ; q<Q ; q++){
pindex_cluster[q] = pindex_cluster[q - 1] + cluster_sizes(q - 1);
}
// Now we create the vector of observations for each cluster
// we need a strating and an end vector as well
IntegerVector start_cluster(nb_coef), end_cluster(nb_coef);
int index;
int k;
for(int q=0 ; q<Q ; q++){
// table cluster: nber of elements for each cluster class
IntegerVector tableCluster(cluster_sizes(q));
for(int i=0 ; i<N ; i++){
k = dumMat(i, q);
tableCluster(k) += 1; // the number of obs per case
}
// now creation of the start/end vectors
for(int k=0 ; k<cluster_sizes(q) ; k++){
int *pindex = pindex_cluster[q];
index = pindex[k];
// index = start(q) + k;
if(k == 0){
start_cluster[index] = 0;
end_cluster[index] = tableCluster[k];
} else {
start_cluster[index] = end_cluster[index - 1];
end_cluster[index] = end_cluster[index - 1] + tableCluster[k];
}
}
}
// matrix of the clusters that have been computed
IntegerMatrix mat_done(N, Q);
IntegerVector rowsums(N, 0);
// vector of elements to loop over
IntegerVector id2do(N);
IntegerVector id2do_next(N);
int nb2do = N, nb2do_next = N;
for(int i=0 ; i<nb2do ; i++){
id2do(i) = i;
id2do_next(i) = i;
}
// Other indices and variables
int qui_max, obs;
int rs, rs_max; // rs: row sum
int id_cluster;
double other_value;
bool first;
//
// THE MAIN LOOP
//
while(iter < iterMax){
iter++;
//
// Finding the row where to put the 0s
//
if(iter == 1){
// 1st iter, we select the first element
qui_max = 0;
} else {
// we find the row that has the maximum of items done
qui_max = 0;
rs_max = 0;
for(int i=0 ; i<nb2do ; i++){
obs = id2do[i];
rs = rowsums[obs];
if(rs == Q-2){
// if rs == Q-2 => its the maximum possible, no need to go further
qui_max = obs;
break;
} else if(rs<Q && rs>rs_max){
// this is to handle complicated cases with more than 3+ clusters
qui_max = obs;
rs_max = rs;
} else if(qui_max == 0 && rs == 0){
// used to initialize qui_max
qui_max = obs;
}
}
}
// Rprintf("Obs selected: %i\n", qui_max);
//
// Putting the 0s, ie setting the references
//
// the first element is spared
first = true;
for(int q=0 ; q<Q ; q++){
if(mat_done(qui_max, q) == 0){
if(first){
// we spare the first element
first = false;
} else {
// we set the cluster to 0
// 1) we find the cluster
id_cluster = dumMat(qui_max, q);
// Rprintf("Cluster: %i\n", id_cluster + 1);
// 2) we get the index of the cluster vector
int *pindex = pindex_cluster[q];
index = pindex[id_cluster];
// 3) we set the cluster value to 0
cluster_values(index) = 0;
// 4) we update the mat_done matrix for the elements of this cluster
for(int i=start_cluster[index] ; i<end_cluster[index] ; i++){
obs = obsCluster(i, q);
mat_done(obs, q) = 1;
rowsums[obs]++;
}
// 5) => we save the information on which cluster was set as a reference
nb_ref(q)++;
}
}
}
//
// LOOP OF ALL OTHER UPDATES (CRITICAL)
//
iter_loop = 0;
while(iter_loop < iterMax_loop){
iter_loop++;
// Rprintf("nb2do_next: %i -- nb2do: %i\n", nb2do_next, nb2do);
R_CheckUserInterrupt();
//
// Selection of indexes (new way) to be updated
//
// initialisation of the observations to cover (before first loop the two are identical)
if(iter_loop != 1){
nb2do = nb2do_next;
for(int i=0 ; i<nb2do ; i++){
id2do[i] = id2do_next[i];
}
}
nb2do_next = 0;
for(int i=0 ; i<nb2do ; i++){
// we compute the rowsum of the obs that still needs to be done
obs = id2do[i];
rs = rowsums[obs];
if(rs < Q-1){
// you need to check it next time!
id2do_next[nb2do_next] = obs;
nb2do_next++;
} else if(rs == Q-1){
// means: needs to be updated
int q;
for(q=0 ; mat_done(obs, q)!=0 ; q++){
// selection of the q that is equal to 0
}
int *pindex = pindex_cluster[q];
int index_select = pindex[dumMat(obs, q)];
// Finding the value of the cluster coefficient
other_value = 0;
// Computing the sum of the other cluster values
// and finding the cluster to be updated (q_select)
for(int l=0 ; l<Q ; l++){
// we can loop over all q because cluster_values is initialized to 0
index = pindex_cluster[l][dumMat(obs, l)];
other_value += cluster_values(index);
}
// the index to update
cluster_values(index_select) = sumFE(obs) - other_value;
// Update of the mat_done
for(int j=start_cluster[index_select] ; j<end_cluster[index_select] ; j++){
obs = obsCluster(j, q);
mat_done(obs, q) = 1;
rowsums[obs]++;
}
}
}
if(nb2do_next == nb2do) break;
}
// out of this loop:
// - nb2do == nb2do_next
// - id2do == id2do_next
// Check that everything is all right
if(iter_loop == iterMax_loop){
Rprintf("Problem getting FE, maximum iterations reached (2nd order loop).");
}
// if no more obs to be covered
if(nb2do_next == 0) break;
}
if(iter == iterMax){
Rprintf("Problem getting FE, maximum iterations reached (1st order loop).");
}
// final formatting and save
List res(Q + 1);
int *pindex;
for(int q=0 ; q<Q ; q++){
NumericVector quoi(cluster_sizes(q));
pindex = pindex_cluster[q];
for(k=0 ; k<cluster_sizes(q) ; k++){
index = pindex[k];
// index = start(q) + k;
quoi(k) = cluster_values(index);
}
res(q) = quoi;
}
res(Q) = nb_ref;
return(res);
}
my version
[[cpp11::register]] list cpp_get_fe_gnl_(
int Q, int N, writable::doubles sumFE, writable::integers_matrix<> dumMat,
writable::integers cluster_sizes, writable::integers_matrix<> obsCluster) {
// This function returns a list of the cluster coefficients for each cluster
// dumMat: the matrix of cluster ID for each observation, with cpp index style
// Q, N: nber of clusters / obs
// cluster_sizes: vector of the number of cases per cluster
// obsCluster: the integer vector that is equal to order(dum[[g]])
// RETURN:
// a list for each cluster of the cluster coefficient value
// + the last element is the number of clusters that have been set as
// references (nb_ref)
// => much optimized version May 2019
int iter = 0, iterMax = 10000;
int iter_loop = 0, iterMax_loop = 10000;
// Creation of the indices to put all the cluster values into a single vector
int nb_coef = 0;
std::vector<int> nb_ref(Q); // nb_ref takes the nb of elements set as ref
for (int q = 0; q < Q; q++) {
// the total number of clusters (eg if c1: man/woman and c2: 10 countries:
// total of 12 cases)
nb_coef += cluster_sizes(q);
}
std::vector<double> cluster_values(nb_coef);
std::vector<int> cluster_visited(
nb_coef); // whether a value has been already assigned
// index of the cluster
std::vector<int *> pindex_cluster(Q);
std::vector<int> index_cluster(nb_coef);
for (int i = 0; i < nb_coef; i++) {
index_cluster[i] = i;
}
std::vector<int *> pindex_cluster(Q);
pindex_cluster[0] = index_cluster.data();
for (int q = 1; q < Q; q++) {
pindex_cluster[q] = pindex_cluster[q - 1] + cluster_sizes(q - 1);
}
// Now we create the vector of observations for each cluster
// we need a strating and an end vector as well
std::vector<int> start_cluster(nb_coef), end_cluster(nb_coef);
int index;
int k;
for (int q = 0; q < Q; q++) {
// table cluster: nber of elements for each cluster class
// Create a vector of integers called `tableCluster`
// The size of the vector is determined by the q-th element of
// the cluster_sizes vector
std::vector<int> tableCluster(cluster_sizes(q));
for (int i = 0; i < N; i++) {
k = dumMat(i, q);
tableCluster[k] += 1; // the number of obs per case
}
cout << "\n BEGIN tableCluster\n" << endl;
// print all the elements of the tableCluster
for (int i = 0; i < cluster_sizes[q]; i++) {
cout << tableCluster[i] << endl;
}
cout << "\n END tableCluster\n" << endl;
// now creation of the start/end vectors
for (int k = 0; k < cluster_sizes[q]; k++) {
int *pindex = pindex_cluster[q];
index = pindex[k];
// index = start(q) + k;
if (k == 0) {
start_cluster[index] = 0;
end_cluster[index] = tableCluster[k];
} else {
start_cluster[index] = end_cluster[index - 1];
end_cluster[index] = end_cluster[index - 1] + tableCluster[k];
}
cout << start_cluster[index] << "\n"
<< end_cluster[index] << "\n\n"
<< endl;
}
}
// matrix of the clusters that have been computed
std::vector<std::vector<int>> mat_done(N, std::vector<int>(Q));
std::vector<int> rowsums(N, 0);
// vector of elements to loop over
std::vector<int> id2do(N);
std::vector<int> id2do_next(N);
int nb2do = N, nb2do_next = N;
for (int i = 0; i < nb2do; i++) {
id2do[i] = i;
id2do_next[i] = i;
}
// Other indices and variables
int qui_max, obs;
int rs, rs_max; // rs: row sum
int id_cluster;
double other_value;
bool first;
//
// THE MAIN LOOP
//
while (iter < iterMax) {
iter++;
// Finding the row where to put the 0s
if (iter == 1) {
// 1st iter, we select the first element
qui_max = 0;
} else {
// we find the row that has the maximum of items done
int qui_max = 0;
int rs_max = 0;
for (int i = 0; i < nb2do; i++) {
int obs = id2do[i];
int rs = rowsums[obs];
if (rs == Q - 2) {
qui_max = obs;
break;
} else if (rs < Q && rs > rs_max) {
qui_max = obs;
rs_max = rs;
} else if (qui_max == 0 && rs == 0) {
qui_max = obs;
}
}
}
// Putting the 0s, ie setting the references
// the first element is spared
bool first = true;
for (int q = 0; q < Q; q++) {
if (mat_done[qui_max][q] == 0) {
if (first) {
// we spare the first element
first = false;
} else {
// we set the cluster to 0
// 1) we find the cluster
id_cluster = dumMat(qui_max, q);
// Rprintf("Cluster: %i\n", id_cluster + 1);
// 2) we get the index of the cluster vector
int *pindex = pindex_cluster[q];
index = pindex[id_cluster];
// 3) we set the cluster value to 0
cluster_values[index] = 0;
// 4) we update the mat_done matrix for the elements of this cluster
for (int i = start_cluster[index]; i < end_cluster[index]; i++) {
obs = obsCluster(i, q);
mat_done[obs][q] = 1;
rowsums[obs]++;
}
// 5) => we save the information on which cluster was set as a
// reference
nb_ref[q]++;
}
}
}
// LOOP OF ALL OTHER UPDATES (CRITICAL)
iter_loop = 0;
while (iter_loop < iterMax_loop) {
iter_loop++;
R_CheckUserInterrupt();
// Selection of indexes (new way) to be updated
// initialisation of the observations to cover (before first loop the two
// are identical)
if (iter_loop != 1) {
nb2do = nb2do_next;
for (int i = 0; i < nb2do; i++) {
id2do[i] = id2do_next[i];
}
}
nb2do_next = 0;
for (int i = 0; i < nb2do; i++) {
// we compute the rowsum of the obs that still needs to be done
obs = id2do[i];
rs = rowsums[obs];
if (rs < Q - 1) {
// you need to check it next time!
id2do_next[nb2do_next] = obs;
nb2do_next++;
} else if (rs == Q - 1) {
// means: needs to be updated
int q = 0;
while (mat_done[obs][q] != 0) {
q++;
}
int *pindex = pindex_cluster[q];
int index_select = pindex[dumMat[obs][q]];
// Finding the value of the cluster coefficient
double other_value = 0;
// Computing the sum of the other cluster values
// and finding the cluster to be updated (q_select)
for (int l = 0; l < Q; l++) {
// we can loop over all q because cluster_values is initialized to 0
index = pindex_cluster[l][dumMat[obs][l]];
other_value += cluster_values[index];
}
// the index to update
cluster_values[index_select] = sumFE[obs] - other_value;
// Update of the mat_done
for (int j = start_cluster[index_select];
j < end_cluster[index_select]; j++) {
obs = obsCluster(j, q);
mat_done[obs][q] = 1;
rowsums[obs]++;
}
}
}
if (nb2do_next == nb2do) break;
}
// out of this loop:
// - nb2do == nb2do_next
// - id2do == id2do_next
// Check that everything is all right
if (iter_loop == iterMax_loop) {
Rprintf(
"Problem getting FE, maximum iterations reached (2nd order loop).");
}
// if no more obs to be covered
if (nb2do_next == 0) break;
}
if (iter == iterMax) {
Rprintf("Problem getting FE, maximum iterations reached (1st order loop).");
}
// final formatting and save
std::vector<std::vector<double>> res(Q);
int *pindex;
for (int q = 0; q < Q; q++) {
std::vector<double> quoi(cluster_sizes[q]);
pindex = pindex_cluster[q];
for (int k = 0; k < cluster_sizes[q]; k++) {
index = pindex[k];
quoi[k] = cluster_values[index];
}
res[q] = quoi;
}
res.push_back(nb_ref);
return res;
}
Hi! I have a work in progress that I am writing in a more "pure" C++ grammar: https://github.com/pachadotdev/fixest2/tree/cpp11_wip My idea is to eventually create something like Apache Arrow, where the core functionality is independent from R, and it provides the same set of function on R, Python, JavaScript, etc. In this case, my goal will be covering R and Python. One of the problems that I have now are unassigned memory and memory leaks, and I need to debug those. At the moment, simple OLS works well and it was rewritten in a header-only C++ backend with the idea of also using vendoring at some point.