Feols: speed up the creation of interacted fixed effects via `fe1^fe2` syntax

leostimpfle commented 3 weeks ago

Vectorizes the creation of interacted fixed effects by using `pd.Series.str.cat` instead of a row-wise `join` inside `pd.DataFrame.apply`.

This PR aims to resolve py-econometrics/pyfixest#470

s3alfisc commented 3 weeks ago

Cool! Thanks @leostimpfle - I'm at the gym at the moment but will take a look once I'm back =)

s3alfisc commented 3 weeks ago

Indeed much faster:

import pyfixest as pf
import time
import pandas as pd
import numpy as np

df = pf.get_data(N = 10_000)
df.head()

fval = "f1^f2+f3"

data = df.copy()

tic = time.time()

for val in fval.split("+"):
    if "^" in val:
        vars = val.split("^")
        data[val.replace("^", "_")] = (
            data[vars[0]]
            .astype(pd.StringDtype())
            .str.cat(
                data[vars[1:]].astype(pd.StringDtype()),
                sep="^",
                # With na_rep=None, a row with a missing value in any of the
                # columns (before concatenation) has a missing value in the result:
                # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html
                na_rep=None,
            )
        )

toc = time.time()
print(toc - tic)
# 0.016783714294433594

data.head()

data2 = df.copy()
tic = time.time()

for val in fval.split("+"):
    if "^" in val:
        vars = val.split("^")
        data2[val.replace("^", "_")] = data2[vars].apply(
            lambda x: (
                "^".join(x.dropna().astype(str)) if x.notna().all() else np.nan
            ),
            axis=1,
        )

toc = time.time()
print(toc - tic)
# 2.3184027671813965

data2.head()

And produces identical results!

The dtype of "f1_f2" changes from object to string (pandas' StringDtype), but this doesn't matter - the fixed effects are converted to int at a later point in the code base, as demean requires integers as inputs.
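
For anyone who wants to check both points on a small example, here is a minimal sketch. The toy frame and its values are made up, and `pd.factorize` only stands in for whatever integer encoding happens later in PyFixest; it is an assumption for illustration, not taken from the code base.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the third row has a missing f1.
toy = pd.DataFrame({"f1": ["a", "b", None, "b"], "f2": [3, 4, 3, 4]})
cols = ["f1", "f2"]

# Old approach: one Python-level join per row.
old = toy[cols].apply(
    lambda x: "^".join(x.astype(str)) if x.notna().all() else np.nan,
    axis=1,
)

# New approach: vectorized concatenation; missing values propagate
# because na_rep is left at None.
new = (
    toy[cols[0]]
    .astype(pd.StringDtype())
    .str.cat(toy[cols[1:]].astype(pd.StringDtype()), sep="^", na_rep=None)
)

# Same missing-value pattern and same labels on the complete rows.
same_missing = old.isna().equals(new.isna())
same_labels = old.dropna().tolist() == new.dropna().tolist()
print(same_missing, same_labels)

# The new column uses pandas' string extension dtype.
print(new.dtype)

# Hypothetical stand-in for the later integer conversion:
# factorize assigns -1 to missing values by default.
codes, uniques = pd.factorize(new)
print(codes.tolist())  # [0, 1, -1, 1]
```

Because the labels are only used as group identifiers, the change of dtype does not affect the integer codes derived from them.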

This can be merged =) Thank you and congrats on your first contribution to PyFixest @leostimpfle! 🎉

s3alfisc commented 3 weeks ago

I will merge after the CI tests pass =)

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Files	Coverage Δ
pyfixest/estimation/model_matrix_fixest_.py	`93.85% <100.00%> (ø)`

... and 29 files with indirect coverage changes

Feols: speed up the creation of interacted fixed effects via `fe1^fe2` syntax #475
