Closed leostimpfle closed 3 weeks ago
Cool! Thanks @leostimpfle - I'm at the gym at the moment but will take a look once I'm back =)
Indeed much faster:
import pyfixest as pf
import time
import pandas as pd
import numpy as np
df = pf.get_data(N = 10_000)
df.head()
fval = "f1^f2+f3"
data = df.copy()
tic = time.time()
for val in fval.split("+"):
    if "^" in val:
        vars = val.split("^")
        data[val.replace("^", "_")] = (
            data[vars[0]]
            .astype(pd.StringDtype())
            .str.cat(
                data[vars[1:]].astype(pd.StringDtype()),
                sep="^",
                # na_rep=None: a row containing a missing value in any of the
                # columns (before concatenation) will have a missing value in the result:
                # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html
                na_rep=None,
            )
        )
toc = time.time()
print(toc - tic)
# 0.016783714294433594
data.head()
data2 = df.copy()
tic = time.time()
for val in fval.split("+"):
    if "^" in val:
        vars = val.split("^")
        data2[val.replace("^", "_")] = data2[vars].apply(
            lambda x: (
                "^".join(x.dropna().astype(str)) if x.notna().all() else np.nan
            ),
            axis=1,
        )
toc = time.time()
print(toc - tic)
# 2.3184027671813965
data2.head()
And produces identical results!
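For anyone who wants to check the equivalence without installing PyFixest, here is a minimal sketch on a made-up toy frame (the `toy` data and the `same_missing` / `same_values` names are purely illustrative and not part of the PR). It compares which rows come out missing and whether the non-missing labels agree:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the output of pf.get_data): each column has one missing value.
toy = pd.DataFrame(
    {
        "f1": [1.0, 2.0, np.nan, 4.0],
        "f2": ["a", None, "c", "d"],
    }
)

# Vectorized variant: cast to the nullable string dtype, then concatenate column-wise.
# With na_rep=None, a missing value in any input column makes the result missing.
vectorized = (
    toy["f1"]
    .astype(pd.StringDtype())
    .str.cat(toy[["f2"]].astype(pd.StringDtype()), sep="^", na_rep=None)
)

# Row-wise variant: join the values of each row, or return NaN if any value is missing.
row_wise = toy[["f1", "f2"]].apply(
    lambda x: "^".join(x.dropna().astype(str)) if x.notna().all() else np.nan,
    axis=1,
)

# Same rows are missing in both results ...
same_missing = vectorized.isna().tolist() == row_wise.isna().tolist()

# ... and the non-missing labels are identical.
mask = vectorized.notna()
same_values = vectorized[mask].tolist() == row_wise[mask].tolist()

print(same_missing, same_values)
```

The two results differ only in dtype and in the missing-value marker (`pd.NA` versus `np.nan`), which is why the comparison is done on the missing-value mask and on the non-missing labels separately.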
The data type of "f1_f2" changes from object to str, but this doesn't matter - the fixed effects are converted to int at a later point in the code base, as `demean` requires integers as inputs.
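To illustrate why the dtype change should be harmless, here is a small sketch that integer-encodes the same labels once as a nullable string column and once as an object column. It uses `pd.factorize` as a stand-in; this is an assumption for illustration and not necessarily how PyFixest performs the conversion internally. The `fe_string` and `fe_object` series are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical interacted fixed effect, stored with the nullable string dtype ...
fe_string = pd.Series(["1.0^a", "2.0^b", "1.0^a", pd.NA], dtype=pd.StringDtype())
# ... and the same labels stored as plain Python objects.
fe_object = pd.Series(["1.0^a", "2.0^b", "1.0^a", np.nan], dtype=object)

# pd.factorize assigns one integer code per distinct label;
# missing values receive the sentinel code -1 by default.
codes_string, uniques_string = pd.factorize(fe_string)
codes_object, uniques_object = pd.factorize(fe_object)

print(codes_string)  # [ 0  1  0 -1]
print(codes_object)  # [ 0  1  0 -1]
```

Both inputs yield the same integer codes, so a downstream step that only consumes the codes does not depend on whether the labels were object or string typed.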
This can be merged =) Thank you and congrats on your first contribution to PyFixest @leostimpfle! 🎉
I will merge after the CI tests pass =)
All modified and coverable lines are covered by tests :white_check_mark:
Files | Coverage Δ | |
---|---|---|
pyfixest/estimation/model_matrix_fixest_.py | 93.85% <100.00%> (ø) |
Vectorizes the creation of interacted fixed effects by using `pd.Series.str.cat` instead of the row-wise `join` in `pd.Series.apply`.
This PR aims to resolve py-econometrics/pyfixest#470