Open Wenzhi-Ding opened 6 months ago
Hi Dave, this is definitely a very interesting suggestion! fixest
allows for macros via xpd() but I have to admit that I never really used it. It's also a very advanced functionality, so it might be ok to deviate from fixest
syntax.
More concretely, I really like this syntax
pf.feols("y~x*)
but we'd have to be careful, as *
is a special symbol with Wilkinson formulas and if used slightly different, will produce an interaction term, i.e. y ~ x1*x2
evaluates to y ~ x1 + x2 + x1:x2
.
More generally it would be amazing if you were to take a look at formula parsing via FormulaParser
, which in my opinion it is the least clean part of the code base and will eventually require a major overwhaul π
Thanks for pointing out the Wilkinson formula problem! I thought for a while and maybe there are two alternatives:
Solution 1: Provide a helper function to resemble variable list
Xs = pf.varlist("X*") # Xs = "X1 + X2"
pf.feols(f"y ~ {Xs} + f1", data=df)
Solution 2: Parse the varlist
function in formula
pf.feols("y ~ vlist(X*) + f1", data=df)
Then we can replacevlist(X*)
by "X1 + X2"
first, and then pass the formula to formulaic
. In this way, the formulaic
always receive un-confusing strings.
Solution 2 looks easier for users to learn since it does not require users to have knowledge of f-string. If using solution 2, what name should this function called?
v()
or V()
: Easiest to type, but may not be easy to understand.vlist()
list()
varlist()
How do you think Alex?
Hi Dave, I like both solutions!
I don't think that syntax as Y ~ X*
is a problem per se, it might just work as well. Though I have a slight preference for either the helper method or direct formula parsing. I just went through the formulaic
docs (the library that pyfixest
uses to create the design matrices) and it looks like there might be a way to add custom operators. But for the beginning, it probably makes a lot of sense to keep the parsing within pyfixest
. Though I will open an issue in formulaic
to maybe get some guidelines on how one would add a custom parser.
For the immediate future, I'd be happy to accept any of the three solutions you have proposed!
Unsolicited feedback: Personally, I strongly recommend against adopting any Stata-specific syntax above (R) fixest's native API if you can avoid it.
It's obvs perfectly fine to write a wrapper package that does this for users that want it. (Or implement it as bonus functionality on top of the mimicked API). But similar to my comment here, I think you do a disservice to both packages, and the overall goal of this project, if the API begins to drift away from the upstream R fixest package.
P.S. Ideally, a Stata user could simply consult https://stata2r.github.io/fixest/ and make the direct translation to PyFixest. That's the beauty of having a consistent API across both packages ;-)
Hi @grantmcdermott, thank you so much for pointing out the https://stata2r.github.io/fixest/ website! It's really informative. So fixest
has already supported such syntax
ctrls = c("age", "black", "hisp", "marr")
feols(wage ~ educ + .[ctrls], dat)
feols(wage ~ educ + ..('^x'), dat) # ^ = starts with
feols(wage ~ educ + ..('sp$'), dat) # $ = ends with
feols(wage ~ educ + ..('ac'), data)
I try in pyfixest
, but such syntax is not supported.
import pyfixest as pf
df = pf.get_data()
Xs = ['X1', 'X2']
fit = pf.feols("Y ~ .[Xs] + f2", data=df) # FormulaSyntaxError
fit = pf.feols("Y ~ ('^X') + f2", data=df) # X1 and X2 not included
Do you think mimicking this syntax into pyfixest
is a good idea? Given fixest
already has such syntax, I wouldn't invent a new one anyway.
Do you think mimicking this syntax into pyfixest is a good idea?
Yes, I think so @Wenzhi-Ding . It may be a bit trickier to match the exact syntax under Pythonic constraints. But I think you can get close enough that the translation is clear.
I made a full review for both fixest
and Stata formula syntax. To be honest, some fixest
/Stata syntaxes do not follow a consistent design language.
Overall, I think syntax like sw(x1, x2)
(stepwise), i(treat)
(categorical), and f(wage, 1)
(1-period lead) are good designs, that can be easily understood. Syntax like .[ctrls]
(variable list pre-defined), ..('ac')
(variables starts with), /
, :
, and statefips[year]
are bad designs, which are not straightforward to be understood and remember.
My overall feeling is that we need to follow existing prevalent packages (to mitigate switching costs), but it is also worthwhile to provide alternatives that slightly deviate from the dependent path to reduce onboarding costs (for new users to comprehend and remember syntax quickly). And of course, these two ways can co-exist. R users can use fixest
syntax, while new users may refer to a more readable way.
In this sense, I would like to propose a consistent "function in the formula" design pattern to make pyfixest
formula more readable. The idea is that all special syntax within the formula should be used as a function with a meaningful name. And we should introduce as less symbols as possible. (Now we have ~
, +
, |
for basic syntax and ^
, *
, :
, /
for interaction. I think that is already pretty much to be remembered)
For macros (referencing list outside formula):
controls = ["X1", "X2"]
pf.feols("Y ~ f2 + list(controls)", data=df)
pf.feols("Y ~ f2 + .[controls]", data=df) # fixest's syntax supported
For wildcards (matching variable names by pattern):
pf.feols("Y ~ f2 + vars(X*)", data=df) # support alias: v(X*)
pf.feols("Y ~ f2 + ..('ac')", data=df) # fixest's syntax supported
For lead, lag, difference, and logarithm difference in panel data:
pf.feols("Y ~ f2 + lead(X,1)", data=df, id_var="id", time_var="time") # support alias and fixest's syntax: f(X,1)
pf.feols("Y ~ f2 + lag(X,1)", data=df, id_var="id", time_var="time") # support alias and fixest's syntax: l(X,1)
pf.feols("Y ~ f2 + diff(X,1)", data=df, id_var="id", time_var="time") # support alias d(X,1) or fixest's syntax: d(X)
pf.feols("Y ~ f2 + logdiff(X,1)", data=df, id_var="id", time_var="time") # support alias: logd(X,1)
Categorical with specified baseline (this is a good design from fixest
)
pf.feols("Y ~ f2 + i(X,ref=1)", data=df)
Categorical/continuous: by default, the covariate part are continuous variables, and fixed effects part are categorical variables. But users can also specify variable type by function.
pf.feols("Y ~ c(f2) + i(X1) | group^c(year)", data=df) # to control time trend for each group
Please let me know your opinion. But anyway I wouldn't try to implement all of them at once. And I will guarantee backward compatibility for any new syntax introduced, i.e., no existing script will be broken.
Here is a pyfixest
, fixest
, and Stata syntax comparison:
Reference:
Please correct me if anything wrong. Maybe we can add a webpage like this to smooth R and Stata user's switching.
Item | pyfixest |
fixest |
Stata | Note |
---|---|---|---|---|
Simple regression | pf.feols("wage~educ + age", data=dat) |
feols(wage ~ educ + age, data = dat) |
reg wage educ age |
β |
Categorical variables | pf.feols("wage~educ + i(treat)", dat) |
feols(wage ~ educ + i(treat), dat) |
reg wage educ i.treat |
β |
Categorical variables (specify baseline) | pf.feols("wage~educ + i(treat)", dat, i_ref1=1) |
feols(wage ~ educ + i(treat, ref = 1), dat) |
reg wage educ ib1.treat |
β |
Fixed effects | pf.feols("wage ~ educ \| countyfips", data=df, vcov={"CRV1": "countyfips"}) |
feols(wage ~ educ \| countyfips, dat, vcov = ~countyfips) |
reghdfe wage educ, absorb(countyfips) cluster(countyfips) |
β |
Weights | pf.feols("wage~educ + i(treat)", dat, weights=w) |
feols(wage ~ educ + i(treat), dat, weights = ~w) |
reg wage educ i.treat [aw = w] |
β |
Instrumental variables | pf.feols("wage ~ marr \| countyfips \| educ ~ age", data=dat) |
feols(wage ~ marr \| countyfips \| educ ~ age, dat) |
ivreghdfe 2sls wage marr (educ = age), absorb(countyfips) |
β |
Nonlinear models (Logit) | feglm(marr ~ age + black + hisp, dat, family = 'logit') |
logit marr age black hisp |
β | |
Nonlinear models (Poisson) | pf.fepois("educ ~ age + black + hisp \| statefips + year", dat) |
fepois(educ ~ age + black + hisp \| statefips + year, dat) |
ppmlhdfe educ age black hisp, absorb(statefips year) vce(cluster statefips) |
β |
Macros | ctrls = c("age", "black", "hisp", "marr") |
local ctrls age black hisp marr |
β | |
feols(wage ~ educ + .[ctrls], dat) |
reg wage educ ``ctrls' |
|||
Wildcards (starts with) | feols(wage ~ educ + ..('^x'), dat) |
reg wage educ x* |
β | |
Wildcards (ends with) | feols(wage ~ educ + ..('sp$'), dat) |
reg wage educ *sp |
β | |
Wildcards (contains) | feols(wage ~ educ + ..('ac'), dat) |
reg wage educ *ac* |
β | |
Split sample | feols(wage ~ educ \| countyfips, dat, split = ~hisp) |
β | ||
Multiple dependent variables | pf.feols("wage + educ ~ age + marr \| countyfips + year", dat) |
feols(c(wage, educ) ~ age + marr \| countyfips + year, dat) |
β | |
Stepwise regression | pf.feols("wage ~ educ + sw(age, marr)", dat) |
feols(wage ~ educ + sw(age, marr), dat) |
β | |
Stepwise regression (cumulative) | pf.feols("wage ~ educ + csw(age, marr)", dat) |
feols(wage ~ educ + csw(age, marr), dat) |
β | |
Stepwise regression (cumulative FE) | pf.feols("wage ~ educ \| csw(year, statefips)", dat) |
feols(wage ~ educ \| csw(year, statefips), dat) |
β |
Item | pyfixest |
fixest |
Stata | Note |
---|---|---|---|---|
Interact continuous variables (single) | feols(wage ~ educ:age, dat) |
reg wage c.educ#c.age |
β | |
Interact continuous variables (full) | pf.feols("wage ~ educ*age", dat) |
feols(wage ~ educ*age, dat) |
reg wage c.educ##c.age |
β |
Interact continuous variables (polynomials single) | feols(wage ~ I(age^2), dat) |
reg wage c.age#c.age |
β | |
Interact continuous variables (full) | feols(wage ~ poly(age, 2, raw = TRUE)) |
reg wage c.age##c.age |
β | |
Interact categorical variables | feols(wage ~ i(treat, i.hisp), dat) |
reg wage i.treat#i.hisp |
β | |
Interact categorical variables | feols(wage ~ factor(treat)/factor(hisp), dat) |
reg wage i.treat i.treat#i.hisp |
β | |
Interact categorical variables | feols(wage ~ factor(treat)*factor(hisp), dat) |
reg wage i.treat##i.hisp |
β | |
Interact categorical with continuous variables | pf.feols("wage ~ I(educ,age)", dat) |
feols(wage ~ i(treat, age), dat) |
reg wage i.treat#c.age |
β |
Interact categorical with continuous variables | feols(wage ~ factor(treat):age, dat) |
reg wage i.treat#c.age |
β | |
Interact categorical with continuous variables | feols(wage ~ factor(treat)/age, dat) |
reg wage i.treat i.treat#c.age |
β | |
Interact categorical with continuous variables | pf.feols("wage ~ I(educ,age)", dat) |
feols(wage ~ factor(treat)*age, dat) |
reg wage i.treat##c.age |
β |
Interact fixed effects | pf.feols("wage ~ educ \| statefips^year", dat) |
feols(wage ~ educ \| statefips^year, dat) |
reghdfe wage educ, absorb(statefips#year) |
β |
Interact fixed effects (Varying slopes) | feols(wage ~ educ \| statefips[year], dat) |
reghdfe wage educ, absorb(statefips#c.year) vce(cluster statefips#c.year) |
β |
Item | pyfixest |
fixest |
Stata | Note |
---|---|---|---|---|
HC | pf.feols("wage ~ educ", dat, vcov = 'HC1') |
feols(wage ~ educ, dat, vcov = 'hc1') |
reg wage educ, vce(robust) |
β |
HC | pf.feols("wage ~ educ", dat, vcov = 'HC3') |
feols(wage ~ educ, dat, vcov = sandwich::vcovHC) |
reg wage educ, vce(hc3) |
β |
HAC | feols(y ~ x, dat, vcov = 'NW', panel.id = ~unit + time) |
ivreghdfe wage educ, bw(auto) vce(robust) |
β | |
Clustered (auto clustered by FE) | pf.feols("wage ~ educ \| countyfips", dat) |
feols(wage ~ educ \| countyfips, dat) |
reghdfe wage educ, absorb(countyfips) vce(cluster countyfips) |
β |
Clustered (twoway clustering) | pf.feols("wage ~ educ \| countyfips + year", dat, vcov={"CRV1": "countyfips + year"}) |
feols(wage ~ educ \| countyfips + year, dat, vcov = ~countyfips + year) |
reghdfe wage educ, absorb(countyfips year) vce(cluster countyfips year) |
β |
Clustered (interacted clustering) | feols(wage ~ educ \| countyfips^year, dat, vcov = ~countyfips^year) |
reghdfe wage educ, absorb(countyfips#year) vce(cluster countyfips#year) |
β | |
Conley standard errors | feols(wage ~ educ, dat, vcov = conley("25 mi")) |
β |
These are APIs beyond model formula. Not to be discussed in this thread.
No panel-set function for pyfixest
yet.
Item | pyfixest |
fixest |
Stata | Note |
---|---|---|---|---|
Lag variables | feols(wage ~ educ + l(wage, 1), dat) |
reg wage educ l1.wage |
β | |
Lead variables | feols(wage ~ educ + f(wage, 1), dat) |
reg wage educ f1.wage |
β | |
First difference | feols(wage ~ educ + d(wage), dat) |
reg wage educ D.x |
β |
Thanks for sharing your perspective @grantmcdermott!
If I am honest, I wasn't really aware of fixest
's macros and wildcards, but of course Laurent has implemented all convenience features users could think of and more π I guess as an R-first user, I never had the need to properly consult (the excellent) stata2r
, which I hope can explain my ignorance ;). That given, I agree that there is a lot of value in keeping the syntax of both packages aligned. Even more, if we claim that pyfixest
is a faithful implementation of fixest
, then we should prioritize that the projects do not deviate too much from each other.
@Wenzhi-Ding , thanks for looking into this in so much detail. It'll probably take me a little longer to think through all of it, but here are my initial 5 cents:
Whenever possible, we should try to use formulaic
context when parsing formulas. In a nutshell, context's
allow us to do things as this:
import pandas
from formulaic import model_matrix, Formula
def my_transform(col: pandas.Series) -> pandas.Series:
return col ** 2
# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
pandas.DataFrame({"a": [1, 2, 3]}),
context={"my_transform": my_transform}, # could also use: context=locals()
)
This might for example be useful when we have to combine multiple variables to "interacted" fixed effects via ^
syntax, or for implementing lead()
and lag()
. There are more details on this in the formulaic docs which I definitely have to read more carefully =)
On syntax, do I understand your suggestion correctly that you'd like to have both: compatibility with fixest
syntax and special pyfixest
syntax? That's a solution I would personally be more than happy with!
Hi @Wenzhi-Ding , I just merged a PR in which I reworked the logic of model_matrix_fixest. I have moved away from formulaic.model_matrix()
to formulaic.Formula.get_model_matrix()
as recommended in the formulaic
developer guide. This facilitates the code logic in multiple dimensions - most importantly, I no longer have to keep track of NaN values dropped by instruments, endogeneous variables, or regression weights but instead .get_model_matrix()
takes care of it. Additionally, I use formulaic
contexts to map the fixed effects into integers (via pd.factorize). Here we could easily add additional syntax elements, like
lag()and
lead()`.
In general, I think it would be best to try to handle all additional formula syntax within model_matrix_fixest
, except for multiple estimation parsing, which under the current flow of the program needs to happen at the very beginning. Moving multiple estimation parsing into model_matrix_fixest
might be desirable but requires larger refactoring, and I don't get very excited when thinking of this at the moment ...
I also like the idea of adding a comparison page to the docs that compares fixest
syntax with pyfixest
syntax so that users can easily track what is implemented in pyfixest
or where syntax differs. As you have already added Stata comparisons, I'd also just include them, and we should also make sure to link to stata2r, as this has been your main source I believe? =)
Hi Alex @s3alfisc, it's great to hear your update! I have got a bit busy recently and haven't taken a look at implementing the above-mentioned syntax. I will work based on your update once I have time (maybe one or two weeks from now...).
And for the source of syntax comparison, I borrow them from stata2r (Thanks @grantmcdermott for sharing this great source. Help me a lot in switching my project!). But I haven't tested the consistency of the results produced. (I work on my research project recent days and found some inconsistency in clustered standard errors between pyfixest
and reghdfe
. I will dig deeper on this problem after I handle my research project over to my coauthors and maybe open a new issue if it is indeed a problem).
Hi Alex @s3alfisc, it's great to hear your update! I have got a bit busy recently and haven't taken a look at implementing the above-mentioned syntax. I will work based on your update once I have time (maybe one or two weeks from now...).
No worries and absolute 0 pressure! =)
I work on my research project recent days and found some inconsistency in clustered standard errors between pyfixest and reghdfe. I will dig deeper on this problem after I handle my research project over to my coauthors and maybe open a new issue if it is indeed a problem
On small discrepancies of clustered standard errors due to different small sample corrections, the fixest docs can't be beat. I don't know how exactly reghdfe
differs from fixest
, but pyfixest
clustered standard errors slightly differ from fixest
when there are fixed effects. I have written a short note on this here. Aligning fixest
and pyfixest
on this is on my to do list, but I haven't implemented it yet as these small differences rarely matter in my use cases, but this is a good reminder that I should prioritize this going forward =)
Hi Dave,
After reworking the formula parsing, py-fixest
is moving closer to fixest
syntax =) I have dropped support for the i_ref1
argument in place of specifying reference levels within "i()" syntax:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import pyfixest as pf
data = pf.get_data()
fit1 = pf.feols('Y ~ i(f1)', data = data)
fit2 = pf.feols('Y ~ i(f1) + i(f2)', data = data)
fit3 = pf.feols('Y ~ i(f1, ref=1)', data = data)
fit4 = pf.feols('Y ~ i(f1, X1, ref=1)', data = data)
Hi Alex, this is so great to see the update! I am sorry for this late reply. Got kind of busy last week. I will work based on your progress!
Hi Alex,
I am not familiar with
fixest
but was a heavy-user of Stata previously. Is there any approach to include a vector of similar variable names in the formula? For example:This will include all variables like x1, x2, x3 ... into regression model. Is it fine for
pyfixest
to have this syntax? Not sure whetherfixest
has similar syntax. If you think this is feasible, I can do it. Some syntax I propose isI guess syntax 1 will not break any existing syntax and is easy to learn. For syntax 2a-3b, I am not sure whether the benefit on convenience outweigh its learning cost, since they add some complexities, from user's perspective. Maybe we can introduce syntax 1 first, and ignore the rest until several other users propose?
By the way, I will add
keep
,drop
, andexact_match
tocoefplot
this week.Best regards, Dave