py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/pyfixest.html
MIT License
140 stars 28 forks source link

Clone `fixest` syntax parser #335

Open Wenzhi-Ding opened 6 months ago

Wenzhi-Ding commented 6 months ago

Hi Alex,

I am not familiar with fixest but was a heavy-user of Stata previously. Is there any approach to include a vector of similar variable names in the formula? For example:

reg y x*

This will include all variables like x1, x2, x3 ... into regression model. Is it fine for pyfixest to have this syntax? Not sure whether fixest has similar syntax. If you think this is feasible, I can do it. Some syntax I propose is

# Syntax 1: To include all variables like x1, x2, x3...
pf.feols("y~x*) 

# Syntax 2a: To include variables pre_3, pre_2, post_0, post_1, post_2 (for event study)
pf.feols("y~pre_[3-2]+post_[0-2]")

# Syntax 2b: Similar to syntax 2a, but additionally support gaps
pf.feols("y~pre_[3(-1)2]+post_[0(1)2]")

# Syntax 3a: To include age_20, age_30, age_40 (for controlling age bins)
pf.feols("y~age_[20,30,40]") 

# Syntax 3b: To include job_clerk, job_police (for controlling specific jobs)
pf.feols("y~job_[clerk,police]") 

I guess syntax 1 will not break any existing syntax and is easy to learn. For syntax 2a-3b, I am not sure whether the benefit on convenience outweigh its learning cost, since they add some complexities, from user's perspective. Maybe we can introduce syntax 1 first, and ignore the rest until several other users propose?

By the way, I will add keep, drop, and exact_match to coefplot this week.

Best regards, Dave

s3alfisc commented 6 months ago

Hi Dave, this is definitely a very interesting suggestion! fixest allows for macros via xpd() but I have to admit that I never really used it. It's also a very advanced functionality, so it might be ok to deviate from fixest syntax.

More concretely, I really like this syntax

pf.feols("y~x*) 

but we'd have to be careful, as * is a special symbol with Wilkinson formulas and if used slightly different, will produce an interaction term, i.e. y ~ x1*x2 evaluates to y ~ x1 + x2 + x1:x2.

More generally it would be amazing if you were to take a look at formula parsing via FormulaParser, which in my opinion it is the least clean part of the code base and will eventually require a major overwhaul πŸ˜„

Wenzhi-Ding commented 6 months ago

Thanks for pointing out the Wilkinson formula problem! I thought for a while and maybe there are two alternatives:

Solution 1: Provide a helper function to resemble variable list

Xs = pf.varlist("X*")  # Xs = "X1 + X2"
pf.feols(f"y ~ {Xs} + f1", data=df)

Solution 2: Parse the varlist function in formula

pf.feols("y ~ vlist(X*) + f1", data=df)

Then we can replacevlist(X*) by "X1 + X2" first, and then pass the formula to formulaic. In this way, the formulaic always receive un-confusing strings.

Solution 2 looks easier for users to learn since it does not require users to have knowledge of f-string. If using solution 2, what name should this function called?

How do you think Alex?

s3alfisc commented 6 months ago

Hi Dave, I like both solutions!

I don't think that syntax as Y ~ X* is a problem per se, it might just work as well. Though I have a slight preference for either the helper method or direct formula parsing. I just went through the formulaic docs (the library that pyfixest uses to create the design matrices) and it looks like there might be a way to add custom operators. But for the beginning, it probably makes a lot of sense to keep the parsing within pyfixest. Though I will open an issue in formulaic to maybe get some guidelines on how one would add a custom parser.

For the immediate future, I'd be happy to accept any of the three solutions you have proposed!

grantmcdermott commented 6 months ago

Unsolicited feedback: Personally, I strongly recommend against adopting any Stata-specific syntax above (R) fixest's native API if you can avoid it.

It's obvs perfectly fine to write a wrapper package that does this for users that want it. (Or implement it as bonus functionality on top of the mimicked API). But similar to my comment here, I think you do a disservice to both packages, and the overall goal of this project, if the API begins to drift away from the upstream R fixest package.

P.S. Ideally, a Stata user could simply consult https://stata2r.github.io/fixest/ and make the direct translation to PyFixest. That's the beauty of having a consistent API across both packages ;-)

Wenzhi-Ding commented 6 months ago

Hi @grantmcdermott, thank you so much for pointing out the https://stata2r.github.io/fixest/ website! It's really informative. So fixest has already supported such syntax

ctrls = c("age", "black", "hisp", "marr")
feols(wage ~ educ + .[ctrls], dat)

feols(wage ~ educ + ..('^x'), dat) # ^ = starts with
feols(wage ~ educ + ..('sp$'), dat) # $ = ends with
feols(wage ~ educ + ..('ac'), data)

I try in pyfixest, but such syntax is not supported.

import pyfixest as pf
df = pf.get_data()

Xs = ['X1', 'X2']
fit = pf.feols("Y ~ .[Xs] + f2", data=df)  # FormulaSyntaxError 

fit = pf.feols("Y ~ ('^X') + f2", data=df)  # X1 and X2 not included

Do you think mimicking this syntax into pyfixest is a good idea? Given fixest already has such syntax, I wouldn't invent a new one anyway.

grantmcdermott commented 6 months ago

Do you think mimicking this syntax into pyfixest is a good idea?

Yes, I think so @Wenzhi-Ding . It may be a bit trickier to match the exact syntax under Pythonic constraints. But I think you can get close enough that the translation is clear.

Wenzhi-Ding commented 6 months ago

I made a full review for both fixest and Stata formula syntax. To be honest, some fixest/Stata syntaxes do not follow a consistent design language.

Overall, I think syntax like sw(x1, x2) (stepwise), i(treat) (categorical), and f(wage, 1) (1-period lead) are good designs, that can be easily understood. Syntax like .[ctrls] (variable list pre-defined), ..('ac') (variables starts with), /, :, and statefips[year] are bad designs, which are not straightforward to be understood and remember.

My overall feeling is that we need to follow existing prevalent packages (to mitigate switching costs), but it is also worthwhile to provide alternatives that slightly deviate from the dependent path to reduce onboarding costs (for new users to comprehend and remember syntax quickly). And of course, these two ways can co-exist. R users can use fixest syntax, while new users may refer to a more readable way.

In this sense, I would like to propose a consistent "function in the formula" design pattern to make pyfixest formula more readable. The idea is that all special syntax within the formula should be used as a function with a meaningful name. And we should introduce as less symbols as possible. (Now we have ~, +, | for basic syntax and ^, *, :, / for interaction. I think that is already pretty much to be remembered)

For macros (referencing list outside formula):

controls = ["X1", "X2"]
pf.feols("Y ~ f2 + list(controls)", data=df)
pf.feols("Y ~ f2 + .[controls]", data=df)  # fixest's syntax supported

For wildcards (matching variable names by pattern):

pf.feols("Y ~ f2 + vars(X*)", data=df)  # support alias: v(X*)
pf.feols("Y ~ f2 + ..('ac')", data=df)  # fixest's syntax supported

For lead, lag, difference, and logarithm difference in panel data:

pf.feols("Y ~ f2 + lead(X,1)", data=df, id_var="id", time_var="time")  # support alias and fixest's syntax: f(X,1)
pf.feols("Y ~ f2 + lag(X,1)", data=df, id_var="id", time_var="time")  # support alias and fixest's syntax: l(X,1)
pf.feols("Y ~ f2 + diff(X,1)", data=df, id_var="id", time_var="time")  # support alias d(X,1) or fixest's syntax: d(X)
pf.feols("Y ~ f2 + logdiff(X,1)", data=df, id_var="id", time_var="time")  # support alias: logd(X,1)

Categorical with specified baseline (this is a good design from fixest)

pf.feols("Y ~ f2 + i(X,ref=1)", data=df)

Categorical/continuous: by default, the covariate part are continuous variables, and fixed effects part are categorical variables. But users can also specify variable type by function.

pf.feols("Y ~ c(f2) + i(X1) | group^c(year)", data=df)  # to control time trend for each group

Please let me know your opinion. But anyway I wouldn't try to implement all of them at once. And I will guarantee backward compatibility for any new syntax introduced, i.e., no existing script will be broken.


Here is a pyfixest, fixest, and Stata syntax comparison:

Reference:

Please correct me if anything wrong. Maybe we can add a webpage like this to smooth R and Stata user's switching.

Models

Item pyfixest fixest Stata Note
Simple regression pf.feols("wage~educ + age", data=dat) feols(wage ~ educ + age, data = dat) reg wage educ age βœ…
Categorical variables pf.feols("wage~educ + i(treat)", dat) feols(wage ~ educ + i(treat), dat) reg wage educ i.treat βœ…
Categorical variables (specify baseline) pf.feols("wage~educ + i(treat)", dat, i_ref1=1) feols(wage ~ educ + i(treat, ref = 1), dat) reg wage educ ib1.treat βœ…
Fixed effects pf.feols("wage ~ educ \| countyfips", data=df, vcov={"CRV1": "countyfips"}) feols(wage ~ educ \| countyfips, dat, vcov = ~countyfips) reghdfe wage educ, absorb(countyfips) cluster(countyfips) βœ…
Weights pf.feols("wage~educ + i(treat)", dat, weights=w) feols(wage ~ educ + i(treat), dat, weights = ~w) reg wage educ i.treat [aw = w] βœ…
Instrumental variables pf.feols("wage ~ marr \| countyfips \| educ ~ age", data=dat) feols(wage ~ marr \| countyfips \| educ ~ age, dat) ivreghdfe 2sls wage marr (educ = age), absorb(countyfips) βœ…
Nonlinear models (Logit) feglm(marr ~ age + black + hisp, dat, family = 'logit') logit marr age black hisp ❌
Nonlinear models (Poisson) pf.fepois("educ ~ age + black + hisp \| statefips + year", dat) fepois(educ ~ age + black + hisp \| statefips + year, dat) ppmlhdfe educ age black hisp, absorb(statefips year) vce(cluster statefips) βœ…
Macros ctrls = c("age", "black", "hisp", "marr") local ctrls age black hisp marr ❌
feols(wage ~ educ + .[ctrls], dat) reg wage educ ``ctrls'
Wildcards (starts with) feols(wage ~ educ + ..('^x'), dat) reg wage educ x* ❌
Wildcards (ends with) feols(wage ~ educ + ..('sp$'), dat) reg wage educ *sp ❌
Wildcards (contains) feols(wage ~ educ + ..('ac'), dat) reg wage educ *ac* ❌
Split sample feols(wage ~ educ \| countyfips, dat, split = ~hisp) ❌
Multiple dependent variables pf.feols("wage + educ ~ age + marr \| countyfips + year", dat) feols(c(wage, educ) ~ age + marr \| countyfips + year, dat) βœ…
Stepwise regression pf.feols("wage ~ educ + sw(age, marr)", dat) feols(wage ~ educ + sw(age, marr), dat) βœ…
Stepwise regression (cumulative) pf.feols("wage ~ educ + csw(age, marr)", dat) feols(wage ~ educ + csw(age, marr), dat) βœ…
Stepwise regression (cumulative FE) pf.feols("wage ~ educ \| csw(year, statefips)", dat) feols(wage ~ educ \| csw(year, statefips), dat) βœ…

Interactions

Item pyfixest fixest Stata Note
Interact continuous variables (single) feols(wage ~ educ:age, dat) reg wage c.educ#c.age ❌
Interact continuous variables (full) pf.feols("wage ~ educ*age", dat) feols(wage ~ educ*age, dat) reg wage c.educ##c.age βœ…
Interact continuous variables (polynomials single) feols(wage ~ I(age^2), dat) reg wage c.age#c.age ❌
Interact continuous variables (full) feols(wage ~ poly(age, 2, raw = TRUE)) reg wage c.age##c.age ❌
Interact categorical variables feols(wage ~ i(treat, i.hisp), dat) reg wage i.treat#i.hisp ❌
Interact categorical variables feols(wage ~ factor(treat)/factor(hisp), dat) reg wage i.treat i.treat#i.hisp ❌
Interact categorical variables feols(wage ~ factor(treat)*factor(hisp), dat) reg wage i.treat##i.hisp ❌
Interact categorical with continuous variables pf.feols("wage ~ I(educ,age)", dat) feols(wage ~ i(treat, age), dat) reg wage i.treat#c.age βœ…
Interact categorical with continuous variables feols(wage ~ factor(treat):age, dat) reg wage i.treat#c.age ❌
Interact categorical with continuous variables feols(wage ~ factor(treat)/age, dat) reg wage i.treat i.treat#c.age ❌
Interact categorical with continuous variables pf.feols("wage ~ I(educ,age)", dat) feols(wage ~ factor(treat)*age, dat) reg wage i.treat##c.age βœ…
Interact fixed effects pf.feols("wage ~ educ \| statefips^year", dat) feols(wage ~ educ \| statefips^year, dat) reghdfe wage educ, absorb(statefips#year) βœ…
Interact fixed effects (Varying slopes) feols(wage ~ educ \| statefips[year], dat) reghdfe wage educ, absorb(statefips#c.year) vce(cluster statefips#c.year) ❌

Standard errors

Item pyfixest fixest Stata Note
HC pf.feols("wage ~ educ", dat, vcov = 'HC1') feols(wage ~ educ, dat, vcov = 'hc1') reg wage educ, vce(robust) βœ…
HC pf.feols("wage ~ educ", dat, vcov = 'HC3') feols(wage ~ educ, dat, vcov = sandwich::vcovHC) reg wage educ, vce(hc3) βœ…
HAC feols(y ~ x, dat, vcov = 'NW', panel.id = ~unit + time) ivreghdfe wage educ, bw(auto) vce(robust) ❌
Clustered (auto clustered by FE) pf.feols("wage ~ educ \| countyfips", dat) feols(wage ~ educ \| countyfips, dat) reghdfe wage educ, absorb(countyfips) vce(cluster countyfips) βœ…
Clustered (twoway clustering) pf.feols("wage ~ educ \| countyfips + year", dat, vcov={"CRV1": "countyfips + year"}) feols(wage ~ educ \| countyfips + year, dat, vcov = ~countyfips + year) reghdfe wage educ, absorb(countyfips year) vce(cluster countyfips year) βœ…
Clustered (interacted clustering) feols(wage ~ educ \| countyfips^year, dat, vcov = ~countyfips^year) reghdfe wage educ, absorb(countyfips#year) vce(cluster countyfips#year) ❌
Conley standard errors feols(wage ~ educ, dat, vcov = conley("25 mi")) ❌

Presentation

These are APIs beyond model formula. Not to be discussed in this thread.

Panel

No panel-set function for pyfixest yet.

Item pyfixest fixest Stata Note
Lag variables feols(wage ~ educ + l(wage, 1), dat) reg wage educ l1.wage ❌
Lead variables feols(wage ~ educ + f(wage, 1), dat) reg wage educ f1.wage ❌
First difference feols(wage ~ educ + d(wage), dat) reg wage educ D.x ❌
s3alfisc commented 6 months ago

Thanks for sharing your perspective @grantmcdermott!

If I am honest, I wasn't really aware of fixest's macros and wildcards, but of course Laurent has implemented all convenience features users could think of and more πŸ˜„ I guess as an R-first user, I never had the need to properly consult (the excellent) stata2r , which I hope can explain my ignorance ;). That given, I agree that there is a lot of value in keeping the syntax of both packages aligned. Even more, if we claim that pyfixest is a faithful implementation of fixest, then we should prioritize that the projects do not deviate too much from each other.

@Wenzhi-Ding , thanks for looking into this in so much detail. It'll probably take me a little longer to think through all of it, but here are my initial 5 cents:

Whenever possible, we should try to use formulaic context when parsing formulas. In a nutshell, context's allow us to do things as this:

import pandas
from formulaic import model_matrix, Formula

def my_transform(col: pandas.Series) -> pandas.Series:
    return col ** 2

# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)

image

This might for example be useful when we have to combine multiple variables to "interacted" fixed effects via ^ syntax, or for implementing lead() and lag(). There are more details on this in the formulaic docs which I definitely have to read more carefully =)

On syntax, do I understand your suggestion correctly that you'd like to have both: compatibility with fixest syntax and special pyfixest syntax? That's a solution I would personally be more than happy with!

s3alfisc commented 6 months ago

Hi @Wenzhi-Ding , I just merged a PR in which I reworked the logic of model_matrix_fixest. I have moved away from formulaic.model_matrix() to formulaic.Formula.get_model_matrix() as recommended in the formulaic developer guide. This facilitates the code logic in multiple dimensions - most importantly, I no longer have to keep track of NaN values dropped by instruments, endogeneous variables, or regression weights but instead .get_model_matrix() takes care of it. Additionally, I use formulaic contexts to map the fixed effects into integers (via pd.factorize). Here we could easily add additional syntax elements, likelag()andlead()`.

In general, I think it would be best to try to handle all additional formula syntax within model_matrix_fixest, except for multiple estimation parsing, which under the current flow of the program needs to happen at the very beginning. Moving multiple estimation parsing into model_matrix_fixest might be desirable but requires larger refactoring, and I don't get very excited when thinking of this at the moment ...

I also like the idea of adding a comparison page to the docs that compares fixest syntax with pyfixest syntax so that users can easily track what is implemented in pyfixest or where syntax differs. As you have already added Stata comparisons, I'd also just include them, and we should also make sure to link to stata2r, as this has been your main source I believe? =)

Wenzhi-Ding commented 6 months ago

Hi Alex @s3alfisc, it's great to hear your update! I have got a bit busy recently and haven't taken a look at implementing the above-mentioned syntax. I will work based on your update once I have time (maybe one or two weeks from now...).

And for the source of syntax comparison, I borrow them from stata2r (Thanks @grantmcdermott for sharing this great source. Help me a lot in switching my project!). But I haven't tested the consistency of the results produced. (I work on my research project recent days and found some inconsistency in clustered standard errors between pyfixest and reghdfe. I will dig deeper on this problem after I handle my research project over to my coauthors and maybe open a new issue if it is indeed a problem).

s3alfisc commented 6 months ago

Hi Alex @s3alfisc, it's great to hear your update! I have got a bit busy recently and haven't taken a look at implementing the above-mentioned syntax. I will work based on your update once I have time (maybe one or two weeks from now...).

No worries and absolute 0 pressure! =)

I work on my research project recent days and found some inconsistency in clustered standard errors between pyfixest and reghdfe. I will dig deeper on this problem after I handle my research project over to my coauthors and maybe open a new issue if it is indeed a problem

On small discrepancies of clustered standard errors due to different small sample corrections, the fixest docs can't be beat. I don't know how exactly reghdfe differs from fixest, but pyfixest clustered standard errors slightly differ from fixest when there are fixed effects. I have written a short note on this here. Aligning fixest and pyfixest on this is on my to do list, but I haven't implemented it yet as these small differences rarely matter in my use cases, but this is a good reminder that I should prioritize this going forward =)

s3alfisc commented 5 months ago

Hi Dave,

After reworking the formula parsing, py-fixest is moving closer to fixest syntax =) I have dropped support for the i_ref1 argument in place of specifying reference levels within "i()" syntax:

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import pyfixest as pf
data = pf.get_data()

fit1 = pf.feols('Y ~ i(f1)', data = data)
fit2 = pf.feols('Y ~ i(f1) + i(f2)', data = data)
fit3 = pf.feols('Y ~ i(f1, ref=1)', data = data)
fit4 = pf.feols('Y ~ i(f1, X1, ref=1)', data = data)

image

Wenzhi-Ding commented 5 months ago

Hi Alex, this is so great to see the update! I am sorry for this late reply. Got kind of busy last week. I will work based on your progress!