py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/pyfixest.html
MIT License

`lpdid`: compute pre-treatment `ATT` #268

Open s3alfisc opened 5 months ago

Wenzhi-Ding commented 4 months ago

I have been exploring these new DID methods recently. Maybe I can take this on, and also integrate other prevalent new DID methods into pyfixest, such as Callaway and Sant'Anna (2021) and Sun and Abraham (2021), if you think it is good to have these functions inside pyfixest. Alternatively, I could write independent packages that call pyfixest. I am not sure which approach is better from a design perspective.

s3alfisc commented 4 months ago

That would be fantastic, @Wenzhi-Ding! All contributions in this area would be very much appreciated. I'd be open either to integrating new estimators into pyfixest or to you starting a new repo (if you do, one option would be to simply "fork" the did module into a standalone repo and build upon it). My suggestion is to start within pyfixest, as the module will be easier for users to find, and then decide later whether a standalone project makes sense.

On the did estimators that are implemented, there are a few things that I think would benefit from a second look / a caring hand. I'm mostly listing them here, not necessarily in order of importance :D

%load_ext autoreload
%autoreload 2

import numpy as np  # needed for np.inf in the did2s call further down
import pandas as pd
from pyfixest.did.estimation import lpdid, event_study, did2s

url = "https://raw.githubusercontent.com/s3alfisc/pyfixest/master/pyfixest/did/data/df_het.csv"
df_het = pd.read_csv(url)

fit = lpdid(
    df_het,
    yname="dep_var",
    idname="unit",
    tname="year",
    gname="g",
    vcov={"CRV1": "state"},
    pre_window=-20,
    post_window=20,
    att=False
)

fit.tidy().index
#Index(['time_to_treatment::-20', 'time_to_treatment::-19',
#       'time_to_treatment::-18', 'time_to_treatment::-17',
#       'time_to_treatment::-16', 'time_to_treatment::-15',
#       'time_to_treatment::-14', 'time_to_treatment::-13',
# etc

while did2s returns

fit = did2s(
    df_het,
    yname="dep_var",
    first_stage="~ 0 | unit + year",
    second_stage="~i(rel_year)",
    treatment="treat",
    cluster="state",
    i_ref1=[-1.0, np.inf],
)
fit._coefnames 
#['C(rel_year,contr.treatment(base=-1.0))[T.-20.0]',
# 'C(rel_year,contr.treatment(base=-1.0))[T.-19.0]',
# 'C(rel_year,contr.treatment(base=-1.0))[T.-18.0]',
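
The two label formats shown above can be mapped onto a common relative-period key. The following is a minimal sketch that assumes only the two formats printed in the outputs above; `relative_period` is a hypothetical helper and not part of pyfixest.

```python
import re

# Hypothetical helper, not part of pyfixest. It assumes labels look like
# 'time_to_treatment::-20' (lpdid) or '...[T.-20.0]' (did2s), as shown above.
_LPDID_LABEL = re.compile(r"^time_to_treatment::(-?\d+(?:\.\d+)?)$")
_DID2S_LABEL = re.compile(r"\[T\.(-?\d+(?:\.\d+)?)\]$")


def relative_period(label: str) -> int:
    """Extract the period relative to treatment from a coefficient label."""
    for pattern in (_LPDID_LABEL, _DID2S_LABEL):
        match = pattern.search(label)
        if match:
            # did2s labels carry a float such as '-19.0'; convert via float first.
            return int(float(match.group(1)))
    raise ValueError(f"unrecognized coefficient label: {label!r}")
```

With such a mapping, coefficients from both estimators could be aligned on the same integer index before comparing them.
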

Wenzhi-Ding commented 4 months ago

This is super informative! I will also think about the issues you mentioned. I agree that integrating everything in one place makes it easier for researchers to find (a one-stop solution).

Abstracting a standard DID class would be super cool and influential. That way, researchers could quickly verify their results across different models. I am still catching up on this literature, so I may not be able to contribute code quickly. But if there is any related discussion on this topic, please do notify me. I am more than willing to engage in it.
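
One way to picture such a shared interface is an abstract base class that every estimator implements, so the same data can be passed to several models and the results compared side by side. This is only a sketch under assumptions: `DIDEstimator`, `RawEventMeans`, and `compare` are hypothetical names, not pyfixest's API, and the toy subclass computes raw outcome means rather than a real DID estimate.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class DIDEstimator(ABC):
    """Hypothetical shared interface; not pyfixest's actual class design."""

    def __init__(self, yname: str, tname: str, gname: str):
        self.yname, self.tname, self.gname = yname, tname, gname

    @abstractmethod
    def estimate(self, rows: list[dict]) -> dict[int, float]:
        """Return one estimate per period relative to treatment."""


class RawEventMeans(DIDEstimator):
    """Placeholder only: mean outcome by relative period, not a causal estimator."""

    def estimate(self, rows: list[dict]) -> dict[int, float]:
        sums, counts = defaultdict(float), defaultdict(int)
        for row in rows:
            if row[self.gname] == 0:  # assumption: 0 marks never-treated units
                continue
            rel = row[self.tname] - row[self.gname]
            sums[rel] += row[self.yname]
            counts[rel] += 1
        return {rel: sums[rel] / counts[rel] for rel in sorted(sums)}


def compare(estimators: list[DIDEstimator], rows: list[dict]) -> dict:
    """Run every estimator on the same data and collect results by class name."""
    return {type(e).__name__: e.estimate(rows) for e in estimators}
```

Each real estimator would then only need to implement `estimate` with the same return shape for `compare` to work across models.
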