py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/pyfixest.html
MIT License

Add support for randomization inference #431

Closed s3alfisc closed 1 month ago

s3alfisc commented 1 month ago

This PR adds support for randomization inference via a ritest method for Feols.

apoorvalal commented 1 month ago

Hey Alex,

(possibly unsolicited) metrics advice on this PR: I think using the studentized statistic (where you calculate the t-stat as $\hat{\tau}/\sqrt{\hat{V}}$ in each permutation) has better properties [in both the Fisherian and Neymanian sense] than the simpler approach of constructing the randomization distribution from the point estimate alone. Shouldn't be a major change; one would presumably change _get_ritest_coefs to _get_ritest_studentized (or simply add that as an option, so that instead of returning the point estimate you return the t-stat).

Ref: chap 7 of Peng Ding's book [based on this 2021 paper]
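
For concreteness, a rough sketch of what the studentized version could look like for a single regressor (plain numpy, purely illustrative - the function and variable names are not pyfixest internals, and the HC0 variance is just for the example):

import numpy as np

def ri_pvalue_studentized(y, d, reps=1000, seed=123):
    # Randomization inference under the sharp null H0: tau = 0.
    # In each permutation we recompute the studentized statistic
    # t = tau_hat / sqrt(V_hat) rather than the point estimate alone.
    rng = np.random.default_rng(seed)

    def t_stat(d_perm):
        X = np.column_stack([np.ones(len(d_perm)), d_perm])
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        u = y - X @ beta
        V = XtX_inv @ (X.T * u**2) @ X @ XtX_inv  # HC0 sandwich, for illustration
        return beta[1] / np.sqrt(V[1, 1])

    t_obs = t_stat(d)
    t_perm = np.array([t_stat(rng.permutation(d)) for _ in range(reps)])
    return np.mean(np.abs(t_perm) >= np.abs(t_obs))  # two-sided RI p-value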

s3alfisc commented 1 month ago

It's very much appreciated! The more feedback the better =) I actually started out with a t-percentile implementation, but then checked out Grant's version, which works on the fitted betas. I'll switch back to the t-stats as the default and will allow running the test on either t-stats or coefficients. I vaguely recall that Alwyn Young recommends using the betas rather than the t-stats for the IV wild bootstrap - have you happened to see a similar result for RI? Also, great that you point me to Ding's book, I've been looking for a good write-up on RI =) Thanks!

s3alfisc commented 1 month ago

I have now implemented two algorithms - one is "fast" and the other is "slow". Both so far only work for iid sampling.

The "slow" one simply loops over calls to "feols" or "fepois" and hence works for OLS, IV, and Poisson regression. You can choose between different variants, "randomization-c" and "randomization-t", following the naming conventions introduced by Young.

The "fast" algorithm only works for OLS and "randomization-c" at the moment. It is vectorized and employs the FWL theorem; going forward, some speed-ups should be possible by JIT compiling it via numba. Users can choose "how much" they want to vectorize (as creating an N x reps matrix can be costly if either N or reps is large). To support "randomization-t", I will have to slightly rework the functions implemented in the vcov method / make them more "generically available".

Here's a usage example of the new method:

%load_ext autoreload
%autoreload 2

import pyfixest as pf
import numpy as np
data = pf.get_data(N=10_000)

fml = "Y ~ X1*X2*f2 |f1 + f3"

fit = pf.feols(fml, data=data)
fit.tidy().head()

rng = np.random.default_rng(1234)
fit.ritest(
    resampvar="X2",
    reps=10_000,
    rng=rng,
    type="randomization-c",
    choose_algorithm="fast",
    algo_iterations=1000,  # number of for-loop iterations; each iteration draws reps / algo_iterations permutations
    include_plot=True,
)

[ritest plot output]

To Do's:

Overall, more work than I expected!

s3alfisc commented 1 month ago

Hi @apoorvalal - one question on the "randomization-t" variant: Which defaults should I set for the computation of the vcov?

Should I default to the vcov type set in the "feols" call? Then it could in principle happen that ritest computes iid inference even under cluster random assignment - which would not be in the spirit of Athey et al.

Here's an example:

fit = pf.feols("Y ~ X1 | f1", vcov="iid")     # iid ses
fit.ritest(resampvar="X1", cluster="f1")      # cluster random assignment; ses should be CRV1-f1

Under the proposed solution, the vcov matrix in each RI iteration would be computed as iid, despite the cluster random assignment. If I went with this solution, I should at least add a warning message?

Alternatively, I could default to computing CRV variance matrices on the level of cluster assignment and overwrite the vcov type of the feols call.

Do you have any thoughts on this? I hope you could follow =D

s3alfisc commented 1 month ago

TODO:

apoorvalal commented 1 month ago

Hi Alex, that's an excellent question; I'm not sure I know the answer off the top of my head. I understand the behaviour of randomization-t in the pure randomized-trial setting with no noncompliance, but RI is much less clear to me in settings with noncompliance [Young's paper doesn't motivate it from potential outcomes, so I don't really know how to reconcile it with Abadie et al. and/or Ding's papers/book].

Aronow, Chang, and Lopatto put out an interesting-looking paper a couple of weeks ago that might be worth looking at as well.

s3alfisc commented 1 month ago

Thanks Apoorva - for now, I have deleted the vcov arg to ritest() and by default compute the vcov as

- iid if there is individual-level sampling and no controls,
- HC1 if there is individual-level sampling and controls, and
- CRV1 under cluster sampling.

I think that's a sensible choice, hopefully you agree? 😅
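
Roughly, the rule amounts to something like the following (illustrative pseudologic only, not the actual implementation):

def _default_ritest_vcov(cluster=None, has_controls=False):
    # Illustrative sketch of the default described above, not pyfixest code:
    # cluster random assignment -> CRV1 on the assignment cluster,
    # individual-level assignment -> HC1 with controls, iid without.
    if cluster is not None:
        return {"CRV1": cluster}
    return "HC1" if has_controls else "iid"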

s3alfisc commented 1 month ago

Open to-do's:
