py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License

Unify the handling of random number control #453

Open s3alfisc opened 6 months ago

s3alfisc commented 6 months ago

wildboottest() requires a seed argument, rwolf() requires a seed, ritest() wants an rng, and confint() uses a seed. It is likely best practice to use one global np.random.default_rng() for all of these function arguments.

This requires changes to wildboottest().
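
Concretely, the pattern would look something like this minimal sketch (the helper functions are hypothetical stand-ins, not pyfixest's API): one generator is created at the top of the script and passed explicitly to everything that draws random numbers.

```python
import numpy as np

# One generator per script, shared by every function that needs
# randomness (the helpers below are illustrative stand-ins only).
rng = np.random.default_rng(1234)

def draw_rademacher_weights(n, rng):
    # e.g. weights for one wild bootstrap draw
    return rng.choice([-1.0, 1.0], size=n)

def permutation_draw(x, rng):
    # e.g. one reshuffle for a randomization test
    return rng.permutation(x)

w = draw_rademacher_weights(10, rng)
p = permutation_draw(np.arange(10), rng)
```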

amichuda commented 6 months ago

I'm currently working on a PR that implements a score bootstrap so wildboottest can be used with statsmodels maximum likelihood models (or really any statsmodels model that implements score_obs and hessian). I can go ahead and write a PR for this as well.
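
For intuition, here is a rough toy sketch of the score-bootstrap idea for a statsmodels MLE, using only score_obs and hessian. This is my own illustration of the Kline & Santos construction, not the PR's code, and it glosses over details such as imposing the null on the scores.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
X = sm.add_constant(rng.normal(size=(n, 1)))
y = (X @ np.array([0.2, 0.5]) + rng.logistic(size=n) > 0).astype(float)

res = sm.Logit(y, X).fit(disp=0)
s = res.model.score_obs(res.params)   # (n, k) per-observation scores
H = res.model.hessian(res.params)     # (k, k) Hessian at the MLE
A_inv = np.linalg.inv(-H)

# Perturb the scores with Rademacher weights instead of re-estimating;
# real score bootstraps typically use null-restricted scores, which is
# omitted here for brevity.
B = 999
t_boot = np.empty(B)
for b in range(B):
    w = rng.choice([-1.0, 1.0], size=n)
    delta = A_inv @ (s * w[:, None]).sum(axis=0)  # one-step update
    t_boot[b] = delta[1] / res.bse[1]
p_boot = (np.abs(t_boot) >= np.abs(res.tvalues[1])).mean()
```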

Was the idea that you can either choose a seed or pass in an rng, which determines whether an rng is created during instantiation or the passed rng is used directly?

s3alfisc commented 6 months ago

I'm currently working on a PR that implements a score bootstrap so wildboottest

Oh wow, super cool! Let me know if / once you'd like me to review anything!

Was the idea that you can either choose a seed or pass in an rng, which determines whether an rng is created during instantiation or the passed rng is used directly?

So my hunch was that it is best practice to really only define one rng at the top of the script and then to pass it to every function in the script - but I'm not 100% sure. Now that 10 minutes have passed, I think that you should be able to easily convince me that we should save ourselves some work and just stick with seed šŸ˜„ In this case, I'd simply change ritest in pyfixest to have a seed argument instead of rng, and then I'd have my desired consistency.

Else you're right, the idea here would have been to allow both a seed and an rng argument, with the rng taking priority, and seed eventually being deprecated.
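
A minimal sketch of what that resolution logic could look like (a hypothetical helper, not a committed API):

```python
import warnings
import numpy as np

def _resolve_rng(seed=None, rng=None):
    """Return a Generator; an explicitly passed rng takes priority."""
    if rng is not None:
        if seed is not None:
            warnings.warn("Both `seed` and `rng` passed; ignoring `seed`.")
        return rng
    if seed is not None:
        warnings.warn(
            "`seed` is deprecated; pass `rng=np.random.default_rng(seed)`.",
            DeprecationWarning,
        )
    return np.random.default_rng(seed)  # seed=None draws fresh entropy

rng = _resolve_rng(seed=92)  # or _resolve_rng(rng=my_generator)
```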

amichuda commented 6 months ago

Else you're right, the idea here would have been to allow both a seed and an rng argument, with the rng taking priority, and seed eventually being deprecated.

Yes, that's how I was thinking of it, actually. If the user cares about having the same random number generator, then they have that option, but if they just want to put in a seed for reproducibility, then they have that too.

In re: the score bootstrap: absolutely! In fact, I definitely want you to take a look to make sure that it all makes sense. The paper you cite in the repo seems to suggest that the score bootstrap has drawbacks, so I want to make sure that it even makes sense to implement it (apparently over-rejection is a problem for the score bootstrap, but I'm not sure under what conditions).

Also, how do you usually validate that these bootstraps are giving results that make sense? With made-up data and homoskedastic errors, standard errors from the bootstrap and from standard regression results should be very close to one another, right?

s3alfisc commented 6 months ago

In re: the score bootstrap: absolutely! In fact, I definitely want you to take a look to make sure that it all makes sense. The paper you cite in the repo seems to suggest that the score bootstrap has drawbacks, so I want to make sure that it even makes sense to implement it (apparently over-rejection is a problem for the score bootstrap, but I'm not sure under what conditions).

Yes - here's a short comment from @droodman on the score bootstrap. The tldr is that he did not find the wild bootstrap to be very useful for non-linear estimators. I am not aware of any extensive simulation exercises on the practical usefulness of the score bootstrap.

A decent alternative for non-linear models seems to be the CRV3 robust estimator, which pyfixest supports for Poisson regression. You can find some simulations in @zeileis' paper on the sandwich R package. MacKinnon, Nielsen & Webb show that it is equivalent to a cluster jackknife. It's implemented in pyfixest for feols and fepois. Maybe this suits your use case better than the score bootstrap?
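
For reference, requesting the estimator in pyfixest looks roughly like this (a sketch based on my reading of the docs; double-check the current vcov syntax):

```python
import numpy as np
import pandas as pd
import pyfixest as pf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y": rng.poisson(2, size=200),
    "x": rng.normal(size=200),
    "g": rng.integers(0, 10, size=200),
})

# CRV3 / cluster-jackknife standard errors, clustered by `g`
fit = pf.fepois("y ~ x", data=df, vcov={"CRV3": "g"})
fit.tidy()
```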

Nevertheless, I would be happy about an implementation - it's always better to have it than not to have it - and also happy to review =)

Regarding tests, I employ two strategies. The first is a "large number of bootstrap draws" strategy, in which I resample randomly and test fwildclusterboot against WildBootTests.jl. We test wildboottest this way against fwildclusterboot here. These tests take annoyingly long to run =D
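
As a toy illustration of this first strategy (not the actual test suite): with a large B, two independent runs, or two correct implementations, should agree on p-values up to Monte Carlo error, so the comparison uses a loose tolerance rather than exact equality.

```python
import numpy as np

def wild_p_value(t_obs, scores, B, rng):
    # toy wild bootstrap of a studentized sum with Rademacher weights
    signs = rng.choice([-1.0, 1.0], size=(B, scores.size))
    t_boot = (signs * scores).sum(axis=1) / np.sqrt((scores**2).sum())
    return (np.abs(t_boot) >= abs(t_obs)).mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=20)
t_obs = scores.sum() / np.sqrt((scores**2).sum())
p1 = wild_p_value(t_obs, scores, B=100_000, rng=rng)
p2 = wild_p_value(t_obs, scores, B=100_000, rng=rng)
assert abs(p1 - p2) < 0.01  # equal only up to simulation noise
```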

The second strategy is to test under full enumeration. In this case, the weights matrix is deterministic and all packages should produce exactly identical bootstrap t-statistics. This happens here for wildboottest.
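
A small sketch of what full enumeration means for Rademacher weights: with G clusters there are only 2**G distinct sign vectors, so the weights matrix is fixed and every implementation must produce identical statistics.

```python
import itertools
import numpy as np

G = 4  # number of clusters
# all 2**G Rademacher weight vectors; no randomness involved
weights = np.array(list(itertools.product([-1.0, 1.0], repeat=G)))
print(weights.shape)  # (16, 4)
```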

droodman commented 6 months ago

The Kline & Santos paper proposing the score bootstrap has some simulations, I think with the probit model. Code and data, FWIW, are available from https://eml.berkeley.edu/~pkline/.

My perspective is warped by the intense experience of finding a double-hump distribution in a non-linear estimator on a particular data set through a non-parametric bootstrap. I think the score bootstrap and conventional standard errors would just characterize the curvature of the likelihood surface at whatever maximum was found, and miss that there was a whole other local maximum nearly as good. But I'm not suggesting that the score bootstrap is worse than more standard, non-bootstrap approaches.

amichuda commented 6 months ago

Thank you @droodman ! I was trying to find the code for Kline and Santos, but the journal link was dead for me.

So I suppose there's still value to the score bootstrap in certain cases then? I skimmed your re-analysis paper briefly; was figure 4 generated by the score bootstrap?

droodman commented 6 months ago

I did the microcredit work before I had ever heard of a wild bootstrap, and anyway before the score bootstrap was published. So figure 4 was generated by the non-parametric or "pairs" bootstrap--the original bootstrap, where you just resample the data, in this case by cluster (see the sketch below). It's much more CPU-intensive, but it gives you more information about the distribution of the estimator in a given context. The score bootstrap is entirely anchored at the full-sample solution, and just improves on the clustering adjustment when large-sample assumptions might not be accurate, like when there are few clusters.
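
For concreteness, a single pairs-bootstrap draw by cluster looks like this (a minimal sketch; the mean stands in for whatever estimator you re-run):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"g": np.repeat(np.arange(5), 4),
                   "y": rng.normal(size=20)})

# resample whole clusters with replacement, then re-estimate on the
# stacked sample; repeat B times to trace out the full distribution
ids = rng.choice(df["g"].unique(), size=5, replace=True)
boot = pd.concat([df[df["g"] == g] for g in ids], ignore_index=True)
theta_star = boot["y"].mean()  # stand-in for re-running the estimator
```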

But sure, the score bootstrap can have value, maybe for simple non-linear estimators where multiple local maxima are impossible or unlikely, and where clusters are few.

amichuda commented 6 months ago

Oh I see, so generally, your critique here is for ALL non-parametric bootstraps of nonlinear models and not just for this particular one, the score bootstrap?

droodman commented 6 months ago

I think non-parametric bootstrapping of non-linear estimators can be a very good thing. For me, it gave important insight, exposing the bimodality of a nonlinear estimator.

s3alfisc commented 5 months ago

@amichuda - @mattdwebb just posted a new working paper on cluster-robust inference for logit models. Maybe relevant for your use case? =)

amichuda commented 5 months ago

Thanks! So this is the same as the jackknife you mentioned for the Poisson models in pyfixest?

s3alfisc commented 5 months ago

Yes, it seems to be the same jackknife estimator as in pyfixest - their conclusion in the abstract seems to be that the CV3 jackknife performs better than conventional CV1 estimators. Among the class of new estimators they introduce, it is not necessarily the best (but it seems hard to produce a clear ranking). The jackknife is of course nice as it is easy to implement, but it might be computationally costly - they develop an alternative estimator that seems to be faster to compute.

zeileis commented 5 months ago

I agree, this is also my understanding from a quick first read. Generally, JK is always an "ok" choice in terms of performance, even if there might be other flavors that perform a bit better. It may just be costly to compute - but it also may not, e.g., in the case of relatively few clusters.

In some cases, you can compute the JK _exactly_ and _more efficiently_ without any resampling, via the HC3/CV3 estimators. This is the case for cross-section data or for data with a relatively low number of observations overall. If it is not possible to compute it exactly, you might still get a computationally cheaper _approximate_ computation via linearization.
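
For OLS, for example, the leave-one-out coefficients follow in closed form from the hat matrix, so the exact jackknife needs no refitting at all (a minimal numpy sketch of that identity, not sandwich's code):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverage values

# exact leave-one-out coefficients, all n at once, no refitting:
#   beta_(i) = beta - (X'X)^{-1} x_i e_i / (1 - h_i)
beta_loo = beta - (XtX_inv @ (X * (e / (1 - h))[:, None]).T).T

# jackknife covariance of beta
beta_bar = beta_loo.mean(axis=0)
V_jack = (n - 1) / n * (beta_loo - beta_bar).T @ (beta_loo - beta_bar)
```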

If I find the time, I will have a look at whether I can do a fully object-oriented implementation of these linearized score-based computations for sandwich::vcovBS. Not sure when I will find the time, though...