s3alfisc opened 6 months ago
I'm currently working on a PR that implements a score bootstrap so wildboottest can be used with statsmodels maximum likelihood models (or really any statsmodels model that implements `score_obs` and `hessian`). I can go ahead and write a PR for this as well.
Was the idea that you can choose `seed` or input an `rng`, which determines whether an `rng` is created during instantiation or the passed `rng` is used directly?
> I'm currently working on a PR that implements a score bootstrap so wildboottest can be used with statsmodels maximum likelihood models [...]
Oh wow, super cool! Let me know if / once you'd like me to review anything!
> Was the idea that you can choose `seed` or input an `rng`, which determines whether an `rng` is created during instantiation or the passed `rng` is used directly?
So my hunch was that it is best practice to really only define one `rng` at the top of the script and then to pass it to every function in the script - but I'm not 100% sure. Now that 10 minutes have passed, I think that you should be able to easily convince me that we should save ourselves some work and just stick with `seed`. In this case, I'd simply change `ritest` in pyfixest to have a `seed` argument instead of `rng`, and then I'd have my desired consistency.
Else you're right, the idea here would have been to allow both a `seed` and an `rng` argument, with the `rng` taking priority, and `seed` eventually being deprecated.
> Else you're right, the idea here would have been to allow both a `seed` and an `rng` argument, with the `rng` taking priority, and `seed` eventually being deprecated.
Yes, that's how I was thinking of it actually. If the user cares about having the same random number generator, then they have the option, but if they want to just put in a seed for reproducibility, then they have that too.
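For concreteness, a minimal sketch of that argument pattern (the function name and body are placeholders, not the actual wildboottest or pyfixest API):

```python
import numpy as np


def bootstrap_inference(data, seed=None, rng=None):
    """Hypothetical signature: accept either `seed` or `rng`.

    If both are passed, `rng` takes priority; `seed` could later be
    deprecated with a warning.
    """
    if rng is None:
        # Fall back to building a generator from the seed
        # (seed=None draws fresh OS entropy).
        rng = np.random.default_rng(seed)
    # ... draw bootstrap weights from `rng` ...
    return rng.choice([-1.0, 1.0], size=len(data))


# Reproducible via a seed:
draws_a = bootstrap_inference(np.zeros(5), seed=123)
# Or via an explicit generator, which takes priority over `seed`:
draws_b = bootstrap_inference(np.zeros(5), rng=np.random.default_rng(123))
```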
In re: The score bootstrap: absolutely! In fact, I definitely want you to take a look to make sure that it all makes sense. The paper you cite in the repo seems to suggest that score bootstrap has drawbacks, so I want to make sure that it even makes sense to implement it (apparently over rejection is a problem for score bootstrap, but not sure under what conditions).
Also, how do you usually validate that these bootstraps are giving results that make sense? With made-up data and homoskedastic errors, standard errors from the bootstrap and from standard regression results should be very close to one another, right?
> In re: The score bootstrap: absolutely! In fact, I definitely want you to take a look to make sure that it all makes sense. The paper you cite in the repo seems to suggest that score bootstrap has drawbacks, so I want to make sure that it even makes sense to implement it (apparently over rejection is a problem for score bootstrap, but not sure under what conditions).
Yes - here's a short comment from @droodman on the score bootstrap. The tldr is that he did not find the wild bootstrap to be very useful for non-linear estimators. I am not aware of any extensive simulation exercises on the practical usefulness of the score bootstrap.
A decent alternative for non-linear models seems to be the CRV3 robust estimator, which pyfixest supports for Poisson regression. You can find some simulations in @zeileis' paper on the `sandwich` R package. MacKinnon, Nielsen & Webb show that it is equivalent to a cluster jackknife. It's implemented in pyfixest for `feols` and `fepois`. Maybe this suits your use case better than the score bootstrap?
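For reference, a minimal sketch of requesting CRV3 standard errors in pyfixest, assuming the dict-style `vcov` argument that current versions accept (the toy data and column names are made up; please check the docs for the exact syntax):

```python
import numpy as np
import pandas as pd
import pyfixest as pf

# Toy clustered Poisson data
rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame(
    {
        "y": rng.poisson(np.exp(0.2 * x)),
        "x": x,
        "group": rng.integers(0, 20, 500),
    }
)

# CRV3 / cluster-jackknife standard errors: the dict maps the
# vcov type to the cluster variable.
fit = pf.fepois("y ~ x", data=df, vcov={"CRV3": "group"})
fit.summary()
```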
Nevertheless I would be happy about an implementation - it's always better to have it than not to have it - and also to review =)
Regarding tests, I employ two strategies. The first is a "large number of bootstrap draws" strategy, in which I resample randomly and test fwildclusterboot against WildBootTests.jl. We test wildboottest this way against fwildclusterboot here. These tests take annoyingly long to run =D
The second strategy is to test under full enumeration. In this case, the weights matrix is deterministic and all packages should produce exactly identical bootstrap t-statistics. This happens here for wildboottest.
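As a side note on what "full enumeration" means here: with G clusters and Rademacher weights there are only 2^G distinct sign vectors, so the weights matrix can be built deterministically instead of sampled. A small illustrative sketch (not tied to any package's internals):

```python
import itertools

import numpy as np


def enumerate_rademacher_weights(n_clusters: int) -> np.ndarray:
    """Return the full (2**G, G) matrix of Rademacher sign vectors.

    Under full enumeration the weights matrix is deterministic, so two
    implementations looping over the same rows must produce identical
    bootstrap t-statistics.
    """
    return np.array(list(itertools.product([-1.0, 1.0], repeat=n_clusters)))


weights = enumerate_rademacher_weights(5)
assert weights.shape == (2**5, 5)
```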
The Kline & Santos paper proposing the score bootstrap has some simulations, I think with the probit model. Code and data, FWIW, are available from https://eml.berkeley.edu/~pkline/.
My perspective is warped by the intense experience of finding a double-hump distribution in a non-linear estimator on a particular data set through a non-parametric bootstrap. I think the score bootstrap and conventional standard errors would just characterize the curvature of the likelihood surface at whatever maximum was found, and miss that there was a whole other local maximum nearly as good. But I'm not suggesting that the score bootstrap is worse than more standard, non-bootstrap approaches.
Thank you @droodman ! I was trying to find the code for Kline and Santos, but the journal link was dead for me.
So I suppose there's still value to the score bootstrap in certain cases then? I skimmed your re-analysis paper briefly; was Figure 4 generated by the score bootstrap?
I did the microcredit work before I had ever heard of a wild bootstrap, and anyway before the score bootstrap was published. So figure 4 was generated by non-parametric or "pairs" bootstrap--the original bootstrap, where you just resample the data, in this case by cluster. It's much more CPU intensive, but gives you more information about the distribution of the estimator in a given context. The score bootstrap is entirely anchored at the full-sample solution, and just improves on the clustering adjustment when large-sample assumptions might not be accurate, like when there are few clusters.
But sure, the score bootstrap can have value, maybe for simple non-linear estimators where multiple local maxima are impossible or unlikely, and where clusters are few.
Oh I see, so generally, your critique here is for ALL non-parametric bootstraps of nonlinear models and not just for this particular one, the score bootstrap?
I think non-parametric bootstrapping of non-linear estimators can be a very good thing. For me, it gave important insight, exposing the bimodality of a nonlinear estimator.
@amichuda - @mattdwebb just posted a new WP on cluster robust inference for logit models. Maybe relevant for your use case? =)
Thanks! So this is the same as the jackknife you mentioned for the Poisson models in pyfixest?
Yes, it seems to be the same jackknife estimator as in pyfixest - their conclusion in the abstract seems to be that the CV3 jackknife performs better than conventional CV1 estimators. Among the class of new estimators they introduce, it is not necessarily the best (but seems hard to produce a clear ranking). The jackknife is of course nice as it is easy to implement, but might be computationally costly - they develop an alternative estimator that seems to be faster to compute.
I agree, this is also my understanding from a quick first read. Generally, JK is always an "ok" choice in terms of performance, even if there might be other flavors that perform a bit better. It may just be costly to compute - but it also may not, e.g., in the case of relatively few clusters.
In some cases, you can compute the JK _exactly_ and _more efficiently_ without any resampling, via the HC3/CV3 estimators. This is the case for cross-section data or for data with a relatively low number of observations overall. If it is not possible to compute it exactly, you might still get a computationally cheaper _approximate_ computation via linearization.
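To illustrate the cross-section case, a minimal numpy sketch of the leverage-based HC3 computation, which avoids refitting the model n times (this is just the textbook formula, not the pyfixest or sandwich implementation):

```python
import numpy as np


def hc3_vcov(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """HC3 covariance for OLS via leverages, without any resampling.

    HC3 rescales each squared residual by 1 / (1 - h_ii)^2, where h_ii
    are the diagonal elements of the hat matrix.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # h_ii
    omega = resid**2 / (1.0 - leverage) ** 2
    return XtX_inv @ (X.T * omega) @ X @ XtX_inv


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
print(np.sqrt(np.diag(hc3_vcov(X, y))))  # HC3 standard errors
```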
If I find the time, I will have a look at whether I can do a fully object-oriented implementation of these linearized score-based computations for `sandwich::vcovBS`. Not sure when I will find the time, though...
`wildboottest()` requires a `seed` argument, `rwolf()` requires a `seed`, `ritest()` wants an `rng`, and `confint()` uses a `seed`. Likely best practice to use a global `np.random.default_rng()` for all function arguments. This requires changes to `wildboottest()`.
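A minimal sketch of that pattern, with a placeholder function standing in for the actual wildboottest / pyfixest signatures:

```python
import numpy as np


def draw_bootstrap_weights(n_draws: int, n_clusters: int, rng: np.random.Generator) -> np.ndarray:
    # Placeholder for any routine that needs randomness: it receives
    # the generator instead of creating its own.
    return rng.choice([-1.0, 1.0], size=(n_draws, n_clusters))


# One generator at the top of the script, passed to every call below,
# so the whole script is reproducible from a single seed.
rng = np.random.default_rng(20240321)
w1 = draw_bootstrap_weights(999, 10, rng=rng)
w2 = draw_bootstrap_weights(999, 10, rng=rng)  # continues the same stream
```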