Closed davidskalinder closed 4 months ago
Oh wow, I just realized that all of Jann's code is up at https://github.com/benjann/oaxaca under an MIT license. And not only that, there's an option in his command to bootstrap the standard errors.
So I should probably check his implementation to see how it handles the SEs in cases like this.
I believe #27 fixes this by setting dropped terms to zero (as Jann does) instead of `NA`. Closing.
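For reference, the eventual fix described above can be sketched in a few lines: before summarizing the bootstrap draws, dropped terms are treated as exact zeros rather than `NA`s, so `sd()` and `quantile()` run over every replicate. (This is an illustrative sketch, not the package's actual code.)

```r
# Illustrative replicate values for one term, with NAs where the
# term was dropped in a bootstrap run
boot_est <- c(0.4, NA, 0.3, NA, 0.5)

# Treat dropped terms as exact zeros (as Jann's oaxaca does)
boot_est[is.na(boot_est)] <- 0

sd(boot_est)                        # SE now computed over all replicates
quantile(boot_est, c(0.025, 0.975)) # percentile CI, no na.rm needed
```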
At the moment, if any of the bootstrap runs contain `NA`s for any terms (as they do when one of the groups has no variation for one of a set of dummies), then `quantile()`, rather sensibly, chokes. Here's a reprex of the simpler case, in which all of the bootstrap runs have the `NA`:

Created on 2024-03-29 with reprex v2.1.0
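The reprex output didn't survive above, so here is a minimal reconstruction of the simple case (the data and names are made up for illustration): a bootstrap matrix in which one term is `NA` in every run, which makes `quantile()` error because `na.rm` defaults to `FALSE`.

```r
# Hypothetical bootstrap estimates: 100 runs x 2 terms, where x2 was
# dropped (NA) in every run, as happens when one group has no
# variation for one of a set of dummies
set.seed(1)
boot_est <- cbind(
  x1 = rnorm(100),
  x2 = rep(NA_real_, 100)
)

# quantile() refuses NAs unless na.rm = TRUE, so this errors
res <- tryCatch(
  apply(boot_est, 2, quantile, probs = c(0.025, 0.975)),
  error = function(e) conditionMessage(e)
)
res
```

The captured error message is the familiar `missing values and NaN's not allowed if 'na.rm' is FALSE`.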
The more complicated case is when some of the bootstrap runs have more terms than others. I assume that at the moment this will throw the same error, but there are probably some additional complexities in this case.
We could get rid of the error by just setting `na.rm = TRUE` for the bootstrap summary calcs (at the moment, `quantile()` for the CIs and `sd()` for the SEs). In the easy case, where a term is always dropped (or more precisely, when it's dropped for the point estimate), I think it's safe to do this, since then we're simply not worrying about calculating the sampling distribution for terms that we're not estimating.

In the complicated case, where the point estimate contains a term that gets dropped in some but not all of the bootstrap runs, the choice of what to do is less clear. Here are some wrinkles I can think of:

- Should the `NA`s in fact be accounted for in the distribution somehow? After all, the zero-variance variable was actually drawn from the sample, so maybe those draws should "count"? I don't think I know the answer to this, or where to find it... I might have to talk to some nerdy friends about this. Though TBH I might be slicing this too fine -- I think just using `na.rm = TRUE` will produce a pretty-darn-good estimate, so it's probably worth doing that at first and correcting it later if a more experienced statistician convinces us otherwise. (Or maybe @sinanpl you know the answer to this already?)
- Should we simply discard the runs that produce `NA` estimates, or keep resampling until we have the full number of non-`NA` estimates? I think the second one is better, though it will make the code uglier since I think it'll require a `while` or `until` or something.
- Either way, we'll need to keep track of the `n` for each term, since they could all be different. Not sure where to do this; it seems like we should probably both put it in the output somewhere and raise a warning or some such.

@sinanpl, let me know if you have strong opinions on any of these choices.
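To make the `na.rm = TRUE` idea and the per-term `n` concrete, here's a hedged sketch of what the summary calcs might look like (the matrix, column names, and output layout are all illustrative, not the package's actual API):

```r
# Hypothetical complicated case: x2 is dropped in some runs but not all
set.seed(2)
boot_est <- cbind(
  x1 = rnorm(100),
  x2 = c(rnorm(60), rep(NA_real_, 40))
)

# Summarize each term with na.rm = TRUE, and report the effective n,
# which can differ term by term
summ <- data.frame(
  term = colnames(boot_est),
  se   = apply(boot_est, 2, sd, na.rm = TRUE),
  lo   = apply(boot_est, 2, quantile, probs = 0.025, na.rm = TRUE),
  hi   = apply(boot_est, 2, quantile, probs = 0.975, na.rm = TRUE),
  n    = colSums(!is.na(boot_est))
)

# Warn when any term's effective n falls short of the number of runs
if (any(summ$n < nrow(boot_est))) {
  warning("some bootstrap runs dropped one or more terms; ",
          "effective n varies by term")
}
summ
```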
Meantime I can at a minimum try to come up with a reprex/test for the complicated case, and possibly fix the simple case (hopefully without hiding the problems of the complicated case).
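For what the resample-until-complete option might look like, here's a toy sketch of the `while` loop: `one_boot()` is a made-up stand-in for whatever refits the decomposition on a resampled data set, and here it simply returns `NA` some of the time.

```r
# Hypothetical option two for the complicated case: keep drawing
# bootstrap samples until we have B replicates with no NA
set.seed(3)
one_boot <- function() if (runif(1) < 0.3) NA_real_ else rnorm(1)

B <- 200
est <- numeric(0)
draws <- 0L
while (length(est) < B) {
  draws <- draws + 1L
  b <- one_boot()
  if (!is.na(b)) est <- c(est, b)
}

length(est)  # exactly B complete replicates
draws        # total draws needed, probably worth reporting too
```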