sinanpl / OaxacaBlinder

R implementation of Oaxaca-Blinder gap decomposition
MIT License

Handle bootstraps when terms are dropped from some or all runs #23

Closed davidskalinder closed 4 months ago

davidskalinder commented 5 months ago

At the moment, if any of the bootstrap runs contain NAs for any terms (as they do when one of the groups has no variation for one of a set of dummies), then quantile(), rather sensibly, chokes. Here's a reprex of the simpler case, in which all of the bootstrap runs have the NA:

library(OaxacaBlinder)

chicago_mod <- chicago
chicago_mod$too_young <- chicago_mod$age < 19

fmla_tooyoung_dum <-
  ln.real.wage ~
  LTHS + some.college + college + high.school |
  too_young

point_est_only <-
  OaxacaBlinderDecomp(
    fmla_tooyoung_dum,
    chicago_mod,
    type = "threefold"
  )
point_est_only$varlevel
#>              endowments coefficients interaction
#> (Intercept)   0.0000000     1.422265   0.0000000
#> LTHS         -0.1193847    -0.904495   0.5965789
#> some.college         NA           NA          NA
#> college              NA           NA          NA
#> high.school          NA           NA          NA

bootstrap_attempt <-
  OaxacaBlinderDecomp(
    fmla_tooyoung_dum,
    chicago_mod,
    type = "threefold",
    n_bootstraps = 10
  )
#> Error in quantile.default(sapply(overall_level_list, `[[`, coeftype), : missing values and NaN's not allowed if 'na.rm' is FALSE

Created on 2024-03-29 with reprex v2.1.0

The more complicated case is when some of the bootstrap runs have more terms than others. I assume that at the moment this will throw the same error, but there are probably some additional complexities in this case.

We could get rid of the error by just setting na.rm = TRUE for the bootstrap summary calculations (at the moment, quantile() for the CIs and sd() for the SEs). In the easy case, where a term is always dropped (or more precisely, when it's dropped for the point estimate), I think it's safe to do this, since then we're simply not calculating the sampling distribution for terms that we're not estimating.
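To make the na.rm = TRUE option concrete, here is a minimal sketch in base R. The vectors `boot_draws` and `all_dropped` are made-up illustrations, not output from OaxacaBlinderDecomp(), and this is not the package's internal code. It only shows how quantile() and sd() behave with and without na.rm.

```r
# Hypothetical bootstrap draws for one term; the term was dropped (NA) in two runs.
boot_draws <- c(0.12, NA, 0.15, 0.09, NA, 0.11)

# The default behaviour reproduces the error from the reprex: capture its message.
default_result <- tryCatch(
  quantile(boot_draws, probs = c(0.025, 0.975)),
  error = function(e) conditionMessage(e)
)

# With na.rm = TRUE, the summaries use only the runs that estimated the term.
ci <- quantile(boot_draws, probs = c(0.025, 0.975), na.rm = TRUE)
se <- sd(boot_draws, na.rm = TRUE)

# If the term is dropped in every run, the summaries come back as NA
# instead of raising an error.
all_dropped <- rep(NA_real_, 6)
ci_dropped <- quantile(all_dropped, probs = c(0.025, 0.975), na.rm = TRUE)
se_dropped <- sd(all_dropped, na.rm = TRUE)
```

Note that in the partially dropped case the CI and SE are computed from fewer runs than requested, which is one reason the complicated case needs more thought.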

In the complicated case, where the point estimate contains a term that gets dropped in some but not all of the bootstrap runs, the choice of what to do is less clear. Here are some wrinkles I can think of:

@sinanpl, let me know if you have strong opinions on any of these choices.

Meantime I can at a minimum try to come up with a reprex/test for the complicated case, and possibly fix the simple case (hopefully without hiding the problems of the complicated case).

davidskalinder commented 5 months ago

Oh wow, I just realized that all of Jann's code is up at https://github.com/benjann/oaxaca under an MIT license. And not only that, there's an option in his command to bootstrap the standard errors.

So I should probably check his implementation to see how it handles the SEs in cases like this.

davidskalinder commented 4 months ago

I believe #27 fixes this by setting dropped terms to zero (as Jann does) instead of NA. Closing.
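For anyone reading later, the general idea of zero-filling dropped terms before summarising can be sketched as follows. This is an illustration only: the matrix `boot_runs` and its values are invented, and the code is not taken from #27 or from Jann's Stata command.

```r
# Hypothetical per-run estimates: rows are bootstrap runs, columns are terms.
boot_runs <- rbind(
  c(LTHS = -0.12, college = 0.30),
  c(LTHS = -0.10, college = NA),   # term dropped in this run
  c(LTHS = -0.15, college = 0.25),
  c(LTHS = -0.11, college = NA)    # term dropped in this run
)

# Treat a dropped term as contributing zero rather than as missing.
boot_zeroed <- boot_runs
boot_zeroed[is.na(boot_zeroed)] <- 0

# The summaries now run without na.rm, and every run counts toward every term.
se <- apply(boot_zeroed, 2, sd)
ci <- apply(boot_zeroed, 2, quantile, probs = c(0.025, 0.975))
```

Compared with na.rm = TRUE, zero-filling keeps the number of runs constant across terms, but the zeros do enter the SE and CI for any term that is dropped in only some runs.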