sinanpl / OaxacaBlinder

R implementation of Oaxaca-Blinder gap decomposition
MIT License

Ensure matching coefficients are subtracted #17

Closed davidskalinder closed 4 months ago

davidskalinder commented 5 months ago

I'm getting this when I run a model with lots of bootstraps and lots of IV categories. I need to debug it further, but I suspect it's happening because sometimes one group will have values for a category while the other group has none, so the number of coefficients comes out different? Which come to think of it might be pretty bad if the vectors are just getting recycled even though they don't match...

davidskalinder commented 5 months ago

Okay yes I changed the issue title because this problem is actually more serious than the error message suggested. In fact when group A doesn't have the same number of categories as group B, the wrong coefficients are subtracted from each other.

The error I was getting, `Warning in EX_a - EX_b: longer object length is not a multiple of shorter object length`, only occurs when the larger number of coefficients isn't a multiple of the smaller number. When the smaller number divides the larger number evenly, the function silently returns wrong calculations. Classic base R, man.
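For anyone following along, the recycling behavior is easy to reproduce in isolation with toy vectors (illustrative names and values, not the package's actual output):

```r
# Toy named vectors standing in for the groupwise means of the dummies.
# Group B is missing two of group A's levels.
EX_a <- c("(Intercept)" = 1.00, lvl1 = 0.10, lvl2 = 0.34, lvl3 = 0.26)
EX_b <- c("(Intercept)" = 1.00, lvl3 = 0.77)

# length(EX_a) is a multiple of length(EX_b), so base R recycles EX_b
# with no warning at all -- and keeps EX_a's names, hiding the mismatch:
EX_a - EX_b
# (Intercept)  lvl1   lvl2   lvl3
#        0.00 -0.67  -0.66  -0.51
```

Note that the `lvl3` result is nonsense: `0.26 - 0.77` subtracts group B's `lvl3` mean from group A's, but it lands under the right name only by accident of position.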

Technically speaking, we could fix this by matching the coefficients somehow -- seems like if we stick to base R then merge() is probably the right tool for this (though it could lose the nice elegance of the current subtraction formulas).

However now that I think about it, I'm not sure what the proper way to handle this is? When one group doesn't have a variable level, what happens? I suspect that its terms for that variable level should be zero. After all, non-inclusion of a variable in a model is tantamount to declaring that the coefficient of that variable is zero, and that's basically what we're doing by not including a dummy for the missing level. And I think the mean of the missing variable should also be zero, since it could be written as a dummy whose value is always zero.
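If zero really is the right fill-in, the matching itself may not even need `merge()`: a small helper that unions the names and zero-fills before subtracting would do it. A sketch (the helper name is made up, not the package's code):

```r
# Hypothetical helper: align two named coefficient vectors on the union
# of their names, fill missing terms with zero, then subtract.
subtract_matched <- function(a, b) {
  terms <- union(names(a), names(b))
  a_full <- setNames(numeric(length(terms)), terms)
  b_full <- a_full
  a_full[names(a)] <- a
  b_full[names(b)] <- b
  a_full - b_full
}

subtract_matched(c(x = 1, y = 2), c(y = 0.5, z = 3))
#  x = 1, y = 1.5, z = -3 -- no silent recycling
```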

And in fact that suggests another possibility to me: it should be possible somehow to force R to create that all-zero dummy, probably by forcing it to use the dummies from the whole dataset when it builds the models for the groupwise data? Worth me looking into that, since then we could leave all the coefficient math intact.
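If I understand it right, `model.matrix()`'s `xlev` argument does exactly this kind of forcing: passing the full data's factor levels makes the group design matrix keep an all-zero column for the missing level. A toy illustration (not the package's code):

```r
dat <- data.frame(educ = factor(c("col", "hs", "uni", "col", "hs", "uni")))
grp <- dat[dat$educ != "uni", , drop = FALSE]
grp$educ <- droplevels(grp$educ)  # simulate lm() dropping unused levels

colnames(model.matrix(~ educ, data = grp))
# "(Intercept)" "educhs" -- the "uni" column is gone

mm <- model.matrix(~ educ, data = grp, xlev = list(educ = levels(dat$educ)))
colnames(mm)
# "(Intercept)" "educhs" "educuni" -- "uni" is back as an all-zero dummy
```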

davidskalinder commented 5 months ago

Hmm, well, Jann's Stata `oaxaca` errors by default when this happens, and offers an option to power through it anyway (in which case it estimates endowments and interactions for the missing term but not coefficients, contradicting what I said in my comment above).

Click to expand long Stata "reprex"

```
. clear all
. 
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta
(Excerpt from the Swiss Labor Market Survey 1998)
. 
. drop if isco > 3
(890 observations deleted)
. tabulate isco, nofreq generate(isco)
. gen fem2 = female
. replace fem2 = 0 if isco == 1
(25 real changes made)
. 
. oaxaca lnwage normalize(isco?), by(female)
(normalized: isco1 isco2 isco3)

Blinder-Oaxaca decomposition                Number of obs     =        757
                                            Model             =     linear
Group 1: female = 0                         N of obs 1        =        416
Group 2: female = 1                         N of obs 2        =        341

endowments:   (X1 - X2) * b2
coefficients: X2 * (b1 - b2)
interaction:  (X1 - X2) * (b1 - b2)

------------------------------------------------------------------------------
      lnwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   3.591531   .0209001   171.84   0.000    3.550567    3.632494
     group_2 |   3.390727   .0277831   122.04   0.000    3.336274    3.445181
  difference |   .2008034   .0347666     5.78   0.000    .1326622    .2689446
  endowments |   .0234518   .0149931     1.56   0.118   -.0059341    .0528377
coefficients |   .1620186   .0356882     4.54   0.000     .092071    .2319661
 interaction |    .015333   .0173468     0.88   0.377   -.0186661    .0493321
-------------+----------------------------------------------------------------
endowments   |
       isco1 |  -.0044007     .00572    -0.77   0.442   -.0156116    .0068103
       isco2 |   .0162171   .0087121     1.86   0.063   -.0008582    .0332924
       isco3 |   .0116354   .0104376     1.11   0.265   -.0088219    .0320926
-------------+----------------------------------------------------------------
coefficients |
       isco1 |   .0079069   .0061258     1.29   0.197   -.0040994    .0199132
       isco2 |  -.0105815   .0137124    -0.77   0.440   -.0374574    .0162944
       isco3 |  -.0427462   .0373314    -1.15   0.252   -.1159144    .0304219
       _cons |   .2074394   .0466316     4.45   0.000    .1160432    .2988356
-------------+----------------------------------------------------------------
interaction  |
       isco1 |   .0084262   .0067751     1.24   0.214   -.0048528    .0217052
       isco2 |   -.007104   .0093066    -0.76   0.445   -.0253447    .0111367
       isco3 |   .0140108   .0124097     1.13   0.259   -.0103117    .0383334
------------------------------------------------------------------------------

. oaxaca lnwage normalize(isco?), by(fem2) relax
(model 2 has zero variance coefficients)
(normalized: isco1 isco2 isco3)

Blinder-Oaxaca decomposition                Number of obs     =        757
                                            Model             =     linear
Group 1: fem2 = 0                           N of obs 1        =        441
Group 2: fem2 = 1                           N of obs 2        =        316

endowments:   (X1 - X2) * b2
coefficients: X2 * (b1 - b2)
interaction:  (X1 - X2) * (b1 - b2)

------------------------------------------------------------------------------
      lnwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   3.577831   .0221377   161.62   0.000    3.534442    3.621221
     group_2 |    3.39396   .0266918   127.15   0.000    3.341644    3.446275
  difference |    .183872   .0346775     5.30   0.000    .1159053    .2518387
  endowments |   .0177072    .008676     2.04   0.041    .0007025    .0347118
coefficients |   .1498927   .0372997     4.02   0.000    .0767867    .2229987
 interaction |   .0162722   .0163111     1.00   0.318   -.0156969    .0482412
-------------+----------------------------------------------------------------
endowments   |
       isco1 |  -.0104844   .0042139    -2.49   0.013   -.0187435   -.0022254
       isco2 |   .0118048    .005784     2.04   0.041    .0004684    .0231412
       isco3 |   .0163868   .0066403     2.47   0.014     .003372    .0294016
-------------+----------------------------------------------------------------
coefficients |
       isco1 |          0  (omitted)
       isco2 |  -.0036823   .0125799    -0.29   0.770   -.0283385    .0209739
       isco3 |  -.0221151   .0274036    -0.81   0.420   -.0758253     .031595
       _cons |   .1756901   .0355231     4.95   0.000    .1060662     .245314
-------------+----------------------------------------------------------------
interaction  |
       isco1 |   .0088502   .0083922     1.05   0.292   -.0075982    .0252987
       isco2 |  -.0016976   .0058189    -0.29   0.770   -.0131026    .0097073
       isco3 |   .0091195   .0113399     0.80   0.421   -.0131062    .0313453
------------------------------------------------------------------------------

. oaxaca lnwage normalize(isco?), by(fem2)
(model 2 has zero variance coefficients)
dropped coefficients or zero variances encountered
specify -noisily- to view model estimation output
specify -relax- to ignore
r(499);
```

Seeing that, I'm inclined to simply be strict about it and refuse to run models where one of the terms is dropped, at least to begin with. Then if we want to add a "relax" option or some such we can do that later.

Of course then the trick is what to do with bootstraps where only some of the samples will drop a level. Seems like the best way will be to allow the bootstraps to keep rolling somehow, probably either reporting the number that was thrown away or continuing until the specified number is reached (though we'd need to be careful that this doesn't result in a too-long loop).
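The "keep rolling" idea could look something like this sketch (function and argument names are made up; the real version would live inside the package's bootstrap loop):

```r
# Hypothetical: draw bootstrap samples until n_target fits succeed,
# with a hard cap so a pathological dataset can't loop forever.
bootstrap_until <- function(dat, fit_fun, n_target, max_tries = 10 * n_target) {
  fits <- vector("list", n_target)
  kept <- 0L
  tries <- 0L
  while (kept < n_target && tries < max_tries) {
    tries <- tries + 1L
    samp <- dat[sample(nrow(dat), replace = TRUE), , drop = FALSE]
    # a fit_fun that stop()s on a dropped level just gets retried
    fit <- tryCatch(fit_fun(samp), error = function(e) NULL)
    if (!is.null(fit)) {
      kept <- kept + 1L
      fits[[kept]] <- fit
    }
  }
  if (kept < n_target)
    warning("only ", kept, " of ", n_target, " bootstrap fits succeeded")
  list(fits = fits[seq_len(kept)], dropped = tries - kept)
}
```

This covers both options at once: it keeps going until the requested number of fits is reached, `dropped` reports how many samples were thrown away, and `max_tries` guards against the too-long loop.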

davidskalinder commented 5 months ago

One other wrinkle I just thought of about this: one easy way to see if omitted levels will run into trouble would be to compare the total number of levels for the two models. However, this could occasionally run into the case where both group models drop different levels (or, in bootstraps, where both group models drop the same level but other samples include it).

A nearly-as-easy way around this would be to run an `lm()` (or maybe just a `model.matrix()` or some such) on the entire dataset, then count the levels and compare each group's levels to that.
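Something along these lines (toy data; the data frames and formula are placeholders):

```r
# Sketch: detect levels a group's model would drop by comparing its
# design-matrix columns to those of the full dataset.
dat <- data.frame(
  y    = rnorm(8),
  educ = factor(c("col", "hs", "uni", "col", "hs", "uni", "col", "hs"))
)
dat_a <- droplevels(dat[dat$educ != "uni", ])  # group A has no "uni" rows

f <- y ~ educ
full_cols  <- colnames(model.matrix(f, data = dat))
group_cols <- colnames(model.matrix(f, data = dat_a))

setdiff(full_cols, group_cols)
# "educuni" -- this group would silently lose the "uni" term
```

Because each group is checked against the full data rather than against the other group, this also catches the case where both groups drop different levels.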

sinanpl commented 5 months ago

Can you share an R reprex for this @davidskalinder?

davidskalinder commented 5 months ago

> Can you share an R reprex for this @davidskalinder?

Yes, sorry -- just now able to get back to this. This one is tough to reprex because I don't know how to show the result of the error clearly without dropping a `browser()` into the middle of `calculate_coefs()` (at line 171, for example). But if you do that and load the local package, then the following code should drop you into the debugger where the error occurs:

```r
set.seed(1973)

threefold <-
  OaxacaBlinderDecomp(
    formula = real_wage ~ education | too_young,
    data = chicago_long |> dplyr::mutate(too_young = age < 19),
    type = "threefold",
    baseline_invariant = TRUE,
    n_bootstraps = NULL
  )
```

Then you can do this:

```
Browse[1]> EX_a
          (Intercept)      educationcollege  educationhigh.school         educationLTHS educationsome.college    education.baseline 
           1.00000000            0.10413476            0.33690658            0.26186830            0.24808576            0.04900459 
Browse[1]> EX_b
       (Intercept)      educationLTHS education.baseline 
         1.0000000          0.7692308          0.2307692 
Browse[1]> EX_a - EX_b
          (Intercept)      educationcollege  educationhigh.school         educationLTHS educationsome.college    education.baseline 
            0.0000000            -0.6650960             0.1061374            -0.7381317            -0.5211450            -0.1817646 
```

Group B doesn't have anybody old enough to get beyond high school, so model B doesn't have any data for the other degree categories and so doesn't produce any estimates for them. So when B's estimates are subtracted from A's, the terms don't match, and R hides the problem by recycling group B's shorter vector to fill in the missing levels. (Thank heaven it at least throws a warning if the smaller vector doesn't recycle evenly or I might never have caught it.) The same thing happens with the betas.

So somehow we should make sure the terms are matched when they're subtracted (and multiplied for the interaction estimates), and we'll need to figure out how to handle it when terms get dropped, probably in either model.

sinanpl commented 5 months ago

clear enough, good catch! will look into this

davidskalinder commented 5 months ago

> clear enough, good catch! will look into this

Sounds good. This is a pretty high priority for me, so I might try to make some progress on it as well (though I'm not 100% sure when I'll get to it). Any thoughts on how to manage the workflow to try to reduce duplicating each other's work? Maybe we should both try to frequently push to feature branches so we can keep an eye on where the other one has got to? Let me know if there's a strategy you think will work best...

sinanpl commented 5 months ago

sounds good! on my side, my coding investment will be mostly limited to weekends for now :p i'll push anything I have on branches of my repo. I would appreciate your reviews maybe. ok if I assign you as reviewer for something when ready?

davidskalinder commented 5 months ago

Sounds great! This is one of several projects I'm working on at the moment so I tend to have slightly irregular blocks of time to focus on this. But yeah definitely feel free to assign me!

davidskalinder commented 5 months ago

So it turns out that this works fine with dummy variables but not for a categorical variable, since R seems to leave the dummy estimates in with NAs but removes the unused categorical levels.
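A minimal illustration of the asymmetry (toy data, not the actual test):

```r
set.seed(1)
d  <- data.frame(y = rnorm(6), g = factor(rep(c("a", "b", "c"), each = 2)))
d2 <- d[d$g != "c", ]              # this group has no "c" observations

names(coef(lm(y ~ g, data = d2)))  # "(Intercept)" "gb": the "c" level vanishes
d2$g_c <- as.numeric(d2$g == "c")  # same thing as an explicit all-zero dummy
coef(lm(y ~ g_c, data = d2))       # g_c stays in the output, but as NA
```

So the dummy-variable case keeps the term names aligned (just with NAs to handle), while the factor case loses the term entirely.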

I just pushed a test that shows this to https://github.com/davidskalinder/OaxacaBlinder/blob/fix_dropped_terms/tests/testthat/test-oaxaca.R . It took a heckuva long time to figure that out though, so that's all that's there for now...

davidskalinder commented 5 months ago

> I just pushed a test that shows this to https://github.com/davidskalinder/OaxacaBlinder/blob/fix_dropped_terms/tests/testthat/test-oaxaca.R .

Whoops, and it appears that the test for the good case where no terms are missing passes when run interactively but fails when run via `devtools::test()`. Fun times. So I'll have to debug the test further.

davidskalinder commented 5 months ago

> I just pushed a test that shows this to https://github.com/davidskalinder/OaxacaBlinder/blob/fix_dropped_terms/tests/testthat/test-oaxaca.R .
>
> Whoops, and it appears that the test for the good case where no terms are missing passes when run interactively but fails when run via `devtools::test()`. Fun times. So I'll have to debug the test further.

Whew, okay, apparently tests run in a locale with a different sort order, so the reference level was being chosen differently than it was when run interactively. Should be fixed as of 5e01714.

On with the show.

davidskalinder commented 5 months ago

All right this is all looking pretty good at 0c5d7da. Hopefully I'll be able to put a PR together for this tomorrow (though if it slips then it might have to wait until after the weekend).

I think the big outstanding things are:

davidskalinder commented 4 months ago

Fixed in #20, closing.