Figure out why threefold estimates don't add up to gap when some terms are dropped

davidskalinder commented 5 months ago

Well, here is a puzzle. The sum of all the estimates of a threefold decomposition should yield the same number as the gap between the two groups. And it does for OaxacaBlinderDecomp()'s output, when no variables are dropped due to zero variance in one of the groups.

But when some terms are dropped, the total doesn't add up. This same wrong result also occurs when the same decomposition is run using Hlavac's oaxaca::oaxaca(), but not when using Jann's Stata command. In fact the Stata command produces estimates for some of the terms that it seems should have no variance and so nothing to estimate. However it seems like these impossible estimates must be correct, since when they are included the sums come out right, whereas when they are dropped the sums are the same as the wrong sums produced by the two R packages.

Here's a reprex showing how OaxacaBlinderDecomp() is correct when no terms are dropped and how both OaxacaBlinderDecomp() and oaxaca::oaxaca() are wrong when some terms are dropped:

(Click to expand long regex)

``` r library(OaxacaBlinder) chicago_mod <- chicago chicago_mod$too_young <- chicago_mod$age < 19 fmla_tooyoung_dum <- ln.real.wage ~ LTHS + some.college + college + high.school | too_young fmla_foreign_dum <- ln.real.wage ~ LTHS + some.college + college + high.school | foreign.born obd_foreign <- OaxacaBlinderDecomp( fmla_foreign_dum, chicago_mod, type = "threefold" ) # Should match obd_foreign$varlevel |> as.matrix() |> sum(na.rm = TRUE) #> [1] 0.1433657 obd_foreign$gaps$gap #> [1] 0.1433657 obd_tooyoung <- OaxacaBlinderDecomp( fmla_tooyoung_dum, chicago_mod, type = "threefold" ) # Should match obd_tooyoung$varlevel |> as.matrix() |> sum(na.rm = TRUE) #> [1] 0.9949644 obd_tooyoung$gaps$gap #> [1] 0.5262908 oax_tooyoung <- oaxaca::oaxaca(fmla_tooyoung_dum, chicago_mod, R = NULL) #> oaxaca: oaxaca() performing analysis. Please wait. # Should match oax_tooyoung$threefold$variables |> sum(na.rm = TRUE) #> [1] 0.9949644 oax_tooyoung$y$y.diff #> [1] 0.5262908 ``` ^{Created on 2024-03-29 with [reprex v2.1.0](https://reprex.tidyverse.org)}

Then if I run this to dump the dataset to a .dta file,

chicago_mod |>
  dplyr::rename_with(\(x) stringr::str_replace_all(x, "\\.", "_")) |>
  haven::write_dta("chicago_mod.dta")

I can run the same decomposition in Stata:

(Click to expand long Stata output)

``` . use chicago_mod . oaxaca ln_real_wage LTHS some_college college high_school, by(too_young) relax (model 2 has zero variance coefficients) Blinder-Oaxaca decomposition Number of obs = 666 Model = linear Group 1: too_young = 0 N of obs 1 = 653 Group 2: too_young = 1 N of obs 2 = 13 endowments: (X1 - X2) * b2 coefficients: X2 * (b1 - b2) interaction: (X1 - X2) * (b1 - b2) ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- overall | group_1 | 2.625413 .0201749 130.13 0.000 2.585871 2.664955 group_2 | 2.099122 .0720749 29.12 0.000 1.957858 2.240387 difference | .5262908 .0748453 7.03 0.000 .3795968 .6729848 endowments | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 coefficients | .3369692 .0862894 3.91 0.000 .1678451 .5060933 interaction | .3087063 .0960509 3.21 0.001 .12045 .4969626 -------------+---------------------------------------------------------------- endowments | LTHS | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 some_college | 0 (omitted) college | 0 (omitted) high_school | 0 (omitted) -------------+---------------------------------------------------------------- coefficients | LTHS | -.904495 .1993447 -4.54 0.000 -1.295203 -.5137866 some_college | 0 (omitted) college | 0 (omitted) high_school | -.1808011 .0974085 -1.86 0.063 -.3717182 .0101161 _cons | 1.422265 .1601626 8.88 0.000 1.108352 1.736178 -------------+---------------------------------------------------------------- interaction | LTHS | .5965789 .1710337 3.49 0.000 .2613589 .9317988 some_college | -.1640118 .0248629 -6.60 0.000 -.2127423 -.1152813 college | -.0407052 .0113383 -3.59 0.000 -.0629279 -.0184825 high_school | -.0831556 .0968346 -0.86 0.390 -.2729479 .1066368 ------------------------------------------------------------------------------ . di -.1193847 -.904495 -.1808011 + 1.422265 + .5965789 -.1640118 -.0407052 -.0831556 .5262905 . di -.1193847 -.904495 + 1.422265 + .5965789 .9949642 ```

So I don't know where all those extra terms come from, but it looks like they're right (and I certainly trust Jann).

@sinanpl, any idea what's going on here?

davidskalinder commented 5 months ago

Okay too soon to tell, but here's an early theory.

Looking carefully, Jann's oaxaca, unlike the R packages, prints zeros for the missing estimates, rather than missings or complaints or (omitted) or anything (though it does print omitted for the standard errors). So it's possible that treating these estimates as zeros rather than missings makes the math work properly.

Running lm() just on the too_young group produces no estimates for the high school category, and the same is true running regress on that group in Stata. I think this is because it automatically gets treated as the baseline category? And if that (non-)estimate is treated as zero, X2 * (b1 - b2) (where group 2 is the too_youngs) would indeed end up as non-zero for the high school category, since X2 and b1 are non-zero and b2 is zero. And then all the interaction terms (X1 - X2) * (b1 - b2) would be nonzero since all the X1s and all the b1s are nonzero and all the X2s and b2s are treated as nonmissing.

So I'm starting to think that the proper solution here will be to convert the NAs that lm() creates when a variable has no variance to zeros?

This also does make clear the value of the warning that Jann's command raises when this happens, I think, since one could include an indefinite number of variable that occur only for one group in the data and get (some) estimates for them all. But maybe that's the intended function?

davidskalinder commented 5 months ago

Ah, I think this further supports my theory: adding the swap option (to flip the comparison groups) to the same Stata command shown above yields estimates for every term, which is what we'd expect if all of one group's missing estimates were treated as zeros:

(Click to expand long Stata output)

``` . oaxaca ln_real_wage LTHS some_college college high_school, by(too_young) relax swap (model 1 has zero variance coefficients) Blinder-Oaxaca decomposition Number of obs = 666 Model = linear Group 1: too_young = 1 N of obs 1 = 13 Group 2: too_young = 0 N of obs 2 = 653 endowments: (X1 - X2) * b2 coefficients: X2 * (b1 - b2) interaction: (X1 - X2) * (b1 - b2) ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- overall | group_1 | 2.099122 .0720749 29.12 0.000 1.957858 2.240387 group_2 | 2.625413 .0201749 130.13 0.000 2.585871 2.664955 difference | -.5262908 .0748453 -7.03 0.000 -.6729848 -.3795968 endowments | -.1893216 .0301481 -6.28 0.000 -.2484109 -.1302323 coefficients | -.6456755 .1057691 -6.10 0.000 -.8529792 -.4383717 interaction | .3087063 .0960509 3.21 0.001 .12045 .4969626 -------------+---------------------------------------------------------------- endowments | LTHS | -.4771942 .1240678 -3.85 0.000 -.7203625 -.2340258 some_college | .1640118 .0248629 6.60 0.000 .1152813 .2127423 college | .0407052 .0113383 3.59 0.000 .0184825 .0629279 high_school | .0831556 .0968346 0.86 0.390 -.1066368 .2729479 -------------+---------------------------------------------------------------- coefficients | LTHS | .3079161 .0514288 5.99 0.000 .2071175 .4087148 some_college | .1640118 .0248629 6.60 0.000 .1152813 .2127423 college | .0407052 .0113383 3.59 0.000 .0184825 .0629279 high_school | .2639567 .0328664 8.03 0.000 .1995397 .3283737 _cons | -1.422265 .1601626 -8.88 0.000 -1.736178 -1.108352 -------------+---------------------------------------------------------------- interaction | LTHS | .5965789 .1710337 3.49 0.000 .2613589 .9317988 some_college | -.1640118 .0248629 -6.60 0.000 -.2127423 -.1152813 college | -.0407052 .0113383 -3.59 0.000 -.0629279 -.0184825 high_school | -.0831556 .0968346 -0.86 0.390 -.2729479 .1066368 ------------------------------------------------------------------------------ ```

davidskalinder commented 5 months ago

One other thing which might turn out to be nothing but I want to note it while I'm thinking about it in case it turns out to matter later: if we really should count dropped estimates as zero for the decomposition math, then I think that probably also means we should also count them as zero when it comes to bootstrapping (as discussed in #23)? Since after all, a draw where all values of a variable are zero should be counted as a legitimate draw where everybody's value on that variable is zero.

This raises another point about this, which is that the case we're looking at has zeros for all rows of some variables. (Like, the too-young people really don't have college degrees, advanced degrees, etc.) That's quite different from the case where a group has missings for all rows of some variables. I should try throwing that at Jann's oaxaca command and see how it handles that...

davidskalinder commented 5 months ago

That's quite different from the case where a group has missings for all rows of some variables. I should try throwing that at Jann's oaxaca command and see how it handles that...

Ah, and a quick and sensible answer -- Stata refuses to run the model at all:

. replace college = . if too_young
(13 real changes made, 13 to missing)

. oaxaca ln_real_wage LTHS some_college college high_school, by(too_young) relax
no observations
r(2000);

. oaxaca ln_real_wage LTHS some_college college high_school, by(too_young) relax swap
no observations
r(2000);

davidskalinder commented 5 months ago

So I'm starting to think that the proper solution here will be to convert the NAs that lm() creates when a variable has no variance to zeros?

Having thought about this over the weekend, I'm increasingly confident that this is in fact the case. It will also resolve #23 pretty easily, since as mentioned above the estimates will simply be zero in any bootstrap runs where there are no non-zero responses.

It does make me a little nervous setting all those coefficients to zero when it seems like they should be undefined, but I'm happy to defer to Jann's implementation on this. That probably does warrant a raising a warning when it happens though.

There are a couple of cases which I'm not as sure how to handle:

When there are enough missings in a category that the point estimates will run, but some of the bootstraps won't
When there's no variance in a category due to every observation having a non-zero value

I should try to cook up a way to run these in Stata and see how they get handled. Then I think I'll be ready to get to work on a PR for this.

davidskalinder commented 5 months ago

When there's no variance in a category due to every observation having a non-zero value

So here's how this goes in Stata.

First, set up the same dataset as above and make a new version of college set to ` instead of 0 for all too-young people:

. use chicago_mod

. gen college2 = college

. replace college2 = 1 if too_young
(13 real changes made)

Next, the same decomposition as in https://github.com/sinanpl/OaxacaBlinder/issues/24#issue-2216096332, but with the noisily option to show each group's regression.

(Click to expand decomp using `college`)

``` . oaxaca ln_real_wage LTHS some_college college high_school, by(too_young) relax noisily Model for group 1 Source | SS df MS Number of obs = 653 -------------+---------------------------------- F(4, 648) = 39.35 Model | 33.7027516 4 8.4256879 Prob > F = 0.0000 Residual | 138.735476 648 .214097956 R-squared = 0.1954 -------------+---------------------------------- Adj R-squared = 0.1905 Total | 172.438227 652 .264475809 Root MSE = .46271 ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- LTHS | -.940539 .0891212 -10.55 0.000 -1.11554 -.7655377 some_college | -.6611094 .0895106 -7.39 0.000 -.8368753 -.4853434 college | -.3908895 .099192 -3.94 0.000 -.5856661 -.1961128 high_school | -.7834714 .0875428 -8.95 0.000 -.9553731 -.6115697 _cons | 3.340384 .0817959 40.84 0.000 3.179767 3.501001 ------------------------------------------------------------------------------ Model for group 2 Source | SS df MS Number of obs = 13 -------------+---------------------------------- F(1, 11) = 2.25 Model | .127772799 1 .127772799 Prob > F = 0.1621 Residual | .625729212 11 .056884474 R-squared = 0.1696 -------------+---------------------------------- Adj R-squared = 0.0941 Total | .75350201 12 .062791834 Root MSE = .2385 ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- LTHS | .2353045 .157003 1.50 0.162 -.1102567 .5808658 some_college | 0 (omitted) college | 0 (omitted) high_school | 0 (omitted) _cons | 1.918119 .1377007 13.93 0.000 1.615042 2.221196 ------------------------------------------------------------------------------ (model 2 has zero variance coefficients) Blinder-Oaxaca decomposition Number of obs = 666 Model = linear Group 1: too_young = 0 N of obs 1 = 653 Group 2: too_young = 1 N of obs 2 = 13 endowments: (X1 - X2) * b2 coefficients: X2 * (b1 - b2) interaction: (X1 - X2) * (b1 - b2) ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- overall | group_1 | 2.625413 .0201749 130.13 0.000 2.585871 2.664955 group_2 | 2.099122 .0720749 29.12 0.000 1.957858 2.240387 difference | .5262908 .0748453 7.03 0.000 .3795968 .6729848 endowments | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 coefficients | .3369692 .0862894 3.91 0.000 .1678451 .5060933 interaction | .3087063 .0960509 3.21 0.001 .12045 .4969626 -------------+---------------------------------------------------------------- endowments | LTHS | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 some_college | 0 (omitted) college | 0 (omitted) high_school | 0 (omitted) -------------+---------------------------------------------------------------- coefficients | LTHS | -.904495 .1993447 -4.54 0.000 -1.295203 -.5137866 some_college | 0 (omitted) college | 0 (omitted) high_school | -.1808011 .0974085 -1.86 0.063 -.3717182 .0101161 _cons | 1.422265 .1601626 8.88 0.000 1.108352 1.736178 -------------+---------------------------------------------------------------- interaction | LTHS | .5965789 .1710337 3.49 0.000 .2613589 .9317988 some_college | -.1640118 .0248629 -6.60 0.000 -.2127423 -.1152813 college | -.0407052 .0113383 -3.59 0.000 -.0629279 -.0184825 high_school | -.0831556 .0968346 -0.86 0.390 -.2729479 .1066368 ------------------------------------------------------------------------------ ```

And finally, the same decomposition but using the adjusted version of college.

(Click to expand decomp using `college2`)

``` . oaxaca ln_real_wage LTHS some_college college2 high_school, by(too_young) relax noisily Model for group 1 Source | SS df MS Number of obs = 653 -------------+---------------------------------- F(4, 648) = 39.35 Model | 33.7027516 4 8.4256879 Prob > F = 0.0000 Residual | 138.735476 648 .214097956 R-squared = 0.1954 -------------+---------------------------------- Adj R-squared = 0.1905 Total | 172.438227 652 .264475809 Root MSE = .46271 ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- LTHS | -.940539 .0891212 -10.55 0.000 -1.11554 -.7655377 some_college | -.6611094 .0895106 -7.39 0.000 -.8368753 -.4853434 college2 | -.3908895 .099192 -3.94 0.000 -.5856661 -.1961128 high_school | -.7834714 .0875428 -8.95 0.000 -.9553731 -.6115697 _cons | 3.340384 .0817959 40.84 0.000 3.179767 3.501001 ------------------------------------------------------------------------------ Model for group 2 Source | SS df MS Number of obs = 13 -------------+---------------------------------- F(1, 11) = 2.25 Model | .127772799 1 .127772799 Prob > F = 0.1621 Residual | .625729212 11 .056884474 R-squared = 0.1696 -------------+---------------------------------- Adj R-squared = 0.0941 Total | .75350201 12 .062791834 Root MSE = .2385 ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- LTHS | .2353045 .157003 1.50 0.162 -.1102567 .5808658 some_college | 0 (omitted) college2 | 0 (omitted) high_school | 0 (omitted) _cons | 1.918119 .1377007 13.93 0.000 1.615042 2.221196 ------------------------------------------------------------------------------ (model 2 has zero variance coefficients) Blinder-Oaxaca decomposition Number of obs = 666 Model = linear Group 1: too_young = 0 N of obs 1 = 653 Group 2: too_young = 1 N of obs 2 = 13 endowments: (X1 - X2) * b2 coefficients: X2 * (b1 - b2) interaction: (X1 - X2) * (b1 - b2) ------------------------------------------------------------------------------ ln_real_wage | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- overall | group_1 | 2.625413 .0201749 130.13 0.000 2.585871 2.664955 group_2 | 2.099122 .0720749 29.12 0.000 1.957858 2.240387 difference | .5262908 .0748453 7.03 0.000 .3795968 .6729848 endowments | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 coefficients | -.0539203 .1314721 -0.41 0.682 -.3116009 .2037603 interaction | .6995957 .1380755 5.07 0.000 .4289728 .9702187 -------------+---------------------------------------------------------------- endowments | LTHS | -.1193847 .0847395 -1.41 0.159 -.285471 .0467016 some_college | 0 (omitted) college2 | 0 (omitted) high_school | 0 (omitted) -------------+---------------------------------------------------------------- coefficients | LTHS | -.904495 .1993447 -4.54 0.000 -1.295203 -.5137866 some_college | 0 (omitted) college2 | -.3908895 .099192 -3.94 0.000 -.5853023 -.1964766 high_school | -.1808011 .0974085 -1.86 0.063 -.3717182 .0101161 _cons | 1.422265 .1601626 8.88 0.000 1.108352 1.736178 -------------+---------------------------------------------------------------- interaction | LTHS | .5965789 .1710337 3.49 0.000 .2613589 .9317988 some_college | -.1640118 .0248629 -6.60 0.000 -.2127423 -.1152813 college2 | .3501843 .0889856 3.94 0.000 .1757757 .5245929 high_school | -.0831556 .0968346 -0.86 0.390 -.2729479 .1066368 ------------------------------------------------------------------------------ ```

So everything, including both group models, is exactly the same, except that in the final decomposition some of the the interaction term in the college model shows up in the coefficient in the college2 model. I confess I don't precisely understand this, but it sort of makes sense intuitively since in the college2 model the difference happens at a different level of the college dummy. It's even less clear to me why the endowments term doesn't change in that case... But as before, I'm happy to defer to Jann's implementation on this.

So I should probably set up a test to make sure our function comes up with the same estimates as Stata (in this case as in the ones above).

davidskalinder commented 5 months ago

Lots in the works for this one, with luck I'll have a PR (or several!) early next week.

davidskalinder commented 4 months ago

Whew, this ended up being quite a rabbit hole, but I think I've got a fix for the main issue in this thread, so @sinanpl several PRs coming your way for that.

I'm not sure whether they resolve either case in https://github.com/sinanpl/OaxacaBlinder/issues/24#issuecomment-2030815787 though, so it's probably worth leaving this thread open even if the PRs are accepted unless/until we confirm that they are resolved? I think the main problem, that the variable estimates didn't sum to the gap, is a much bigger problem, so hopefully the PRs will resolve that one at least...

davidskalinder commented 4 months ago

Should be fixed in #27, with hard checks added in #25 and #29. Closing.

sinanpl / OaxacaBlinder

Figure out why threefold estimates don't add up to gap when some terms are dropped #24