Make reference group more predictable

davidskalinder commented 8 months ago

I just noticed that the reference group for the decomposition is determined by which group has the larger mean DV:

https://github.com/sinanpl/OaxacaBlinder/blob/de7e7ac03666fe15f80ecbabadf41a55002a2629/R/oaxaca.R#L35

There are a couple things to note about this:

The code comment notes that the group with the higher mean DV will be the reference group, when in fact (I'm pretty sure) the lower mean DV will be the reference group. (As Jann (2008:455) puts it, when group B is subtracted from group A, the decomposition is "formulated from the viewpoint of group B".) This is a minor point of course, but I suppose the comment should be clarified.
This shares the same problem as #4 and #5 in that the calculation of DV means does not take into account NAs the same way the lm() calls will, and so in theory (though probably rarely) could set the reference group in a different way than would be calculated after na.action.

More seriously though, I think setting the reference group like this, while convenient in many cases, could lead to results that might be confusing and easy to misinterpret. At the very least it should be documented, but I think it would be best to add an option (or two, perhaps) to OaxacaBlinderDecomp() to change how the reference group is set.

I think the current auto-magic behavior would need an on/off switch, leaving the question of what to do when it's turned off. I think we could either require the group variable to be a logical and to always use one value as the reference (FALSE seems intuitive to me, but I think Jann and Hlavac both use TRUE), or to take a second option (which is ignored if the auto-magic behavior is turned on) to determine the reference group.

I think it'd be better to have the user supply the reference value since it doesn't require the variable to be set up as a dummy? But I'm happy to be overrruled. (Note that if we go this route we should probably accept any value here and match it to the label if the group variable is a factor and the reference option is a character.)

davidskalinder commented 8 months ago

(Note that if we go this route we should probably accept any value here and match it to the label if the group variable is a factor and the reference option is a character.)

Whoops, never mind, I forgot that R will (always?) match characters to factor levels. So I think we shouldn't mess with this and instead let R do what R normally does.

sinanpl commented 8 months ago

also related... something I discovered while writing tests... the unexplained_A / B coefs are differing compared to the oaxaca package. So this might be something that will / should be improved.

# fit several models in both packages
formula = real_wage ~ age | male
dataset = OaxacaBlinder::chicago_long

modv1_twofold_neumark = OaxacaBlinder::OaxacaBlinderDecomp(
  formula=formula,
  data = dataset,
  type = 'twofold',
  pooled = 'neumark',
  baseline_invariant = FALSE,
  n_bootstraps = NULL
)

modv2 = oaxaca::oaxaca(formula = formula, data = dataset, R=NULL)
#> oaxaca: oaxaca() performing analysis. Please wait.

modv1_twofold_neumark$overall
#> $explained
#> [1] 0.204005
#> 
#> $unexplained
#> [1] 3.630331
#> 
#> $unexplained_a
#> [1] 1.569873
#> 
#> $unexplained_b
#> [1] 2.060458
modv2$twofold$overall[5, ]
#>        group.weight     coef(explained)       se(explained)   coef(unexplained) 
#>           -1.000000           -0.204005                  NA           -3.630331 
#>     se(unexplained) coef(unexplained A)   se(unexplained A) coef(unexplained B) 
#>                  NA           -2.060458                  NA           -1.569873 
#>   se(unexplained B) 
#>                  NA

^{Created on 2024-03-17 with reprex v2.0.2}

sinanpl commented 8 months ago

in addition, this is currently not the case whn type = 'jann'

# fit several models in both packages
formula = real_wage ~ age | male
dataset = OaxacaBlinder::chicago_long

modv1_twofold = OaxacaBlinder::OaxacaBlinderDecomp(
    formula=formula,
    data = dataset,
    type = 'twofold',
    pooled = 'jann',
    baseline_invariant = FALSE,
    n_bootstraps = NULL
)

modv2 = oaxaca::oaxaca(formula = formula, data = dataset, R=NULL)
#> oaxaca: oaxaca() performing analysis. Please wait.

modv1_twofold$overall
#> $explained
#> [1] 0.1918078
#> 
#> $unexplained
#> [1] 3.642528
#> 
#> $unexplained_a
#> [1] 2.517431e-14
#> 
#> $unexplained_b
#> [1] 3.642528
modv2$twofold$overall[6, ]
#>        group.weight     coef(explained)       se(explained)   coef(unexplained) 
#>       -2.000000e+00       -1.918078e-01                  NA       -3.642528e+00 
#>     se(unexplained) coef(unexplained A)   se(unexplained A) coef(unexplained B) 
#>                  NA       -1.940510e-14                  NA       -3.642528e+00 
#>   se(unexplained B) 
#>                  NA

^{Created on 2024-03-17 with reprex v2.0.2}

davidskalinder commented 7 months ago

Good catches! I've almost exclusively been focusing on the threefold decomposition (which I understand better at this point), so I'm glad you're checking the twofold side...

davidskalinder commented 6 months ago

Update on this: I think this should get fixed by #18 once it's in shape, and I suspect that the differences with oaxaca might get fixed by that too (though frankly 1. we should check and make sure, and 2. I'm becoming less confident in oaxaca anyway and I think it's probably better to run tests against Jann's Stata command instead).

sinanpl / OaxacaBlinder

Make reference group more predictable #10