Open davidskalinder opened 8 months ago
(Note that if we go this route we should probably accept any value here and match it to the label if the group variable is a factor and the reference option is a character.)
Whoops, never mind, I forgot that R will (always?) match characters to factor levels. So I think we shouldn't mess with this and instead let R do what R normally does.
also related... something I discovered while writing tests... the unexplained_A / B coefs are differing compared to the oaxaca package. So this might be something that will / should be improved.
# fit several models in both packages
formula = real_wage ~ age | male
dataset = OaxacaBlinder::chicago_long
modv1_twofold_neumark = OaxacaBlinder::OaxacaBlinderDecomp(
formula=formula,
data = dataset,
type = 'twofold',
pooled = 'neumark',
baseline_invariant = FALSE,
n_bootstraps = NULL
)
modv2 = oaxaca::oaxaca(formula = formula, data = dataset, R=NULL)
#> oaxaca: oaxaca() performing analysis. Please wait.
modv1_twofold_neumark$overall
#> $explained
#> [1] 0.204005
#>
#> $unexplained
#> [1] 3.630331
#>
#> $unexplained_a
#> [1] 1.569873
#>
#> $unexplained_b
#> [1] 2.060458
modv2$twofold$overall[5, ]
#> group.weight coef(explained) se(explained) coef(unexplained)
#> -1.000000 -0.204005 NA -3.630331
#> se(unexplained) coef(unexplained A) se(unexplained A) coef(unexplained B)
#> NA -2.060458 NA -1.569873
#> se(unexplained B)
#> NA
Created on 2024-03-17 with reprex v2.0.2
in addition, this is currently not the case whn type = 'jann'
# fit several models in both packages
formula = real_wage ~ age | male
dataset = OaxacaBlinder::chicago_long
modv1_twofold = OaxacaBlinder::OaxacaBlinderDecomp(
formula=formula,
data = dataset,
type = 'twofold',
pooled = 'jann',
baseline_invariant = FALSE,
n_bootstraps = NULL
)
modv2 = oaxaca::oaxaca(formula = formula, data = dataset, R=NULL)
#> oaxaca: oaxaca() performing analysis. Please wait.
modv1_twofold$overall
#> $explained
#> [1] 0.1918078
#>
#> $unexplained
#> [1] 3.642528
#>
#> $unexplained_a
#> [1] 2.517431e-14
#>
#> $unexplained_b
#> [1] 3.642528
modv2$twofold$overall[6, ]
#> group.weight coef(explained) se(explained) coef(unexplained)
#> -2.000000e+00 -1.918078e-01 NA -3.642528e+00
#> se(unexplained) coef(unexplained A) se(unexplained A) coef(unexplained B)
#> NA -1.940510e-14 NA -3.642528e+00
#> se(unexplained B)
#> NA
Created on 2024-03-17 with reprex v2.0.2
Good catches! I've almost exclusively been focusing on the threefold decomposition (which I understand better at this point), so I'm glad you're checking the twofold side...
Update on this: I think this should get fixed by #18 once it's in shape, and I suspect that the differences with oaxaca
might get fixed by that too (though frankly 1. we should check and make sure, and 2. I'm becoming less confident in oaxaca
anyway and I think it's probably better to run tests against Jann's Stata command instead).
I just noticed that the reference group for the decomposition is determined by which group has the larger mean DV:
https://github.com/sinanpl/OaxacaBlinder/blob/de7e7ac03666fe15f80ecbabadf41a55002a2629/R/oaxaca.R#L35
There are a couple things to note about this:
NA
s the same way thelm()
calls will, and so in theory (though probably rarely) could set the reference group in a different way than would be calculated afterna.action
.More seriously though, I think setting the reference group like this, while convenient in many cases, could lead to results that might be confusing and easy to misinterpret. At the very least it should be documented, but I think it would be best to add an option (or two, perhaps) to
OaxacaBlinderDecomp()
to change how the reference group is set.I think the current auto-magic behavior would need an on/off switch, leaving the question of what to do when it's turned off. I think we could either require the group variable to be a logical and to always use one value as the reference (
FALSE
seems intuitive to me, but I think Jann and Hlavac both useTRUE
), or to take a second option (which is ignored if the auto-magic behavior is turned on) to determine the reference group.I think it'd be better to have the user supply the reference value since it doesn't require the variable to be set up as a dummy? But I'm happy to be overrruled. (Note that if we go this route we should probably accept any value here and match it to the label if the group variable is a factor and the reference option is a character.)