Choose reference group - Githubissues

davidskalinder commented 5 months ago

My attempt to fix #10, with the settings I suggested there. NB that this does create two new options for OaxacaBlinderDecomp(), so if you don't want to introduce those, don't accept the PR!

Of course it's tricky to test this. The reprex below shows at least one successful test against oaxaca::oaxaca()'s results, though of course it's not exhaustive. I hope that the if/stop block I put into the new reference group selection section is restrictive enough to catch most things that will go wrong since it checks that the two groups exist.

One case that almost certainly will fail is where a group is eliminated by na.action, but it'd be clunky to check that, and if that happens then other parts of modify_group_var_to_dummy() are in trouble too. I think the better solution is probably to not pass the original data but rather the model data to the function?

Click to expand long reprex

``` r oaxaca_results <- oaxaca::oaxaca( ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree | foreign.born, data = OaxacaBlinder::chicago, R = NULL ) #> oaxaca: oaxaca() performing analysis. Please wait. oaxaca_results_flipped <- oaxaca::oaxaca( ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree | fbneg, data = OaxacaBlinder::chicago |> dplyr::mutate(fbneg = !foreign.born), R = NULL ) #> oaxaca: oaxaca() performing analysis. Please wait. obd_results <- OaxacaBlinder::OaxacaBlinderDecomp( ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree | foreign.born, data = OaxacaBlinder::chicago, type = "threefold", ref_group_auto = FALSE, ref_group = 1 ) obd_results_flipped <- OaxacaBlinder::OaxacaBlinderDecomp( ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree | foreign.born, data = OaxacaBlinder::chicago, type = "threefold", ref_group_auto = FALSE, ref_group = 0 ) oaxaca_results$threefold #> $overall #> coef(endowments) se(endowments) coef(coefficients) se(coefficients) #> 0.07694359 NA 0.12222967 NA #> coef(interaction) se(interaction) #> -0.05580754 NA #> #> $variables #> coef(endowments) se(endowments) coef(coefficients) #> (Intercept) 0.00000000 NA -0.167994227 #> age -0.03706421 NA 0.319427085 #> female -0.01667487 NA -0.039354868 #> LTHS 0.04544444 NA -0.004597125 #> some.college 0.03947192 NA -0.001866322 #> college 0.01035304 NA 0.023877011 #> advanced.degree 0.03541327 NA -0.007261879 #> se(coefficients) coef(interaction) se(interaction) #> (Intercept) NA 0.000000000 NA #> age NA -0.052080622 NA #> female NA -0.008457869 NA #> LTHS NA 0.003193002 NA #> some.college NA -0.003497776 NA #> college NA 0.013756722 NA #> advanced.degree NA -0.008721002 NA obd_results$varlevel #> endowments coefficients interaction #> (Intercept) 0.00000000 -0.167994227 0.000000000 #> age -0.03706421 0.319427085 -0.052080622 #> female -0.01667487 -0.039354868 -0.008457869 #> LTHS 0.04544444 -0.004597125 0.003193002 #> some.college 0.03947192 -0.001866322 -0.003497776 #> college 0.01035304 0.023877011 0.013756722 #> advanced.degree 0.03541327 -0.007261879 -0.008721002 oaxaca_results_flipped$threefold #> $overall #> coef(endowments) se(endowments) coef(coefficients) se(coefficients) #> -0.02113605 NA -0.06642213 NA #> coef(interaction) se(interaction) #> -0.05580754 NA #> #> $variables #> coef(endowments) se(endowments) coef(coefficients) #> (Intercept) 0.00000000 NA 0.167994227 #> age 0.08914483 NA -0.267346463 #> female 0.02513274 NA 0.047812737 #> LTHS -0.04863744 NA 0.001404123 #> some.college -0.03597414 NA 0.005364098 #> college -0.02410977 NA -0.037633733 #> advanced.degree -0.02669226 NA 0.015982880 #> se(coefficients) coef(interaction) se(interaction) #> (Intercept) NA 0.000000000 NA #> age NA -0.052080622 NA #> female NA -0.008457869 NA #> LTHS NA 0.003193002 NA #> some.college NA -0.003497776 NA #> college NA 0.013756722 NA #> advanced.degree NA -0.008721002 NA obd_results_flipped$varlevel #> endowments coefficients interaction #> (Intercept) 0.00000000 0.167994227 0.000000000 #> age 0.08914483 -0.267346463 -0.052080622 #> female 0.02513274 0.047812737 -0.008457869 #> LTHS -0.04863744 0.001404123 0.003193002 #> some.college -0.03597414 0.005364098 -0.003497776 #> college -0.02410977 -0.037633733 0.013756722 #> advanced.degree -0.02669226 0.015982880 -0.008721002 ``` ^{Created on 2024-03-14 with [reprex v2.1.0](https://reprex.tidyverse.org)}

sinanpl commented 5 months ago

I checked the implementation. What do you think of following approach: adding one argument to alter the default behavior from reference=highest to reference=lowest. Code-wise this would be much cleaner with only 1 added argument. In an interactive data analysis setting I think this makes most sense.

# modify group var such that group0 (reference) is the group that has a higher dep_var avg
dep_var_avgs = aggregate(data[[dep_var]], list(gr=data[[group_var]]), FUN=mean, na.rm=TRUE)
dep_var_avgs = dep_var_avgs[order(dep_var_avgs$x, decreasing = TRUE), ]

if (lowest_is_reference){
  group1 = dep_var_avgs$gr[2]
  group2 = dep_var_avgs$gr[1]
} else {
  group1 = dep_var_avgs$gr[1] # higher dep_var avg
  group2 = dep_var_avgs$gr[2] # lower dep_var avg
}

davidskalinder commented 5 months ago

So I agree your suggestion is much cleaner; but it seems to me that choosing the reference group based on the outcomes of the groups is pretty unusual? At least in the contexts I'm used to, it seems like a much more likely scenario is to choose the reference group that's most substantively meaningful: if I want to show how immigration affects wages, then it makes sense to use native-born workers as the baseline no matter what. (In the project I'm working on now, I'm comparing early periods to late periods, so I always want the early period to be the baseline.) That seems even more likely to be the case if someone were to to run multiple models (with different sets of covariates, for example): they'd need to keep the same baseline group for every model in order to compare the models.

The current behavior (as well as the change you suggest) has the cool benefit of letting the user decide whether to force the overall difference in means to be positive or negative, and I can see that being useful in some cases. But I can definitely see cases (like my own!) where it's more important to have the reference group be consistent; and if the reference group is set automatically, it's quite hard to do this.

The extra function arg is definitely a downside though, so I'm certainly up for a simpler solution that allows both types of behavior...

sinanpl commented 5 months ago

ok, makes sense! still hesitant to merge though. I do think that there must be a cleaner solution. my initial thought is to include some logic that processes it the way you describe when group_var is a factor with two levels. I think this would be the most 'R' way.

the user would have to specify the groupvar as a factor in which reference level is the 1st level. accordingly, if the group_var is a factor, the groups A and B are defined based on this. This results in no extra arguments, but perhaps extra documentation in details section and warnings when requirements are not met. We might even force the group_var to be a factor instead of any type...

Need to think about this.

davidskalinder commented 5 months ago

I think this would be the most 'R' way.

Hah yeah I agree that that's pretty base-R-ish, but I'm not sure that's a good thing. (Base R could do with a healthy dose of "explicit is better than implicit" if you ask me.) Maybe more seriously, I always find carefully releveling factors to be confusing and a pain in the neck, so I'm wary of introducing it as a solution here. I actually really like that BlinderOaxacaDecomp() currently cleans up the group variable in a sensible way whether it's a factor or not (let alone whether the levels are set in the right order).

I hear you though about the inelegance of having one option to toggle automatic behavior and another option to control the manual behavior. It'd be good to avoid that.

Maybe there could be a separate user-exposed function that does the automatic behavior? So the main call could look something like the following?

OaxacaBlinderDecomp(
  ln.real.wage ~ education | foreign.born,
  data = OaxacaBlinder::chicago_long,
  type = "threefold",
  ref_group = auto_choose_reference(gap = "positive")
)

sinanpl commented 5 months ago

i'll close this one in favor of https://github.com/sinanpl/OaxacaBlinder/pull/18. still needs some work though

sinanpl / OaxacaBlinder

Choose reference group #11