updates reference level, includes all fitted models in objects

This PR introduces more accurate reference levels and extra output to improve debugging. The updates are now more consistent with the oaxaca package. Results below for the twofold models

Breaking changes:

group_var should be factor from now on where reference level is the second level of the factor variable. the first level will be introduced in the model as a dummy when pooled model argument specifies the Jann(2008) model.
OaxacaBlinder::chicago_long is updated to have a factor column gender instead of female / male binary numeric columns

Non-breaking change

output object contains 4 fitted models in output_obj$meta$fitted_models (model for A, model for B and pooled models (jann / neumark)

formula = real_wage ~ age | gender
data = OaxacaBlinder::chicago_long

# oacaxa works with binary / logical
datasetv2 = data
datasetv2$gender = ifelse(datasetv2$gender == "male", 1, 0)

modv1_twofold_neumark = OaxacaBlinder::OaxacaBlinderDecomp(
  formula=formula,
  data = data,
  type = 'twofold',
  pooled = 'neumark',
  baseline_invariant = FALSE,
  n_bootstraps = NULL
)
modv1_twofold_jann = OaxacaBlinder::OaxacaBlinderDecomp(
  formula=formula,
  data = data,
  type = 'twofold',
  pooled = 'jann',
  baseline_invariant = FALSE,
  n_bootstraps = NULL
)
modv2 = oaxaca::oaxaca(formula = formula, data = datasetv2, R=NULL)
#> oaxaca: oaxaca() performing analysis. Please wait.

# check outputs oaxaca for neumark
summary(modv1_twofold_neumark)
#> Oaxaca Blinder Decomposition model
#> ----------------------------------
#> Type: twofold
#> Formula: real_wage ~ age | gender
#> Data: data
#> 
#> Descriptives
#>                  n    %n mean(real_wage)
#> gender==male   412 57.9%           13.69
#> gender==female 300 42.1%           17.52
#> 
#> Gap: -3.83
#> % Diff: -28.01%
#>               coefficient   % of gap
#> explained           -0.20       5.3%
#> unexplained         -3.63      94.7%
#> unexplained_a       -2.06      53.7%
#> unexplained_b       -1.57      40.9%
t(t(modv2$twofold$overall[c(5), ]))
#>                          [,1]
#> group.weight        -1.000000
#> coef(explained)     -0.204005
#> se(explained)              NA
#> coef(unexplained)   -3.630331
#> se(unexplained)            NA
#> coef(unexplained A) -2.060458
#> se(unexplained A)          NA
#> coef(unexplained B) -1.569873
#> se(unexplained B)          NA

# check outputs oaxaca for jann
summary(modv1_twofold_jann)
#> Oaxaca Blinder Decomposition model
#> ----------------------------------
#> Type: twofold
#> Formula: real_wage ~ age | gender
#> Data: data
#> 
#> Descriptives
#>                  n    %n mean(real_wage)
#> gender==male   412 57.9%           13.69
#> gender==female 300 42.1%           17.52
#> 
#> Gap: -3.83
#> % Diff: -28.01%
#>               coefficient   % of gap
#> explained           -0.19       5.0%
#> unexplained         -3.64      95.0%
#> unexplained_a        0.00       0.0%
#> unexplained_b       -3.64      95.0%
t(t(modv2$twofold$overall[c(6), ]))
#>                              [,1]
#> group.weight        -2.000000e+00
#> coef(explained)     -1.918078e-01
#> se(explained)                  NA
#> coef(unexplained)   -3.642528e+00
#> se(unexplained)                NA
#> coef(unexplained A) -1.940510e-14
#> se(unexplained A)              NA
#> coef(unexplained B) -3.642528e+00
#> se(unexplained B)              NA

# check existence of all models in output object
str(modv1_twofold_jann$meta$fitted_models, max.level = 1)
#> List of 4
#>  $ mod_a                 :List of 13
#>   ..- attr(*, "class")= chr "lm"
#>  $ mod_b                 :List of 13
#>   ..- attr(*, "class")= chr "lm"
#>  $ mod_pooled_neumark1988:List of 13
#>   ..- attr(*, "class")= chr "lm"
#>  $ mod_pooled_jann2008   :List of 14
#>   ..- attr(*, "class")= chr "lm"

^{Created on 2024-03-23 with reprex v2.0.2}

@davidskalinder what do you think of this?

this is an alternative + some extras for https://github.com/sinanpl/OaxacaBlinder/pull/11

Some notes here that I don't know what to do with yet:

In oaxaca, observations for which the group variable takes the value 1/TRUE get assigned to group B, whereas 0/FALSE gets assigned to group A:

> oaxaca::oaxaca(ln.real.wage ~ foreign.born | female, oaxaca::chicago, R = NULL) |> purrr::pluck("y")
oaxaca: oaxaca() performing analysis. Please wait.
$y.A
[1] 2.704193

$y.B
[1] 2.498258

$y.diff
[1] 0.2059347

> oaxaca::chicago |> dplyr::group_by(female) |> dplyr::summarize(mean = mean(ln.real.wage, na.rm = TRUE)) |> as.data.frame()
  female     mean
1      0 2.704193
2      1 2.498258

After loading the branch in this PR, observations for which the group variable takes a value of the second factor level get assigned to group A, whereas the first factor level gets assigned to group B:

> OaxacaBlinderDecomp(ln_real_wage ~ foreign_born | gender, chicago_long) |> purrr::pluck("gaps")
$gap
[1] -0.2059347

$pct_gap
[1] -0.08243132

$EY_a
[1] 2.498258

$EY_b
[1] 2.704193

> chicago_long |> dplyr::group_by(gender) |> dplyr::summarize(mean = mean(ln_real_wage, na.rm = TRUE)) |> as.data.frame()
  gender     mean
1   male 2.704193
2 female 2.498258

> levels(chicago_long$gender)
[1] "male"   "female"

... as expected given the assignments in parse_formula() (line 13, currently) and fit_models() (lines 63-64, currently).

So to summarize:

package	group_A_val	group_B_val
oaxaca	0 or FALSE	1 or TRUE
OaxacaBlinder	second level	first level

It seems to me that the implementation above is therefore the wrong way around?

NB that I think part of the problem might be that the language is a bit fuzzy. Jann (2008:455) doesn't use the term "reference" for a value of the group variable, but instead notes that when the decomp is performed in the usual way (subtracting group B from group A, same as in calculate_coefs), the decomp is "formulated from the viewpoint of group B". I think we should probably use that language to avoid getting turned around.

I should also check how Stata defines the groups by default and add it to that table. Might not get to that this evening though.

Here's how Jann's Stata command handles it:

. oaxaca ln_real_wage age normalize(LTHS some_college college high_school advanced_degree), by(female)
(normalized: LTHS some_college college high_school advanced_degree)

Blinder-Oaxaca decomposition                    Number of obs     =        666
                                                Model             =     linear
Group 1: female = 0                             N of obs 1        =        378
Group 2: female = 1                             N of obs 2        =        288

   endowments: (X1 - X2) * b2
 coefficients: X2 * (b1 - b2)
  interaction: (X1 - X2) * (b1 - b2)

So the full table is like so:

package	group_A_val	group_B_val
oaxaca (Hlavac)	0 or FALSE	1 or TRUE
oaxaca (Jann)	0	1
OaxacaBlinder	second level	first level

Of course, the levels of a factor aren't really the same as the two values of a logical, but I think it's natural to view the FALSE value as being like the first level and the TRUE value as the second; so if we want to stick to the implementation of the other packages, the direction of the implementation in this PR should be flipped.

Having said that: I'm not entirely convinced that that's a bad thing, for several reasons:

It seems counterintuitive to me to subtract the TRUE group variable from the FALSE one. (Though I suppose it is more intuitive if the TRUE group is often a disempowered group whose disempowerment is being investigated? So, for instance, it makes sense to do the decomposition from the viewpoint of female == TRUE?)
lm() (and probably other models in R?) by default uses the first level of a factor as the reference level. (It's not entirely clear to me how Stata chooses the reference level.)
relevel() only sets the first level of a factor (which it calls the "reference level"); so it can easily be used to set the first level, but not the second.

I think item 3 is the most serious of these problems, since users need a way to set the "viewpoint" level easily. If they can't do that using relevel(), then I think we should export another function to do it instead (which can probably just be a function that flips relevel() but is more intuitively named).

@sinanpl, any thoughts about this?

I'll add BTW that personally I still feel like the best way to handle this is just to have OaxacaBlinderDecomp() take a new option to specify the value of the viewpoint group. If we follow through on this PR's proposal to drop the auto-group-choice behavior, then this doesn't even have the problem of requiring two options as discussed in https://github.com/sinanpl/OaxacaBlinder/pull/11#issuecomment-2015494016. I think allowing users to directly specify the viewpoint level is clearer and more straightforward, and of course we can still set the default group how we want depending on the considerations above...

So @sinanpl my plan was to try to complete this PR and get it into the codebase so I could start using (and testing!) the package in my project. However, thinking about it, I don't want to make the call on how to handle what I guess I'll now (until I find a better name?) start calling the "viewpoint" group until you weigh in (i.e., on whether to use the first factor level as the other packages do, to use the second level as this PR does, or to have a user-specified option with one of the levels as the default). Obviously any changes to that setup would be breaking changes, and possibly confusing ones at that, so I'm hesitant to send the thing in a direction that's hard to get back from later.

For the time being, I'll implement a new private mod with a simple version of my preferred solution of a user-specified option and use that for testing.

sinanpl / OaxacaBlinder

updates reference level, includes all fitted models in objects #18