Fix gaps calculation when IVs have NAs

davidskalinder commented 6 months ago

This reprex shows that whether cases with NA in the IVs are removed explicitly or implictly (as in #4), the objects created by OaxacaBlinderDecomp() are (as expected) identical up to the intentional differences in the dataset, but the summaries of the two objects differ not only in the sizes of the groups (as in #5), but also in the estimates of the gaps:

Click to expand long reprex

``` r library(OaxacaBlinder) chicago_long_ivna <- chicago_long |> dplyr::mutate(education = purrr::assign_in(education, 5, NA)) iv_na_incl <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long_ivna, type = "twofold", baseline_invariant = TRUE ) iv_na_cut <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long_ivna |> dplyr::filter(!is.na(education)), type = "twofold", baseline_invariant = TRUE ) # Should be identical waldo::compare( iv_na_incl[names(iv_na_incl) != "meta"], iv_na_cut[names(iv_na_cut) != "meta"] ) #> `old$gaps$gap`: 3.83 #> `new$gaps$gap`: 3.86 #> #> `old$gaps$pct_gap`: 0.219 #> `new$gaps$pct_gap`: 0.220 #> #> `old$gaps$EY_a`: 17.52 #> `new$gaps$EY_a`: 17.55 # Should be identical except for dataset name waldo::compare( iv_na_incl$meta[names(iv_na_incl$meta) != "data"], iv_na_cut$meta[names(iv_na_incl$meta) != "data"] ) #> `old$dataset_name`: "chicago_long_ivna" #> `new$dataset_name`: "dplyr::filter(chicago_long_ivna, !is.na(education))" # Summaries should be identical summary(iv_na_incl) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: chicago_long_ivna #> #> Descriptives #> n %n mean(real_wage) #> female==0 412 57.9% 17.52 #> female==1 300 42.1% 13.69 #> #> Gap: 3.83 #> % Diff: 21.88% #> coefficient % of gap #> explained -0.52 -13.6% #> unexplained 4.38 114.3% #> unexplained_a 1.90 49.5% #> unexplained_b 2.48 64.8% summary(iv_na_cut) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: dplyr::filter(chicago_long_ivna, !is.na(education)) #> #> Descriptives #> n %n mean(real_wage) #> female==0 411 57.8% 17.55 #> female==1 300 42.2% 13.69 #> #> Gap: 3.86 #> % Diff: 21.99% #> coefficient % of gap #> explained -0.52 -13.5% #> unexplained 4.38 113.5% #> unexplained_a 1.90 49.2% #> unexplained_b 2.48 64.4% ``` ^{Created on 2024-03-01 with [reprex v2.1.0](https://reprex.tidyverse.org)}

This is because NAs aren't taken into account when calculating the gaps -- the following line uses na.rm to handle NAs in the DV, but there's nothing to handle the IVs:

https://github.com/sinanpl/OaxacaBlinder/blob/5109b3c6a2f6491fa64ec13c4dedf1344023bc70/R/oaxaca.R#L58

As in #5, the problem is that the gaps are calculated directly from the input dataset, not from the data that the model actually used to run. And so also as in #5, my hope is that there's a way to get the data from the object returned by lm() so that we can let it worry about this stuff for us instead of having to go back to the input data. Not sure yet whether or how that'll work though.

davidskalinder commented 6 months ago

Pretty sure we could do this by just passing fitted_models (or perhaps just each model's model element, which contains the data) to calculate_gap() and then using that for the calculations?

Because this does work as we want it:

> getOption("na.action")
[1] "na.omit"
> airquality$Ozone |> is.na() |> table()

FALSE  TRUE 
  116    37 
> lm(Wind ~ Ozone, data = airquality) |> purrr::pluck("model", "Ozone") |> is.na() |> table()

FALSE 
  116

davidskalinder commented 6 months ago

NB this might actually be simpler than #5 since the calculation all happens within OaxacaBlinderDecomp() and therefore doesn't even need any intermediate results to be stored and called later by generics.

davidskalinder commented 5 months ago

Ah, so of course another trick here is that the two models have different data. We could kludge this by just rbinding the two datasets together, but I think it'll be cleaner to just have calculate_gaps() take an extra argument...

sinanpl / OaxacaBlinder

Fix gaps calculation when IVs have NAs #6