Fix group size calculation when data has `NA`s

davidskalinder commented 4 months ago

This reprex shows that whether cases with NA in the data are removed explicitly or implictly (as in #4), the objects created by OaxacaBlinderDecomp() are (as expected) identical up to the intentional differences in the dataset, but the summaries of the two objects differ:

Click to expand long reprex

``` r library(OaxacaBlinder) dv_na_incl <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long, type = "twofold", baseline_invariant = TRUE ) dv_na_cut <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long |> dplyr::filter(!is.na(real_wage)), type = "twofold", baseline_invariant = TRUE ) # Should be identical waldo::compare( dv_na_incl[names(dv_na_incl) != "meta"], dv_na_cut[names(dv_na_cut) != "meta"] ) #> ✔ No differences # Should be identical except for dataset name waldo::compare( dv_na_incl$meta[names(dv_na_incl$meta) != "data"], dv_na_cut$meta[names(dv_na_incl$meta) != "data"] ) #> `old$dataset_name`: "chicago_long" #> `new$dataset_name`: "dplyr::filter(chicago_long, !is.na(real_wage))" # Summaries should be identical summary(dv_na_incl) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: chicago_long #> #> Descriptives #> n %n mean(real_wage) #> female==0 412 57.9% 17.52 #> female==1 300 42.1% 13.69 #> #> Gap: 3.83 #> % Diff: 21.88% #> coefficient % of gap #> explained -0.53 -13.9% #> unexplained 4.37 113.9% #> unexplained_a 1.89 49.2% #> unexplained_b 2.48 64.6% summary(dv_na_cut) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: dplyr::filter(chicago_long, !is.na(real_wage)) #> #> Descriptives #> n %n mean(real_wage) #> female==0 378 56.8% 17.52 #> female==1 288 43.2% 13.69 #> #> Gap: 3.83 #> % Diff: 21.88% #> coefficient % of gap #> explained -0.53 -13.9% #> unexplained 4.37 113.9% #> unexplained_a 1.89 49.2% #> unexplained_b 2.48 64.6% ``` ^{Created on 2024-03-01 with [reprex v2.1.0](https://reprex.tidyverse.org)}

The output of summary() is wrong, or at best misleading, because it reports group sizes including cases with NA values even though the model removed them. That's because summary() calculates the sizes directly from the data:

https://github.com/sinanpl/OaxacaBlinder/blob/5109b3c6a2f6491fa64ec13c4dedf1344023bc70/R/generics.R#L20

Hopefully there's a way to get this info from the fitted model itself so we can just rely on lm() to tell us what it did? Needs further investigation.

sinanpl commented 4 months ago

I agree this is wrong. Initial thoughts are to

update counts based on dep_var being available
provide a message (in summary or list / metainfo kind of slot) with "n observations dropped"

In addition perhaps warning message at run for when NA values occur in DV / IV.

Any chance you've checked this scenario with the oaxaca package?

davidskalinder commented 4 months ago

Any chance you've checked this scenario with the oaxaca package?

I think so, though not 100% sure.

I agree this is wrong. Initial thoughts are to
* update counts based on `dep_var` being available

* provide a message (in summary or list / metainfo kind of slot) with "n observations dropped"
In addition perhaps warning message at run for when NA values occur in DV / IV.

So OaxacaBlinderDecomp() calls fit_models(), which returns the whole lm objects for each model. We could just store the result of nobs() for each model in the OaxacaBlinderDecomp results object and then have summary query that?

And in fact, maybe it'd be handy to just have the full lm objects in the OaxacaBlinderDecomp results? It seems like #6 could use them too, and probably other things as well. (For example, I'm not sure how things like augment would work for Blinder-Oaxaca, but it seems like they'd need the full model fits?) There might be downsides though. I'll open another issue for this.

davidskalinder commented 2 months ago

Update on this one: I'm pretty sure this depends on #18 (or should, at least), since that one puts the full model fits in the results object.

sinanpl / OaxacaBlinder

Fix group size calculation when data has `NA`s #5