Open davidskalinder opened 4 months ago
I agree this is wrong. Initial thoughts are to
dep_var
being availableIn addition perhaps warning message at run for when NA values occur in DV / IV.
Any chance you've checked this scenario with the oaxaca
package?
Any chance you've checked this scenario with the
oaxaca
package?
I think so, though not 100% sure.
I agree this is wrong. Initial thoughts are to
* update counts based on `dep_var` being available * provide a message (in summary or list / metainfo kind of slot) with "n observations dropped"
In addition perhaps warning message at run for when NA values occur in DV / IV.
So OaxacaBlinderDecomp()
calls fit_models()
, which returns the whole lm
objects for each model. We could just store the result of nobs()
for each model in the OaxacaBlinderDecomp
results object and then have summary
query that?
And in fact, maybe it'd be handy to just have the full lm
objects in the OaxacaBlinderDecomp
results? It seems like #6 could use them too, and probably other things as well. (For example, I'm not sure how things like augment
would work for Blinder-Oaxaca, but it seems like they'd need the full model fits?) There might be downsides though. I'll open another issue for this.
Update on this one: I'm pretty sure this depends on #18 (or should, at least), since that one puts the full model fits in the results object.
This reprex shows that whether cases with
NA
in the data are removed explicitly or implictly (as in #4), the objects created byOaxacaBlinderDecomp()
are (as expected) identical up to the intentional differences in the dataset, but the summaries of the two objects differ:Click to expand long reprex
``` r library(OaxacaBlinder) dv_na_incl <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long, type = "twofold", baseline_invariant = TRUE ) dv_na_cut <- OaxacaBlinderDecomp( formula = real_wage ~ age + education | female, data = chicago_long |> dplyr::filter(!is.na(real_wage)), type = "twofold", baseline_invariant = TRUE ) # Should be identical waldo::compare( dv_na_incl[names(dv_na_incl) != "meta"], dv_na_cut[names(dv_na_cut) != "meta"] ) #> ✔ No differences # Should be identical except for dataset name waldo::compare( dv_na_incl$meta[names(dv_na_incl$meta) != "data"], dv_na_cut$meta[names(dv_na_incl$meta) != "data"] ) #> `old$dataset_name`: "chicago_long" #> `new$dataset_name`: "dplyr::filter(chicago_long, !is.na(real_wage))" # Summaries should be identical summary(dv_na_incl) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: chicago_long #> #> Descriptives #> n %n mean(real_wage) #> female==0 412 57.9% 17.52 #> female==1 300 42.1% 13.69 #> #> Gap: 3.83 #> % Diff: 21.88% #> coefficient % of gap #> explained -0.53 -13.9% #> unexplained 4.37 113.9% #> unexplained_a 1.89 49.2% #> unexplained_b 2.48 64.6% summary(dv_na_cut) #> Oaxaca Blinder Decomposition model #> ---------------------------------- #> Type: twofold #> Formula: real_wage ~ age + education | female #> Data: dplyr::filter(chicago_long, !is.na(real_wage)) #> #> Descriptives #> n %n mean(real_wage) #> female==0 378 56.8% 17.52 #> female==1 288 43.2% 13.69 #> #> Gap: 3.83 #> % Diff: 21.88% #> coefficient % of gap #> explained -0.53 -13.9% #> unexplained 4.37 113.9% #> unexplained_a 1.89 49.2% #> unexplained_b 2.48 64.6% ``` Created on 2024-03-01 with [reprex v2.1.0](https://reprex.tidyverse.org)The output of
summary()
is wrong, or at best misleading, because it reports group sizes including cases withNA
values even though the model removed them. That's becausesummary()
calculates the sizes directly from the data:https://github.com/sinanpl/OaxacaBlinder/blob/5109b3c6a2f6491fa64ec13c4dedf1344023bc70/R/generics.R#L20
Hopefully there's a way to get this info from the fitted model itself so we can just rely on
lm()
to tell us what it did? Needs further investigation.