sinanpl / OaxacaBlinder

R implementation of Oaxaca-Blinder gap decomposition
MIT License
1 stars 1 forks source link

Clarify or improve `OaxacaBlinderDecomp()`'s handling of NAs #4

Open davidskalinder opened 8 months ago

davidskalinder commented 8 months ago

This reprex shows how to change OaxacaBlinderDecomp()'s handling of NAs (in the DV, in this case, but the same holds for NAs in IVs):

library(OaxacaBlinder)

options(na.action = "na.fail")

dv_na_incl_fail <-
  OaxacaBlinderDecomp(
    formula = real_wage ~ age + education | female, 
    data = chicago_long, 
    type = "twofold", 
    baseline_invariant = TRUE
  )
#> Error in na.fail.default(structure(list(real_wage = c(8.49999985851168, : missing values in object

options(na.action = "na.omit")

dv_na_incl <-
  OaxacaBlinderDecomp(
    formula = real_wage ~ age + education | female, 
    data = chicago_long, 
    type = "twofold", 
    baseline_invariant = TRUE
  )
dv_na_incl
#> Oaxaca Blinder Decomposition model
#> ----------------------------------
#> Type: twofold
#> Formula: real_wage ~ age + education | female
#> Data: chicago_long

Created on 2024-03-01 with reprex v2.1.0

It looks OaxacaBlinderDecomp() usually runs when there are NAs in the DV because it fits the model using lm() with no options:

https://github.com/sinanpl/OaxacaBlinder/blob/5109b3c6a2f6491fa64ec13c4dedf1344023bc70/R/oaxaca.R#L93

Since lm() is called without its na.action argument, its handling of NAs can (only) be set by changing the global na.action option. It seems like it might be worth making it so that OaxacaBlinderDecomp() can pass arguments like na.action, or perhaps others, to lm()? Or at least documenting the behavior?

Probably not an urgent fix for me now that I know that it does this, but it took some investigating to clarify what was happening under the hood...

sinanpl commented 8 months ago

I believe the na.fail option would help for the analyst to have more awareness about the dataset. On the other hand, this holds for IVs as well I assume and forces the analyst to impute or drop NAs.

So question is: 1) do we keep this as is and document the behavior 2) do we make it more strict enforcing the analyst to have a 'clean' dataset

Will think about this. Comments are appreciated.

davidskalinder commented 8 months ago

Well, like I say, couldn't we add an na.action argument to OaxacaBlinderDecomp() and then just pass that to the lm() calls?

I think that would be the best option. Whether or not we do that, I think documenting the behavior would be enough -- no need to make the default more strict I think. After all, this package is just defaulting to the global settings, which is probably what users should expect anyway (part of the reason the behavior was less clear to me was because of #5 and #6).

Come to think of it, this whole issue might not even really be a bug unless you share my opinion that it's silly for R to have a global na.action option at all and that it should be set at the function level. But I do think my opinion is right of course. :)