sinanpl / OaxacaBlinder

R implementation of Oaxaca-Blinder gap decomposition
MIT License

Fix baseline-invariant results for IVs with zero-variance levels #27

Closed · davidskalinder closed this 4 months ago

davidskalinder commented 4 months ago

This should fix the bulk of #24 (though as I mentioned in https://github.com/sinanpl/OaxacaBlinder/issues/24#issuecomment-2046233223, I am not sure about the narrower cases I raised in https://github.com/sinanpl/OaxacaBlinder/issues/24#issuecomment-2030815787).

09e4fd6dc02f682dcda3f64a75ea29a0c10a1723 is arguably the linchpin of the whole thing -- see that commit comment for more details.

This branch does depend on a few other pending PRs, notably #20 and #26. It also adds a test suite, which might conflict with what's in the tests branch (but probably not in any complicated way, since each branch just adds more tests).

Note that I think the diff between #20 and this PR might technically render #20 redundant: #20 merges each model's estimates into a data frame to ensure that the estimates don't get mismatched, whereas this PR (hopefully) ensures that the models all run with the same number of variables in the first place, so they should produce the same number of estimates in the same order, making mismatches impossible. However, I think doing the decomposition math within a data frame is more robust anyway, since it ensures that the correct rows stay locked together, and the code isn't even that much more complicated. So I'd recommend keeping it as-is in this PR, but @sinanpl, if you strongly prefer the slightly simpler code we had before #20, then I bet it's possible to change it back without breaking much, if anything.
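(For anyone skimming, here's a minimal sketch of the merge-by-term idea from #20. The function and column names are hypothetical, not the package's actual internals:)

```r
# Hypothetical sketch: merge each model's coefficients into one data
# frame keyed by term, so the decomposition arithmetic can never pair
# up mismatched rows.
align_estimates <- function(mod_a, mod_b) {
  ests_a <- data.frame(term = names(coef(mod_a)), beta_a = unname(coef(mod_a)))
  ests_b <- data.frame(term = names(coef(mod_b)), beta_b = unname(coef(mod_b)))
  # all = TRUE keeps terms that appear in only one model as NA rows,
  # so any remaining mismatch is visible rather than silent
  merge(ests_a, ests_b, by = "term", all = TRUE)
}
```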

As for the tests, this mostly adds three new ones on top of what's in #20, all comparing different types of results to Stata output: one for results run using dummies, one for results without baseline invariance run using a factor variable, and one for results with baseline invariance using a factor variable. The Stata results for these comparisons are saved in a new directory for test fixtures (as Hadley recommends), along with numbered scripts to create them and the small intermediate data files that Stata needs for input and output. I also had to tweak a couple of the tests from #20, both to make them more informative and because some of the changes in this PR changed the baseline category for some of the tests, for reasons I don't understand and mentioned in the commit comment for 87c501944701d28538fdc9fa9742c97d1f65e822.
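(Roughly, each of those comparisons follows this shape; the fixture file name, the formula syntax, and the result fields below are illustrative assumptions on my part, not the PR's actual test code:)

```r
# Illustrative testthat sketch: compare package output to results
# exported from Stata and stored under the new fixtures directory.
test_that("factor-variable results match the Stata fixture", {
  stata <- read.csv(test_path("fixtures", "stata_factor_results.csv"))
  fit <- OaxacaBlinderDecomp(
    wage ~ education | gender,   # hypothetical formula and data
    data = example_data,
    baseline_invariant = TRUE
  )
  # result field names here are assumptions
  expect_equal(fit$gaps$explained, stata$explained, tolerance = 1e-6)
})
```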

So, lots here obviously -- sorry for the big PR, but I couldn't figure out a way to make it smaller without breaking lots of stuff. As it stands I don't think there are any breaking changes, except possibly for those changes in the reference category for factor variables in some locales.

Of course @sinanpl let me know if I can help clarify or change any of this and I'll see what I can do!

davidskalinder commented 4 months ago

I just realized that an additional change will be required when this PR is merged with #15. I'm not sure how to PR that change, since the new code will (I think) break either branch on its own, but the merge of the two will break without it. @sinanpl, if you know of some GitHub-fu I should do to resolve this, please let me know.

Here is the diff of the change that will need to happen once the two branches are merged. (FWIW, this is currently in my testing branch, which I don't plan to ever PR, at 56b29f651.)

```diff
--- a/R/oaxaca.R
+++ b/R/oaxaca.R
@@ -405,9 +405,8 @@ get_bootstrap_ests = function(formula, data, n_bootstraps, type, pooled, baselin
         baseline_invariant = baseline_invariant
       )
       out$gaps <- calculate_gap(
-        formula,
-        model.frame(fitted_models$mod_a),
-        model.frame(fitted_models$mod_b)
+        fitted_models$group_a$y,
+        fitted_models$group_b$y
       )
       out
     })
```
davidskalinder commented 4 months ago

> As for the tests, this mostly adds three new ones

And one more (failing in d860b84, passing in cc3c725) to test whether the results are the same when a categorical variable has levels with strange characters like spaces or single-quotes.

It looks like model.frame() auto-converts these to dots (in the C code, natch), so I was ending up with two versions of these names when extracting betas and means from the original model matrix. cc3c725 makes it so that we get the metadata about which variables are categorical (i.e., factors) from the original model matrix, but uses the final model matrix for all the results.
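The mangling itself is easy to reproduce in base R; I believe it follows the same rules as make.names() (that attribution is my assumption):

```r
# Base R's name mangling replaces syntactically invalid characters
# with dots, so levels with spaces or single-quotes come out looking
# like entirely different names
make.names(c("South East", "O'Brien"))
#> [1] "South.East" "O.Brien"
```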