Open davidskalinder opened 2 months ago
Hmph, something funny is going on with this one -- all the tests pass in the build environment, but at least "0-variance baseline-adjusted IV results match Stata" fails interactively. (Other tests might also fail interactively, but I haven't tried them all yet.)
This usually means something fishy is going on with reference categories (since the different environments have different sort orders and therefore different reference categories are chosen in models).
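For what it's worth, the mechanism is easy to see outside the package (toy snippet, not package code):

```r
# Toy example, not package code: factor() orders its levels with sort(), and
# lm()/model.matrix() treat the first level as the reference, so a different
# collation order can silently change which category becomes the baseline.
x <- c("a", "B")
levels(factor(x))
#> "B" "a"  in the C locale (uppercase sorts first), so "B" becomes the reference
#> "a" "B"  in a typical en_US locale, so "a" becomes the reference instead
```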
So I should debug this further. @sinanpl of course don't let this stop you if you want to review the PR, but we probably shouldn't accept it until we figure out why the interactive test is failing.
Okay pretty sure I know what's going on here. When I cajole Stata into using the same reference categories that R automatically chooses in my interactive environment, Stata and `OaxacaBlinderDecomp()` produce the exact same output at every step: same fits, same adjusted betas, same EXs, same results. So we are matching Stata up to choice of reference category.
The strange thing is that the baseline invariance procedure is supposed to be, well, invariant to the baseline -- the results should be the same no matter what reference categories are picked for the IVs.
So I'm beginning to suspect that this is actually a rare bug in Jann's Stata implementation. What's happening is that when there are zero-variance levels of a categorical variable, they are adjusted using the grand mean of all the levels, even the missing ones. The result is that, instead of being handled on its own, the baseline category gets treated as just one of many categories with a beta of zero, all of which wind up with the same adjusted beta. If the baseline is changed, though, the category that used to be the baseline gets a non-zero value while the new baseline category gets treated like the missing ones -- so the results depend on which level is chosen as the baseline.
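Here's a toy illustration of what I think is going on (made-up numbers, and my reading of the suspected adjustment rather than Jann's actual code):

```r
# Toy numbers; my reading of the suspected adjustment, not Jann's actual code.
# One factor with levels A-D; D has zero variance, so its beta is fixed at 0.

adjust <- function(betas) betas - mean(betas)  # centre every level on the grand mean

# Reference = A: fitted betas are A = 0 (baseline), B = 2, C = 4, D = 0 (zero variance)
adjust(c(A = 0, B = 2, C = 4, D = 0))
#>    A    B    C    D
#> -1.5  0.5  2.5 -1.5   <- baseline A and zero-variance D come out identical

# Reference = B: the same fit re-expressed gives A = -2, C = 2, D still 0
adjust(c(A = -2, B = 0, C = 2, D = 0))
#>  A  B  C  D
#> -2  0  2  0           <- D shifts relative to the other levels, so the answer
#                           now depends on which level is the baseline

# Excluding the zero-variance level from the grand mean restores invariance:
adjust(c(A = 0, B = 2, C = 4))   # A = -2, B = 0, C = 2
adjust(c(A = -2, B = 0, C = 2))  # A = -2, B = 0, C = 2 -- same either way
```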
All that said, I'm not confident enough to contradict Jann before somebody smarter agrees with me. So I think I should report this on Jann's repo and see what he makes of it. And in the meantime, I think we should keep mirroring his functionality.
The question then becomes what to do about these failing tests. I think the best solution is to have that test use factor IVs and to set the levels explicitly to force the calculation to use the same baselines Stata does.
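Concretely, something like this for the test data (variable and level names are made up):

```r
# Variable and level names are made up; the point is to pin the level order so
# the first level (the reference) is the one Stata uses, regardless of the
# locale or sort order of the environment running the tests.
test_df$education <- factor(test_df$education,
                            levels = c("primary", "secondary", "tertiary"))
# or, equivalently, keep whatever levels exist and just move the reference:
test_df$education <- relevel(factor(test_df$education), ref = "primary")
```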
Mind you, that does get rid of that test's ability to confirm that we are converting logical and character IVs to factors correctly. Unfortunately, the fragility of how R chooses baselines makes this tricky to test when there's no true baseline invariance. But maybe I should see if I can spin up a comparison that will be robust across testing environments.
So, I think that leaves these tasks for me:

- Rework the failing test to use factor IVs with explicitly set levels so it uses the same baselines Stata does
- See whether the logical/character-to-factor conversion can be checked in a way that's robust across testing environments
- Report the suspected baseline-invariance bug on Jann's repo
Okay should be all set now. I still need to report the bug to Jann, but this PR should be good to go with all tests passing in both the automated and interactive environments. So @sinanpl, feel free to have at it!
Just reported this at https://github.com/benjann/oaxaca/issues/1. The repo was last updated a little over a year ago, so hopefully Jann will see it there; if not then it might be a good idea for me to send him an email.
Okay @sinanpl, this one is a pretty major refactoring of the way factor levels are handled.
My main motivation for this one was that I'm looking at alternate strategies for baseline adjustment and so wanted to abstract the code that finds reference levels of factors. However, once I started in on that, I realized that wrangling all the levels of factors (many of which occasionally get dropped in various places) was one of the most fragile elements of the whole calculation chain: for instance, that's the whole reason I had to put in the hack of using `assemble_models()`. So having the factor-wrangling function also enabled me to get rid of the model assembly business altogether and to use the tidied-up factor levels to refactor all the business of extracting betas, calculating EXs, and adjusting baselines.

The only "breaking" change I noticed is actually a fix of a bug in the previous setup, namely that the second reference category of a factor was sometimes being chosen incorrectly when one of the groups didn't contain the first reference category. This affected a few of the tests, so I updated the tests to use the backup reference category that we should have been asking for in the first place.
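To give a flavour of the kind of helper I mean (names and details here are illustrative, not the actual code in the PR):

```r
# Illustrative sketch only -- not the actual helper in the PR. Per factor IV,
# work out the full level set plus a reference level, falling back to the
# first level that both groups actually contain, so the backup reference can
# never be a level that's missing from one of the groups.
collect_factor_levels <- function(group_a, group_b, var) {
  all_levels <- union(levels(factor(group_a[[var]])),
                      levels(factor(group_b[[var]])))
  in_both <- intersect(unique(as.character(group_a[[var]])),
                       unique(as.character(group_b[[var]])))
  reference <- if (all_levels[1] %in% in_both) {
    all_levels[1]
  } else {
    intersect(all_levels, in_both)[1]  # first level present in both groups
  }
  list(levels = all_levels, reference = reference)
}
```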
Lots of code has changed in this PR, so I anticipate future development will be quite hard to merge in if this lingers. @sinanpl, do you know when you might get a chance to review it? If it'll be a while, I could either accept it to `dev` myself (hopefully after I'm able to use it a bit more), or I could make some even-more-dev branch to accept it into? Let me know what you think best.