davidskalinder commented 9 months ago

As I mentioned in https://github.com/sinanpl/OaxacaBlinder/issues/5#issuecomment-1981561085, it might be a good idea to include the full output of each lm() call in the OaxacaBlinderDecomp results object.

Some pros and cons I can think of:

Pro

It'd allow functions like summary() and coef() to query more stuff from the fitted models
It'd tightly couple info about the models to the models that were actually fit, eliminating stuff like #5
It'd defer to lm() the decisions about what info to include in results (aside from the stuff that's unique to Blinder-Oaxaca)
It might make it easier to build further extensions like broom tidiers

Con

It could make OaxacaBlinderDecomp objects bigger (unless the current meta$data is removed)
The teeming throngs of OaxacaBlinder users might start to depend on the extra results, which could make it hard to take them away later if we change our minds

FWIW, oaxaca::oaxaca()'s output does seem to include the full fit objects, although its results are pretty maximal in general so I'm not sure that's something to emulate.

I'm on the fence about this one: I think I lean toward including the full lm fits eventually; but right now this only really affects #5, so maybe it's worth being more conservative and not including them until we have more actual use cases?

5 does depend on this though, so @sinanpl let me know what you think.

sinanpl commented 8 months ago

can be considered! i don't expect the object size to be a large issue here if the lm and OaxacaBlinderDecomp are cleverly stripped from overlapping data. in retrospect, it also helps debugging

i am not sure about how this relates to #5 - which is an issue that has to be solved regardless of including model fits, right?

davidskalinder commented 8 months ago

Sounds good. I don't think this is a super-high priority for me just yet, but I guess if it becomes so I can try to code this in and send a PR?

cleverly stripped from overlapping data

What do you mean by this? (Though frankly, your point that object size might not be worth worrying about could be right regardless.)

i am not sure about how this relates to #5 - which is an issue that has to be solved regardless of including model fits, right?

Yes, in that issue the group sizes are calculated wrong because summary() queries the original input data and handles the NAs wrong. But if the lm objects were in the OaxacaBlinderDecomp object, as this issue suggests, then I think it'd be easy to fix #5 by just having summary() query the model frame from the OaxacaBlinderDecomp's lm objects (in which the NAs are already properly handled).

sinanpl commented 8 months ago

Sounds good. I don't think this is a super-high priority for me just yet, but I guess if it becomes so I can try to code this in and send a PR?

Sure!

cleverly stripped from overlapping data

What do you mean by this? (Though frankly, your point that object size might not be worth worrying about could be right regardless.)

Fitted models to be included in these cases are: model for group A, model for group B and model for group A + B. Each lm object contains the source data. In this case, we could choose to remove the source data and only inlude one copy of the input dataset. Tbh, I don't expect this to be a big problem nonetheless. If you feel like starting such a feature, feel free to include the 3 models as-is including all data.

Yes, in that issue the group sizes are calculated wrong because summary() queries the original input data and handles the NAs wrong. But if the lm objects were in the OaxacaBlinderDecomp object, as this issue suggests, then I think it'd be easy to fix #5 by just having summary() query the model frame from the OaxacaBlinderDecomp's lm objects (in which the NAs are already properly handled).

Noted, if models are included, the count can be extracted from the model objects.

davidskalinder commented 7 months ago

Now depends #18 I believe.

sinanpl / OaxacaBlinder

Add full results of `lm` fits to `OaxacaBlinderDecomp()` output #9

Pro

Con

5 does depend on this though, so @sinanpl let me know what you think.