r-lib / generics

Common generic methods
https://generics.r-lib.org/
Other
61 stars 13 forks source link

Next steps for `augment()` #33

Open alexpghayes opened 6 years ago

alexpghayes commented 6 years ago

This is a followup to #32, but an attempt to start answering "what should augment() look like"?

I'm of the belief that augment() should be split into three different generics:

A big question is how maintain backwards compatibility with previous versions of broom. cc @dgrtwo @topepo

dgrtwo commented 6 years ago

So add_predictions is meant to replace the modelr function with that name, and with a reversed pair of arguments? I'd be comfortable with that.

You're then saying that what augment() currently does on e.g. a linear model would require two steps, add_predictions(mod) and add_training_info(mod)? In that case, it sounds like augment could be refactored but with basically the same behavior, and would call:

At this point, an important question is whether add_training_info would still add .fitted (it's useful even on training data). But I think if right now augment adds everything it can, it makes sense to keep augment as the "do everything you can" option.

I'm also not sure add_training_info reads well but can't think of a better name.

topepo commented 6 years ago

On slack earlier today, we discussed the difference between getting statistics for the model on the training set and for other (presumably new) data.

I'm thinking that we should advocate that the model object should not contain any observation-level data from the training set (e.g. residuals or fitted values). Summary statistics, such as the R-sq, RMSE, logLik, etc, should be saved instead.

My reasoning is that keeping these values, which could be recreated, just makes the model object unnecessarily large. It's short sighted in the same way as the decision for R to keep all data in memory. It might be a little more work to get these values, but it would make the model objects more sane.

hadley commented 6 years ago

Also I think it leads people to a misleading mental model where they think of (e.g.) residuals as properties of the model, not a combination of model + data.