Open alexpghayes opened 6 years ago
So add_predictions
is meant to replace the modelr function with that name, and with a reversed pair of arguments? I'd be comfortable with that.
You're then saying that what augment()
currently does on e.g. a linear model would require two steps, add_predictions(mod)
and add_training_info(mod)
? In that case, it sounds like augment
could be refactored but with basically the same behavior, and would call:
add_predictions
and add_training_info
if there's a data argumentadd_training_info
if there's a new_data argumentAt this point, an important question is whether add_training_info
would still add .fitted
(it's useful even on training data). But I think if right now augment adds everything it can, it makes sense to keep augment as the "do everything you can" option.
I'm also not sure add_training_info
reads well but can't think of a better name.
On slack earlier today, we discussed the difference between getting statistics for the model on the training set and for other (presumably new) data.
I'm thinking that we should advocate that the model object should not contain any observation-level data from the training set (e.g. residuals or fitted values). Summary statistics, such as the R-sq, RMSE, logLik, etc, should be saved instead.
My reasoning is that keeping these values, which could be recreated, just makes the model object unnecessarily large. It's short sighted in the same way as the decision for R to keep all data in memory. It might be a little more work to get these values, but it would make the model objects more sane.
Also I think it leads people to a misleading mental model where they think of (e.g.) residuals as properties of the model, not a combination of model + data.
This is a followup to #32, but an attempt to start answering "what should
augment()
look like"?I'm of the belief that
augment()
should be split into three different generics:add_predictions(x, new_data, ...)
bind_cols(new_data, safepredict::safe_predict(x, new_data, ...))
bind_predictions()
,add_predicted()
safepredict::safe_predict()
add_training_info(x, data, ...)
data
(influences measures, residuals, etc)influence.measures()
,residuals()
,rstandard()
andrstudent()
.augment()
that doesn't have adata
ornew_data
argument, for transformations that aren't maps (i.e. t-SNE, time series decompositions). Not sure what to call it.A big question is how maintain backwards compatibility with previous versions of
broom
. cc @dgrtwo @topepo