Next steps for `augment()`

alexpghayes commented 6 years ago

This is a followup to #32, but an attempt to start answering "what should augment() look like"?

I'm of the belief that augment() should be split into three different generics:

add_predictions(x, new_data, ...)
- Primarily for adding predictions to the original data for supervised models
- Thin wrapper around bind_cols(new_data, safepredict::safe_predict(x, new_data, ...))
- Other possible names: bind_predictions(), add_predicted()
- Has most of the same arguments as safepredict::safe_predict()
add_training_info(x, data, ...)
- Adds information specific to the training data to data (influences measures, residuals, etc)
- Should inherent a cleaned up version of the arguments to influence.measures(), residuals(), rstandard() and rstudent().
One version of augment() that doesn't have a data or new_data argument, for transformations that aren't maps (i.e. t-SNE, time series decompositions). Not sure what to call it.

A big question is how maintain backwards compatibility with previous versions of broom. cc @dgrtwo @topepo

dgrtwo commented 6 years ago

So add_predictions is meant to replace the modelr function with that name, and with a reversed pair of arguments? I'd be comfortable with that.

You're then saying that what augment() currently does on e.g. a linear model would require two steps, add_predictions(mod) and add_training_info(mod)? In that case, it sounds like augment could be refactored but with basically the same behavior, and would call:

Both add_predictions and add_training_info if there's a data argument
Only add_training_info if there's a new_data argument

At this point, an important question is whether add_training_info would still add .fitted (it's useful even on training data). But I think if right now augment adds everything it can, it makes sense to keep augment as the "do everything you can" option.

I'm also not sure add_training_info reads well but can't think of a better name.

topepo commented 6 years ago

On slack earlier today, we discussed the difference between getting statistics for the model on the training set and for other (presumably new) data.

I'm thinking that we should advocate that the model object should not contain any observation-level data from the training set (e.g. residuals or fitted values). Summary statistics, such as the R-sq, RMSE, logLik, etc, should be saved instead.

My reasoning is that keeping these values, which could be recreated, just makes the model object unnecessarily large. It's short sighted in the same way as the decision for R to keep all data in memory. It might be a little more work to get these values, but it would make the model objects more sane.

hadley commented 6 years ago

Also I think it leads people to a misleading mental model where they think of (e.g.) residuals as properties of the model, not a combination of model + data.

r-lib / generics

Next steps for `augment()` #33