Separate backend for tidy prediction

alexpghayes commented 6 years ago

I've been reworking the augment() methods and it's rapidly becoming clear that dealing with idiosyncratic predict() methods is going to slow down progress immensely.

In the end broom and parsnip are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistencies sake. I think we should move this functionality to a separate package. Potentially we could use the prediction package by Thomas Leeper, but we should decide on the behavior we want first.

If we define a new generic / series of generics, we can then test these behaviors in modeltests and allow other modelling package developers to guarantee that their predict() methods are sane and consistent.

What I want from a predict method:

Returns a tidy tibble
Never drops missing data (i.e. matches the behavior of predict.lm(..., na.action = na.pass))
Consistent naming of fitted values
Uncertainty in predictions

I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.

topepo commented 6 years ago

Returns a tidy tibble

That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.

Never drops missing data

I agree on this. It is something that can happen after prediction if people really don't want those data.

As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.

Consistent naming of fitted values

👍

We should think about the names though. Let's say that we choose pred as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?

Uncertainty in predictions

Yes, but not be default (if that's what you mean) since 1) most models don't implement them and 2) they might be computationally taxing and, if you don't want them, why would you take the time to generate them. We should have a formal syntax for getting them though; I just disagree with the defaultedness (sp?)

> Potentially we could use the prediction package by Thomas Leeper Maybe but I don't want to automatically return the original data along with the prediction.

alexpghayes commented 6 years ago

That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.

The major argument for always having tibble predictions is type safety. I think it's the right move, especially since yardstick expect predictions to live in a tibble and we've been telling everyone to put everything in a tibble for some time now.

As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.

This is something I'm running into with broom::augment() right now that isn't as straightforward as I would have hoped. Requiring that nrow(predictions) == nrow(data) makes sense for univariate prediction, but multivariate models sometimes have a more natural representation in long form. For example, considering the joineRML augment method:

library(joineRML)
#> Loading required package: nlme
#> Loading required package: survival
data(heart.valve)
hvd <- heart.valve[!is.na(heart.valve$log.grad) &
                     !is.na(heart.valve$log.lvmi) &
                     heart.valve$num <= 50, ]
fit <- mjoint(
  formLongFixed = list(
    "grad" = log.grad ~ time + sex + hs,
    "lvmi" = log.lvmi ~ time + sex
  ),
  formLongRandom = list(
    "grad" = ~ 1 | num,
    "lvmi" = ~ time | num
  ),
  formSurv = Surv(fuyrs, status) ~ age,
  data = hvd,
  inits = list("gamma" = c(0.11, 1.51, 0.80)),
  timeVar = "time"
)
#> Running multivariate LMM EM algorithm to establish initial parameters...
#> Finished multivariate LMM EM algorithm...
#> EM algorithm has converged!
#> Calculating post model fit statistics...

au <- broom::augment(fit)
colnames(au)
#>  [1] "num"            "sex"            "age"            "time"          
#>  [5] "fuyrs"          "status"         "grad"           "log.grad"      
#>  [9] "lvmi"           "log.lvmi"       "ef"             "bsa"           
#> [13] "lvh"            "prenyha"        "redo"           "size"          
#> [17] "con.cabg"       "creat"          "dm"             "acei"          
#> [21] "lv"             "emergenc"       "hc"             "sten.reg.mix"  
#> [25] "hs"             ".fitted_grad_0" ".fitted_lvmi_0" ".fitted_grad_1"
#> [29] ".fitted_lvmi_1" ".resid_grad_0"  ".resid_lvmi_0"  ".resid_grad_1" 
#> [33] ".resid_lvmi_1"

Created on 2018-08-07 by the reprex package (v0.2.0).

Here it would make sense to have an id / outcome column and .fitted and .resid rather than .fitted_outcome_1, .resid_outcome_1, etc.

Dave pretty strongly believes that it is better for users to spread data than to gather it. I'm not entirely sure how I feel about this, but the nice thing about returning data in a long format is you have predictable columns and unpredictable rows, as opposed to predictable rows and unpredictable columns. I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.

We should think about the names though. Let's say that we choose pred as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?

I have some brainstorming about this at https://github.com/tidymodels/broom/issues/452. My impression from working with broom is that we should use the most specific name possible. In some sense having a statistic column is nice because of the consistency, but I pretty much always find myself wonder what statistic is actually in that column. The documentation is cleaner that way too, since there's less conceptual overloading of the same term.

I just disagree with the defaultedness (sp?)

I agree, uncertainty should not be reported by default. What I'm playing in augment() right now is an se_fit argument that defaults to FALSE. Then for methods that don't support some measure of uncertainty, there just is no se_fit argument.

alexpghayes commented 6 years ago

And I really do think that we want to export the tests for whatever specification we come up with.

topepo commented 6 years ago

Yes, absolutely.

alexpghayes commented 6 years ago

Other thoughts:

Arguments should always default to NULL and missing arguments should error
The default behavior should be to produce a prediction of the same type as the original response data

This should also all end up in principles eventually.

alexpghayes commented 6 years ago

Another decision (related to #37): should uncertainty just be reported as a se.fit column, or should it be reported as intervals.

leeper commented 6 years ago

Also, just to throw out edge cases:

predictions from bayesian models typically aren't single values
sometimes features relevant to predictions are appended as attributes or other elements in the response lists from predict() methods

I'd say a data frame-like structure is the way to go as it's trivial to add columns. A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other. For my purposes, I decided to always do both but in practice it means callling most predict() methods twice.

topepo commented 6 years ago

A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other.

Some models don't generate class probabilities. I keep them separate as a default but provide an api to get both at once (with a single call to get the probabilities).

leeper commented 6 years ago

I should also add that I'd be really happy to see broom and prediction integrated into one package if we can find a way to make an integrated version serve both our purposes. The key functionality for me is actually for the margins package, which requires the following:

prediction::build_datalist() to get predictions over specified out-of-sample cases
Some of the convenience functions for getting specific predictions, like seq_range(), mean_or_mode(), etc. that are currently in prediction but could easily live somewhere else
A consistent data frame (or tibble) response from prediction() (or its equivalent)

I imagine it would be beneficial to the community to have one really good package doing this rather than various packages serving slightly different purposes.

topepo commented 6 years ago

If the scope is really big (and that sounds like the case), I don't think that one package is the answer. That's how caret got to where it is. Modular is better (generally).

parsnip is designed to be the unified interface to models (fitting, prediction, etc) but there is a lot that it doesn't do (e.g. resampling, tuning parameters, etc) and that is by design. This is not meant to imply that you should use it. Using it as an example, put the infrastructure is packages with reasonable scope, then create a package with a higher-level api that give you what you want.

What you're saying is good; let's agree on some standards and develop form there so we don't obstruct each other (and we can import each other without reformatting output).

alexpghayes commented 6 years ago

I agree that a standalone prediction package is a good idea. predictions is almost it, but doesn't provided all the behavioral guarantees we'd like just yet.

Re: bayesian predictions: tidybayes just hit CRAN. Developing an interface for manipulating posterior samples certainly is valuable, but it's not something I feel wildly compelled to do at the moment. For a wrapper predict method, I think a reasonable standard would be to provide a point estimate and standard errors (or bounds on a credible interval).

leeper commented 6 years ago

If you think it will work for you with modifications, let me know what's missing and I'll try to get it all implemented.

topepo commented 6 years ago

How does this should for a prediction standard (pinging @hadley, @artichaud1, and @DavisVaughan for more input) :

Any code to produce predictions should follow these conventions:

The return value is a tibble with the same number of rows as the data being predicted. This implies that na.action should not affect the shape of the outcome object.
The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
For univariate, numeric point estimates, the column should be named pred. For multivariate numeric predictions (excluding probabilities), the columns should be named {outcome name}_pred.
For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
For predictions that are not simple scalars, such as distributions or non-rectangular structures, the pred column should be a list-column. Examples:
- If a posterior distribution is returned for each sample, each element of the list column can be a tibble with as many rows as samples in the distribution.
- When a predict method produces values over multiple tuning parameter values (e.g. glmnet), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda in glmnet).
- In time to event models, where survivor probabilities are produced for values of time, the return tibble has a column for times.
- When using a quantile regression, one might make the median the default that is predicted. If multiple percentiles are requested, then pred would be a tibble with a column for the predictions and another for the percentile.
  - Class predictions should be factors with the same levels as the original outcome.
  - By default, newdata should not be returned by the prediction function.
  - The function to produce predictions should be a class-specific predict method with argument names object, newdata, and type. Other arguments, such as alpha, should be standardized and not named without discussion on RStudio Community.
  - The main predict method can defer to separate, unexported functions (predict_class, etc).
  - type should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", "link", and "raw". Other values should be assigned with consensus.
  - If newdata is not supplied, an error should be thrown.
  - The prediction code should work when newdata has multiple rows or a single row.
  - If there is a strong need for producing output during the prediction phase, there should be a verbose option that defaults to no output.

<edit 1> added "link"

<edit 2> added note about single point predictions.

<edit 3> another example for quantile regression.

leeper commented 6 years ago

Class predictions should be factors with the same levels as the original outcome.

Named?

type should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", and "raw". Other values should be assigned with consensus.

What about "link"?

If newdata is not supplied, an error should be thrown.

Very good!

If there is a strong need for producing output during the prediction phase, there should be a verbose option that defaults to no output.

How about getOption("verbose", FALSE)?

DavisVaughan commented 6 years ago

The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).

Depending on the state of sloop::reconstruct() integration with dplyr, this could be really simple or a manual (but not difficult) process.

For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).

Any reason to want to format like {factor_label}_prob instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion to pred.

If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.

Should this include the confidence level? lower_95 or something similar. See the forecast pkg for examples:

library(forecast)
fit <- auto.arima(WWWusage)
forecast(fit, h=20, level=95)
    Point Forecast    Lo 95    Hi 95
101       218.8805 212.6840 225.0770
102       218.1524 203.3133 232.9915
103       217.6789 194.1786 241.1792
104       217.3709 185.6508 249.0910
...

topepo commented 6 years ago

Named?

Not sure what you mean.

How about getOption("verbose", FALSE)?

Maybe... there might be a lot of verbose globals that get made so it would have to be something like my_model_verbose. There are good arguments to make that verbosity should be able to be set at different levels (e.g. model fit, feature selection, etc).

topepo commented 6 years ago

@DavisVaughan

Any reason to want to format like {factor_label}_prob instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion to pred.

That's good for standardization but bad for consuming those values. So if I go to make a ggplot, then a lot of rename_at will be going on.

Let me think about that.

Should this include the confidence level?

No. Then you've encoding data into the name and you can't predict what the name will be easily. I'd rather set an attribute for the level (so that it is there) but not add it to the same or the value.

topepo commented 6 years ago

@leeper

Named?



Sorry, I get it now.

My inclination is to also call it `pred` but I'm torn between that and `class_pred`. I plan on making a new data structure with S3 class `class_pred` so maybe `pred_class`? 

That could get out of hand easily though. Do we end up with `pred_time`, `pred_post_mode` and so on.

leeper commented 6 years ago

Agree it's messy but I think it's sensible to keep class predictions as a separate thing. For example, for margins, I use predictions for numerical derivatives and want to always be able to count on those being numeric.

alexpghayes commented 6 years ago

Naming: I'm not a huge fan of pred. See also some conventions the Stan crew just went with. I think full singular nouns is the way to go.

The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).

I think that ideally as much information as possible should be contained in the tibble itself. Users are much more familiar working with tibbles than attributes. For example, for intervals, I'd strongly prefer a column width to an attribute. This would make working with multiple intervals at once nicer as well (for plotting, say).

For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).

No indication that the columns are probabilities in the name?

If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.

Yes.

If a posterior distribution is returned for each sample, each element of the list column can be a tibble with as many rows as samples in the distribution.

Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.

When a predict method produces values over multiple tuning parameter values (e.g. glmnet), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda in glmnet).

I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a predict() method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to define predict() in this case. If we are going to define fit at all, I think one sensibly option is:

predict.many_fits <- function(...) {
   best_fit <- extract_best_fit(many_fits)
   predict(best_fit)
}

to get predictions for different hyperparameter values I'd much prefer a workflow like:

map(many_fits, predict) # many_fits basically a list of fitted objects

But I think we should be really careful here.

If newdata is not supplied, an error should be thrown.

Strongly second this, and came here to recommend it. This is an issue because models often drop rows with missing data silently, which leads to all sorts of nasty surprises and after the fact attempts to determine precisely which rows got dropped.

topepo commented 6 years ago

Naming: I'm not a huge fan of pred. See also some conventions the Stan crew just went with. I think full singular nouns is the way to go.

How about .pred, .pred_class, .pred_{class levels}? .predicted or .prediction is too long, especially when combined with class levels.

I think that ideally as much information as possible should be contained in the tibble itself.

I don't want to do that. It's replicating data that has a single value into a column that is potentially large.

Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.

That's one example of many where the prediction isn't a scalar. Even so, I've had applications where I've needed the full posterior distribution for each sample.

I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a predict() method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to define predict() in this case.

I'm only talking about doing it in the cases where the predict method make simultaneous predictions across many parameters from a single fitted model object. That's the whole "submodel trick" that caret uses for efficiency and it would be lost by repeated calling of predict for each parameter value using the same model object (when the underlying predict code gives them all to you at once).

klahrich commented 6 years ago

I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.

Not sure if this is still up for debate, and although I favored wide format in the past, more often than not I ended up going to long format for analysis, because it allowed me to standardize my workflow:: filter, group_by, summarise -> send to ggplot.

In the same vein as @alexpghayes point, I feel that it removes some of the weight in the decision making, i.e. it makes it easier to change ideas whenever, with a lower probability (or severity) of breaking stuff.

kevinykuo commented 6 years ago

This one pops up on my dashboard every day, so I'll join the party ;)

Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default. This would also allow .prediction as a column name, which is clearer than pred/.pred. Also, we might not always have names of factor levels.

topepo commented 6 years ago

Also, we might not always have names of factor levels.

spark may not, but that will be illegal in the modeling new world order 😃

topepo commented 6 years ago

Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default.

They are probably going to end up wide in a lot of cases anyway.

In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list. tidyr (or built in gather/spread methods) can solve going from wide -> long easily.

kevinykuo commented 6 years ago

In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list.

These are definitely subjective points but imo

4-5 columns is one thing, but 1000 class probability columns doesn't feel as tidy, and
It's easier to specify the contract/expectations of what you get if you have long rather than wide data.

topepo commented 6 years ago

4-5 columns is one thing, but 1000 class probability columns doesn't feel as tidy, and

True but I don't want extreme cases to drive the bus

It's easier to specify the contract/expectations of what you get if you have long rather than wide data.

Unless you are going to merge these results with the original data (which is pretty common). Plus, you already have wide data anyway with classes and probabilities. If you stack the probabilities then you have a class column that is replicated. Why duplicate data?

kevinykuo commented 6 years ago

Oh I didn't mean we should stack probabilities (although upon re-reading it does sound like it!). I'm thinking about appending .prediction, .predicted_class, and .class_probabilities to the dataset for classification. Having to write paste0("pred_", lvl) to get the right column feels brittle, although practically it likely won't matter. I guess going down this line of reasoning, we'd also have .prediction_interval and friends (and .prediction if we have multiple outputs) as list columns, and this may deviate too much from what folks are used to. So yeah, if we're going to do lower and upper, it would be consistent to have pred_setosa and pred_versicolor. 😄

topepo commented 6 years ago

I think that parsnip will have functions to make predictions across submodels (maybe predict_submodel?). I think that I'm leaning towards restricting predict to produce a single model prediction per object.

Similarly, add predict_interval that will produce .pred_lower and .pred_upper and also add predict(object, newdata, type = "interval") that references it. I'll prototype it with lm, stan, and mars.

<edit> This would be in the service about predict being restricted to returning the same number of rows as newdata.

DavisVaughan commented 6 years ago

So for predict_boost_tree() would it have a type argument as well so you can specify class vs prob? And then it would just have a default type based off mode if no type is specified, like the current predict.model_fit() has?

And you could call predict(boost_fit, newdata, type = "whatever") which would just call predict_boost_tree()?

The separate prediction functions per submodel might be a nice way to modularize any nuances that might come up for the prediction methods of a specific model type, and would be pretty extensible.

topepo commented 6 years ago

So for predict_boost_tree() would it have a type argument as well so you can specify class vs prob? And then it would just have a default type based off mode if no type is specified, like the current predict.model_fit() has?

Close. There is not predict_boost_tree, just predict.model_spec. I'm about to merge in a devel branch that shows how predict will pass off to different functions. I should just merge that it.

And you could call predict(boost_fit, newdata, type = "whatever") which would just call predict_boost_tree()?

It will call predict.model_spec use use type to redirect.

juliasilge commented 4 years ago

Prediction norms have been fairly standardized for us now. Thanks to all for your thoughts! 🙌

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / parsnip

Separate backend for tidy prediction #41