Closed alexpghayes closed 4 years ago
Returns a tidy tibble
That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.
Never drops missing data
I agree on this. It is something that can happen after prediction if people really don't want those data.
As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.
Consistent naming of fitted values
👍
We should think about the names though. Let's say that we choose pred
as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?
Uncertainty in predictions
Yes, but not be default (if that's what you mean) since 1) most models don't implement them and 2) they might be computationally taxing and, if you don't want them, why would you take the time to generate them. We should have a formal syntax for getting them though; I just disagree with the defaultedness (sp?)
That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.
The major argument for always having tibble predictions is type safety. I think it's the right move, especially since yardstick
expect predictions to live in a tibble and we've been telling everyone to put everything in a tibble for some time now.
As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.
This is something I'm running into with broom::augment()
right now that isn't as straightforward as I would have hoped. Requiring that nrow(predictions) == nrow(data)
makes sense for univariate prediction, but multivariate models sometimes have a more natural representation in long form. For example, considering the joineRML
augment method:
library(joineRML)
#> Loading required package: nlme
#> Loading required package: survival
data(heart.valve)
hvd <- heart.valve[!is.na(heart.valve$log.grad) &
!is.na(heart.valve$log.lvmi) &
heart.valve$num <= 50, ]
fit <- mjoint(
formLongFixed = list(
"grad" = log.grad ~ time + sex + hs,
"lvmi" = log.lvmi ~ time + sex
),
formLongRandom = list(
"grad" = ~ 1 | num,
"lvmi" = ~ time | num
),
formSurv = Surv(fuyrs, status) ~ age,
data = hvd,
inits = list("gamma" = c(0.11, 1.51, 0.80)),
timeVar = "time"
)
#> Running multivariate LMM EM algorithm to establish initial parameters...
#> Finished multivariate LMM EM algorithm...
#> EM algorithm has converged!
#> Calculating post model fit statistics...
au <- broom::augment(fit)
colnames(au)
#> [1] "num" "sex" "age" "time"
#> [5] "fuyrs" "status" "grad" "log.grad"
#> [9] "lvmi" "log.lvmi" "ef" "bsa"
#> [13] "lvh" "prenyha" "redo" "size"
#> [17] "con.cabg" "creat" "dm" "acei"
#> [21] "lv" "emergenc" "hc" "sten.reg.mix"
#> [25] "hs" ".fitted_grad_0" ".fitted_lvmi_0" ".fitted_grad_1"
#> [29] ".fitted_lvmi_1" ".resid_grad_0" ".resid_lvmi_0" ".resid_grad_1"
#> [33] ".resid_lvmi_1"
Created on 2018-08-07 by the reprex package (v0.2.0).
Here it would make sense to have an id / outcome
column and .fitted
and .resid
rather than .fitted_outcome_1
, .resid_outcome_1
, etc.
Dave pretty strongly believes that it is better for users to spread
data than to gather
it. I'm not entirely sure how I feel about this, but the nice thing about returning data in a long format is you have predictable columns and unpredictable rows, as opposed to predictable rows and unpredictable columns. I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.
We should think about the names though. Let's say that we choose pred as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?
I have some brainstorming about this at https://github.com/tidymodels/broom/issues/452. My impression from working with broom is that we should use the most specific name possible. In some sense having a statistic
column is nice because of the consistency, but I pretty much always find myself wonder what statistic is actually in that column. The documentation is cleaner that way too, since there's less conceptual overloading of the same term.
I just disagree with the defaultedness (sp?)
I agree, uncertainty should not be reported by default. What I'm playing in augment()
right now is an se_fit
argument that defaults to FALSE
. Then for methods that don't support some measure of uncertainty, there just is no se_fit
argument.
And I really do think that we want to export the tests for whatever specification we come up with.
Yes, absolutely.
Other thoughts:
NULL
and missing arguments should errorThis should also all end up in principles
eventually.
Another decision (related to #37): should uncertainty just be reported as a se.fit
column, or should it be reported as intervals.
Also, just to throw out edge cases:
predict()
methodsI'd say a data frame-like structure is the way to go as it's trivial to add columns. A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other. For my purposes, I decided to always do both but in practice it means callling most predict()
methods twice.
A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other.
Some models don't generate class probabilities. I keep them separate as a default but provide an api to get both at once (with a single call to get the probabilities).
I should also add that I'd be really happy to see broom and prediction integrated into one package if we can find a way to make an integrated version serve both our purposes. The key functionality for me is actually for the margins package, which requires the following:
prediction::build_datalist()
to get predictions over specified out-of-sample casesseq_range()
, mean_or_mode()
, etc. that are currently in prediction but could easily live somewhere elseprediction()
(or its equivalent)I imagine it would be beneficial to the community to have one really good package doing this rather than various packages serving slightly different purposes.
If the scope is really big (and that sounds like the case), I don't think that one package is the answer. That's how caret
got to where it is. Modular is better (generally).
parsnip
is designed to be the unified interface to models (fitting, prediction, etc) but there is a lot that it doesn't do (e.g. resampling, tuning parameters, etc) and that is by design. This is not meant to imply that you should use it. Using it as an example, put the infrastructure is packages with reasonable scope, then create a package with a higher-level api that give you what you want.
What you're saying is good; let's agree on some standards and develop form there so we don't obstruct each other (and we can import each other without reformatting output).
I agree that a standalone prediction package is a good idea. predictions
is almost it, but doesn't provided all the behavioral guarantees we'd like just yet.
Re: bayesian predictions: tidybayes
just hit CRAN. Developing an interface for manipulating posterior samples certainly is valuable, but it's not something I feel wildly compelled to do at the moment. For a wrapper predict method, I think a reasonable standard would be to provide a point estimate and standard errors (or bounds on a credible interval).
If you think it will work for you with modifications, let me know what's missing and I'll try to get it all implemented.
How does this should for a prediction standard (pinging @hadley, @artichaud1, and @DavisVaughan for more input) :
Any code to produce predictions should follow these conventions:
na.action
should not affect the shape of the outcome object. arrange
, filter
, etc.). pred
. For multivariate numeric predictions (excluding probabilities), the columns should be named {outcome name}_pred
. lower
and upper
. If a standard error is produced, it should be named std_error
. pred
column should be a list-column. Examples:
predict
method produces values over multiple tuning parameter values (e.g. glmnet
), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda
in glmnet
). time
, the return tibble has a column for times
. pred
would be a tibble with a column for the predictions and another for the percentile.
newdata
should not be returned by the prediction function. predict
method with argument names object
, newdata
, and type
. Other arguments, such as alpha
, should be standardized and not named without discussion on RStudio Community.predict_class
, etc). type
should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", "link", and "raw". Other values should be assigned with consensus. newdata
is not supplied, an error should be thrown. newdata
has multiple rows or a single row. verbose
option that defaults to no output. <edit 1>
added "link"
<edit 2>
added note about single point predictions.
<edit 3>
another example for quantile regression.
Class predictions should be factors with the same levels as the original outcome.
Named?
type should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", and "raw". Other values should be assigned with consensus.
What about "link"?
If newdata is not supplied, an error should be thrown.
Very good!
If there is a strong need for producing output during the prediction phase, there should be a verbose option that defaults to no output.
How about getOption("verbose", FALSE)
?
The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
Depending on the state of sloop::reconstruct()
integration with dplyr
, this could be really simple or a manual (but not difficult) process.
For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
Any reason to want to format like {factor_label}_prob
instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion to pred
.
If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
Should this include the confidence level? lower_95
or something similar. See the forecast
pkg for examples:
library(forecast)
fit <- auto.arima(WWWusage)
forecast(fit, h=20, level=95)
Point Forecast Lo 95 Hi 95
101 218.8805 212.6840 225.0770
102 218.1524 203.3133 232.9915
103 217.6789 194.1786 241.1792
104 217.3709 185.6508 249.0910
...
Named?
Not sure what you mean.
How about
getOption("verbose", FALSE)
?
Maybe... there might be a lot of verbose globals that get made so it would have to be something like my_model_verbose
. There are good arguments to make that verbosity should be able to be set at different levels (e.g. model fit, feature selection, etc).
@DavisVaughan
Any reason to want to format like
{factor_label}_prob
instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion topred
.
That's good for standardization but bad for consuming those values. So if I go to make a ggplot
, then a lot of rename_at
will be going on.
Let me think about that.
Should this include the confidence level?
No. Then you've encoding data into the name and you can't predict what the name will be easily. I'd rather set an attribute for the level (so that it is there) but not add it to the same or the value.
@leeper
Named?
Sorry, I get it now.
My inclination is to also call it `pred` but I'm torn between that and `class_pred`. I plan on making a new data structure with S3 class `class_pred` so maybe `pred_class`?
That could get out of hand easily though. Do we end up with `pred_time`, `pred_post_mode` and so on.
Agree it's messy but I think it's sensible to keep class predictions as a separate thing. For example, for margins, I use predictions for numerical derivatives and want to always be able to count on those being numeric.
Naming: I'm not a huge fan of pred
. See also some conventions the Stan crew just went with. I think full singular nouns is the way to go.
The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
I think that ideally as much information as possible should be contained in the tibble itself. Users are much more familiar working with tibbles than attributes. For example, for intervals, I'd strongly prefer a column width
to an attribute. This would make working with multiple intervals at once nicer as well (for plotting, say).
For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
No indication that the columns are probabilities in the name?
If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
Yes.
If a posterior distribution is returned for each sample, each element of the list column can be a tibble with as many rows as samples in the distribution.
Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.
When a predict method produces values over multiple tuning parameter values (e.g. glmnet), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda in glmnet).
I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a predict()
method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to define predict()
in this case. If we are going to define fit at all, I think one sensibly option is:
predict.many_fits <- function(...) {
best_fit <- extract_best_fit(many_fits)
predict(best_fit)
}
to get predictions for different hyperparameter values I'd much prefer a workflow like:
map(many_fits, predict) # many_fits basically a list of fitted objects
But I think we should be really careful here.
If newdata is not supplied, an error should be thrown.
Strongly second this, and came here to recommend it. This is an issue because models often drop rows with missing data silently, which leads to all sorts of nasty surprises and after the fact attempts to determine precisely which rows got dropped.
Naming: I'm not a huge fan of pred. See also some conventions the Stan crew just went with. I think full singular nouns is the way to go.
How about .pred
, .pred_class
, .pred_{class levels}
? .predicted
or .prediction
is too long, especially when combined with class levels.
I think that ideally as much information as possible should be contained in the tibble itself.
I don't want to do that. It's replicating data that has a single value into a column that is potentially large.
Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.
That's one example of many where the prediction isn't a scalar. Even so, I've had applications where I've needed the full posterior distribution for each sample.
I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a
predict()
method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to definepredict()
in this case.
I'm only talking about doing it in the cases where the predict
method make simultaneous predictions across many parameters from a single fitted model object. That's the whole "submodel trick" that caret
uses for efficiency and it would be lost by repeated calling of predict
for each parameter value using the same model object (when the underlying predict code gives them all to you at once).
I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.
Not sure if this is still up for debate, and although I favored wide format in the past, more often than not I ended up going to long format for analysis, because it allowed me to standardize my workflow:: filter, group_by, summarise -> send to ggplot.
In the same vein as @alexpghayes point, I feel that it removes some of the weight in the decision making, i.e. it makes it easier to change ideas whenever, with a lower probability (or severity) of breaking stuff.
This one pops up on my dashboard every day, so I'll join the party ;)
Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default. This would also allow .prediction
as a column name, which is clearer than pred/.pred
. Also, we might not always have names of factor levels.
Also, we might not always have names of factor levels.
spark may not, but that will be illegal in the modeling new world order 😃
Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default.
They are probably going to end up wide in a lot of cases anyway.
In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list. tidyr
(or built in gather
/spread
methods) can solve going from wide -> long easily.
In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list.
These are definitely subjective points but imo
4-5 columns is one thing, but 1000 class probability columns doesn't feel as tidy, and
True but I don't want extreme cases to drive the bus
It's easier to specify the contract/expectations of what you get if you have long rather than wide data.
Unless you are going to merge these results with the original data (which is pretty common). Plus, you already have wide data anyway with classes and probabilities. If you stack the probabilities then you have a class column that is replicated. Why duplicate data?
Oh I didn't mean we should stack probabilities (although upon re-reading it does sound like it!). I'm thinking about appending .prediction
, .predicted_class
, and .class_probabilities
to the dataset for classification. Having to write paste0("pred_", lvl)
to get the right column feels brittle, although practically it likely won't matter. I guess going down this line of reasoning, we'd also have .prediction_interval
and friends (and .prediction
if we have multiple outputs) as list columns, and this may deviate too much from what folks are used to. So yeah, if we're going to do lower
and upper
, it would be consistent to have pred_setosa
and pred_versicolor
. 😄
I think that parsnip
will have functions to make predictions across submodels (maybe predict_submodel
?). I think that I'm leaning towards restricting predict
to produce a single model prediction per object.
Similarly, add predict_interval
that will produce .pred_lower
and .pred_upper
and also add predict(object, newdata, type = "interval")
that references it. I'll prototype it with lm
, stan
, and mars
.
<edit>
This would be in the service about predict
being restricted to returning the same number of rows as newdata
.
So for predict_boost_tree()
would it have a type
argument as well so you can specify class
vs prob
? And then it would just have a default type
based off mode
if no type
is specified, like the current predict.model_fit()
has?
And you could call predict(boost_fit, newdata, type = "whatever")
which would just call predict_boost_tree()
?
The separate prediction functions per submodel might be a nice way to modularize any nuances that might come up for the prediction methods of a specific model type, and would be pretty extensible.
So for
predict_boost_tree()
would it have atype
argument as well so you can specifyclass
vsprob
? And then it would just have a defaulttype
based offmode
if notype
is specified, like the currentpredict.model_fit()
has?
Close. There is not predict_boost_tree
, just predict.model_spec
. I'm about to merge in a devel branch that shows how predict
will pass off to different functions. I should just merge that it.
And you could call
predict(boost_fit, newdata, type = "whatever")
which would just callpredict_boost_tree()
?
It will call predict.model_spec
use use type
to redirect.
Prediction norms have been fairly standardized for us now. Thanks to all for your thoughts! 🙌
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
I've been reworking the
augment()
methods and it's rapidly becoming clear that dealing with idiosyncraticpredict()
methods is going to slow down progress immensely.In the end
broom
andparsnip
are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistencies sake. I think we should move this functionality to a separate package. Potentially we could use theprediction
package by Thomas Leeper, but we should decide on the behavior we want first.If we define a new generic / series of generics, we can then test these behaviors in
modeltests
and allow other modelling package developers to guarantee that theirpredict()
methods are sane and consistent.What I want from a predict method:
predict.lm(..., na.action = na.pass)
)I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.