Add model() and estimate() generics

mitchelloharawild commented 5 years ago

These generics are used within fable models for producing a single model (estimate()) or a collection of models (model()) from model definitions.

To avoid "namespace grabbing", these generics should probably be moved to generics before the first CRAN release of the fable packages.

I'm also happy to open a dialogue for how these generics should be used, with regards to their default arguments.

Relevant discussion detailing the usage of model(), and to a lesser extent estimate(): https://github.com/tidyverts/fable/issues/91

topepo commented 5 years ago

Those seem pretty reasonable but I think that it would be good to have the generic for estimate only involve the second argument to be less restrictive.

We also have the convention to make the first argument x whenever possible. That might seem like we are being uptight, but these generics are meant to be used in a broad context. For example:

estimate <- function(.data, .model){
  UseMethod("estimate")
}

I would want to use this on an existing model object (where the data have already been consumed). For example, I wrote an S package for Shewhart charts a long time ago and wanted a generic that would return estimates of the process mean and variance (mean and var weren't generic back then). This signature would preclude something like that.

Are you doing double dispatch on these two objects? If not, you wouldn't lose anything by using a single argument.

mitchelloharawild commented 5 years ago

Completely agree with the usage of estimate(). Actually I changed this for the same reason last week in https://github.com/tidyverts/fablelite/commit/d34f1c7c33e716d58fc4b3d773eb3b97f975a72a.

The only reason it was so restrictive, was because I had originally planned to use this functionality only internally.

I don't feel too strongly about the name of the first argument, however I think using .data will restrict the context of the generic in a beneficial way. In what scenarios do you anticipate estimate() to be used without data as the first argument? Consistency for the usage of this generic may result in less cognitive load for the users.

topepo commented 5 years ago

In what scenarios do you anticipate estimate() to be used without data as the first argument

One thing that I'd use it for is unsupervised methods. So if I have an object with PCA loadings, I'd use estimate(object, new_data = df) to get projections for new data points.

mitchelloharawild commented 5 years ago

Sure, x or object is fine. So generics for model(x, ...) and estimate(x, ...)?

How general do you think the documentation of their functionality should be? Should it distinguish functionality between these generics, or should the methods have flexibility to use them inconsistently.

For example, in fable these generics would be used as follows: model.tbl_ts(x, ...) trains multiple model definitions to data, where x is a tsibble, and ... are the model definitions. estimate.tbl_ts(x, .model, ...) trains a single model definition to data, where x is a tsibble and .model is the model definition. ... is unused.

Having some recommended usage of these verbs would make it easier for users to learn their functionality, it would also make them less flexible.

topepo commented 5 years ago

I'd suggest x for both. I wouldn't really get too specific about how we think that these should be used. I think that the doc files can give examples of what existing methods do.

mitchelloharawild commented 5 years ago

Sounds reasonable. I'll work on this a bit and make a PR.

hadley commented 5 years ago

I think it's most important that you give some thought to the type signature of the generic — i.e. what does it return? Does it return a data frame? A tibble? An object of the same type as x?

mitchelloharawild commented 5 years ago

The implementation in the fable series of packages is:

model(.data, ...)

Returns a mable object (a tibble with model attributes).
Rows of models are identified by groups of the input .data (keys+groups in a tsibble).
Columns of models are specified in the ....
Cells are the result from a call to estimate() with the appropriate data split and model definition.
summarise-esque semantics. Respects groups and reduces data into summary statistics (model fit parameters).
Input is a tsibble (tibble), output is a mable (tibble)

estimate(.data, .model, ...)

Returns a model object (a list containing the model specification, response, transformation, and the fit object (result from model training method).
Input is a tsibble (tibble), output is a model (list)

edit: Using estimate() is discouraged, but is exported to allow users access to the lower level objects if they're particularly inquisitive. It also makes the nest-map-unnest workflow better if they're uncomfortable with using model().

These functions dispatch on a data object, and so if a similar approach is supported for cross-sectional modelling there would not be many more methods required. So I think the purpose of this generic may be less about consistent functionality, but more about avoiding namespace conflicts.

You could also argue that estimate() should dispatch on .model rather than .data, which could make it easier to define model training methods. Currently fable keeps the model's training method in the R6 class for the model definition.

r-lib / generics

Add model() and estimate() generics #42