Method to ensemble via direct summarization of component predictions

elray1 commented 1 year ago

This proposal is for a function that calculates ensemble predictions by directly summarizing component predictions. The function supports weighted summaries (e.g., weighted mean or median), but requires that the weights have been calculated previously.

Here's a proposed function definition block:

#' Compute ensemble predictions by summarizing component predictions for each
#' combination of model task, output type, and type id. Supported output types
#' include `mean`, `median`, `quantile`, `cdf`, and `category`.
#' 
#' @param predictions a `data.frame` with component model predictions. It should
#'   have columns `team_abbr`, `model_abbr`, one column for each task id
#'   variable, `type`, `type_id`, and `value`.
#' @param task_id_vars an optional `character` vector naming the columns of
#'   `predictions` that correspond to task id variables. The default is `NULL`,
#'   in which case the task id variables are looked up from the `hub_con` if one
#'   is provided. If neither `task_id_vars` nor `hub_con` are provided, all
#'   columns in `predictions` _other than_ `team_abbr`, `model_abbr`,
#'   `type`, `type_id`, and `value` will be used as task id
#'   variables.
#' @param hub_con an optional hub connection object; see `hubUtils::connect_hub`
#' @param weights an optional `data.frame` with component model weights. If
#'   provided, it should have columns `team_name`, `model_abbr`, `weight`,
#'   and optionally, additional columns corresponding to task id variables,
#'   `output_type`, or `output_id`, if weights are specific to values of
#'   those variables. The default is `NULL`, in which case an equally-weighted
#'   ensemble is calculated.
#' @param agg_fun a function or character string name of a function to use for
#'   aggregating component predictions into the ensemble prediction. The default
#'   is `mean`, in which case the ensemble prediction is the simple average of
#'   the component model predictions. The provided function should have an
#'   argument `x` for the vector of numeric values to summarize, and for weighted
#'   methods, an argument `w` with a numeric vector of weights.
#' @param agg_args a named list of any additional arguments that will be passed
#'   to `agg_fun`.
#' @param team_abbr, model_abbr `character` strings with the name of the team
#'   and model to use for the ensemble predictions.
#'
#' @return a data.frame with columns `team_abbr`, `model_abbr`, one column for
#'   each task id variable, `output_type`, `output_id`, and `value`.
ensemble <- function(predictions, task_id_vars = NULL, hub_con = NULL,
                     weights = NULL, agg_fun = mean, agg_args = list(),
                     team_abbr = "Hub", model_abbr = "ensemble")

The core of the logic would look something like this (completely untested code):

forecast_data %>%
    dplyr::left_join(weights) %>%
    dplyr::group_by(all_of(c(task_id_vars, “type”, “type_id”))) %>%
    dplyr::summarize(value = do.call(agg_fun, args = c(agg_args, list(x=value, w=weight)))) %>%
    dplyr::mutate(team_abbr = team_abbr, model_abbr = model_abbr)

Some things that the function should validate include:

predictions has the expected columns; if not, throw an error.
If provided, weights has the expected columns; if not, throw an error.
The output_types are supported. In the current documenation here, supported output types include mean, median, quantile, cdf, category. Unsupported output types include sample.
- How should we handle a situation where the input predictions contain an unsupported output type? Throw an error? Throw a warning, drop the unsupported output types, and compute ensemble predictions for the supported output types?

nickreich commented 1 year ago

This all looks good to me. I don't have a strong feeling about handling of unsupported output types. Maybe a slight lean towards just throwing an error.

eahowerton commented 1 year ago

I think the logic currently assumes that agg_fun will be applied to the same output type as what is provided in forecast_data (e.g., if output_type = quantile and agg_fun = mean, this function will calculate the mean value at each quantile). In SMH, we have output_type = quantile but when we aggregate, we calculate a (weighted) mean of the values (or the cdf type output type), not the quantiles.

Generally, I'm thinking about quantile and cdf output types as two ways of discretizing a CDF, so any aggregation method for CDFs could be applied to either. But, I recognize that this disrupts the simple logic considerably.

elray1 commented 1 year ago

Thanks, @eahowerton. This is a good point. Just trying to restate it in a different way, one way of thinking about the issue is that some ensemble methods require that we don't group by the output_id; in your example, if we have quantile forecasts but want to use a linear pool, the ensemble method will need to see predictions from all quantile levels to estimate the component cdfs that will be combined.

I discussed this offline briefly with Nick and Anna, and we thought that while we could in principle imagine a way to modify the logic I outlined above to handle this kind of situation, it would make the logic quite a bit more complex (as Emily also noted). Additionally, there are other differences between the kinds of methods I outlined here and methods like a linear pool: as one example, it's not clear that the same output_types are supported; I'm not sure that a linear pool is very meaningful as an ensemble method if only a mean is provided.

Our proposal is to plan for the package to provide several smaller functions that implement specific methods rather than writing one general function that handles all special cases:

Keep the relatively simple function I outlined above, perhaps renaming it to something like simple_ensemble (I'm not very happy with that name, but couldn't immediately think of anything better. I'm open to suggestions!)
Then, we could write separate functions for other ensemble methods that don't easily fit into the pattern above. For example, maybe we would have a linear_pool function.
At some point later on, consider writing a higher-level ensemble function that's a high-level wrapper to any lower-level ensembling functions like those in the first two points. It might have an argument like method that specifies which of those lower-level functions to call.

elray1 commented 1 year ago

I've been continuing to think about this a bit, and it seems to me that there are a few things here that are worth more discussion/thinking:

Maybe we should use ellipses ... to pass arguments to the agg_fun, rather than a named list agg_args. I think this is closer to an R convention.
Might it be worth allowing for a more flexible specification of the column names? The proposal above hard-codes the use of “output_type”, “output_id”, and "value" for the relevant column names. However, if we let these be specified as arguments to the function (with those values as defaults), I think the functionality would then directly be applicable to existing hubs (e.g. US COVID Forecast Hub) that don't necessarily use those column names.
I've been thinking more about what the right API/organization is to support instances where there are estimated model weights. I think in R, the most standard organization is to have two functions: one that estimates model parameters, and a second that is a predict method. e.g., think of fit <- lm(y ~ x); predict(fit, newdata). It seems like it might be a good idea to build around this general pattern. The function I specified above is more like a predict function than a model-fitting function (it assumes the weights have already been estimated, and generates ensemble predictions). Might we want to rename it to predict.our_ensemble_S3_class and make another function more analogous to lm that does fitting (if necessary) and creates an S3 object of class out_ensemble_S3_class that contains the weights and can be passed into a function like the one I outlined here?

elray1 commented 1 year ago

r.e. 3 -- it occurred to me that the other prevalent convention in R is functions like weighted.mean. The original proposal is more like those in spirit than lm/predict.

nickreich commented 1 year ago

Noting that we might want to add an argument that specifies whether or not to include forecasts that have only subsets of specific task-id variables. E.g. should you include a forecast for one location that only has 2 horizons when 4 is expected/requested for the ensemble? But note that it may be ok to have a model that has left out one location and we wouldn't want all of that models' forecasts to be excluded entirely.

Vector of task-id variables that are required to be complete for a given set of other task id vars? this may need more thinking.

nickreich commented 1 year ago

Consider instrumenting the ensemble function to print out (if verbose=TRUE?) info like:

"ensemble built for task id combination xxx-yyy-zzz: 3 component models"

nickreich commented 1 year ago

Should we have an argument that specifies a minimum number of component models needed for a task-id combination?

elray1 commented 1 year ago

I would suggest to have separate functions that do forecast validation/subsetting and ensembling calculations:

These are kind of distinct steps in an ensembling workflow. For example, people may want to make use of similar validation and subsetting operations but then apply different ensembling methods
Validation/subsetting operations can be "non-commutative" in surprising ways. For example, you can get different results if you first drop model/task id combinations without all quantiles and then drop model/task id combinations without all horizons vs if you do those filter operations in the opposite order. For example, suppose a model is missing one quantile level at horizons 3 and 4, but otherwise has complete forecasts. If we drop model/location/horizons without all quantile levels and then model/locations without all horizons, we'll lose that model. If we drop model/locations without all horizons and then model/location/horizons without all quantiles, we'll still have that model's forecasts at horizons 1 and 2. This can be non-transparent if the logic is hidden, and it might be better to force ensemblers to grapple with this directly, and also have the flexibility to specify the order they want.
A single function to do it all is the path we went down for covidEnsembles and I regretted it. Check out the number of arguments to the build_covid_ensemble function. Surely this could be streamlined with a better design for arguments, but I think a better way to streamline it is just to have the separate functions.
Note that we had previously outlined some thoughts about what the validation operations might look like (at a very high level) in a section called "Thoughts on data assembly, validation, and filtering" in this google doc. [That discussion is pretty old and likely needs to be updated, just pointing to it in case it's helpful.]

elray1 commented 1 year ago

r.e. the above discussion about whether or not this should look more like hubEnsembles -- thought I'd just note that that's the approach I've taken in the qens package.

The qens function acts much like lm, takes some training data and returns an S3 object of class <qens>
Then there is a predict.qens function that can be called to generate predictions.

I'm still personally waffling about whether or not I think that's the right way to go for the hubEnsembles package, but thought I'd point to related work in case it's helpful.

elray1 commented 1 year ago

OK, here's a proposal for how we can move forward, at least temporarily resolving the discussion/questions above: for now, let's start with something basically along the lines of what I initially wrote up. Perhaps we call it something like weighted_summary_ens or wtd_summary_ensemble or wtd_summary_ens or summary_ens (open to other suggestions for the name -- I don't love any of these -- is there a better word than summary that still captures the generic specification we've laid out?). This function essentially plays a role similar to that of weighted.mean or weighted.median, but with data frames of component forecasts and weights as inputs instead of numeric vectors.

Potentially later on, we'll write one or more functions that estimate weights. When we do that, it may be appropriate for such a function to return an S3 object with an associated predict method that is essentially a wrapper around the above [weighted] summary ensemble function.

elray1 commented 1 year ago

closed via #4

hubverse-org / hubEnsembles

Method to ensemble via direct summarization of component predictions #1