hubverse-org / hubEnsembles

Ensemble methods for combining hub model outputs.
https://hubverse-org.github.io/hubEnsembles/

`linear_pool` method #20

Closed elray1 closed 10 months ago

elray1 commented 1 year ago

We would like to have a linear opinion pool method. Here's a proposed function signature:

```r
#' Compute ensemble model outputs as a linear pool, otherwise known as a
#' distributional mixture, of component model outputs for
#' each combination of model task, output type, and output type id. Supported
#' output types include `mean`, `quantile`, `cdf`, `pmf`, and `sample`.
#'
#' @param model_outputs an object of class `model_output_df` with component
#'   model outputs (e.g., predictions).
#' @param weights an optional `data.frame` with component model weights. If
#'   provided, it should have a column named `model_id` and a column containing
#'   model weights. Optionally, it may contain additional columns corresponding
#'   to task id variables, `output_type`, or `output_type_id`, if weights are
#'   specific to values of those variables. The default is `NULL`, in which case
#'   an equally-weighted ensemble is calculated.
#' @param weights_col_name `character` string naming the column in `weights`
#'   with model weights. Defaults to `"weight"`
#' @param model_id `character` string with the identifier to use for the
#'   ensemble model.
#' @param task_id_cols `character` vector with names of columns in
#'   `model_outputs` that specify modeling tasks. Defaults to `NULL`, in which
#'   case all columns in `model_outputs` other than `"model_id"`, the specified
#'   `output_type_col` and `output_type_id_col`, and `"value"` are used as task
#'   ids.
#' @param output_type_col `character` string with the name of the column in
#'   `model_outputs` that contains the output type.
#' @param output_type_id_col `character` string with the name of the column in
#'   `model_outputs` that contains the output type id.
#' @param ... parameters that are passed to `distfromq::make_r_fn`, specifying
#'   details of how to estimate a quantile function from provided quantile
#'   levels and quantile values.
#'
#' @return a data.frame with columns `model_id`, one column for
#'   each task id variable, `output_type`, `output_type_id`, and `value`. Note
#'   that any additional columns in the input `model_outputs` are dropped.
#'
#' @export
linear_pool <- function(model_outputs, weights = NULL,
                        weights_col_name = "weight",
                        model_id = "hub-ensemble",
                        task_id_cols = NULL,
                        output_type_col = "output_type",
                        output_type_id_col = "output_type_id",
                        ...) {
```

The operation will vary by the `output_type`:
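For `mean`, `cdf`, and `pmf` output types, the natural operation is a pointwise (weighted) average of the component values sharing an `output_type_id`; `quantile` requires the interpolation round trip discussed below, and `sample` handling is taken up in later comments. As a minimal equal-weight sketch (the helper name and data layout are illustrative, not the final implementation):

```r
library(dplyr)

# Illustrative only: equal-weight linear pool for the directly averageable
# output types (mean, cdf, pmf), for a single modeling task.
# `component_outputs` is assumed to have columns model_id, output_type,
# output_type_id, and value.
lop_average <- function(component_outputs, ensemble_id = "hub-ensemble") {
  component_outputs |>
    group_by(output_type, output_type_id) |>
    summarize(value = mean(value), .groups = "drop") |>
    mutate(model_id = ensemble_id, .before = 1)
}
```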

elray1 commented 1 year ago

We'd like to think about how we might incorporate trimming here: trimming either the most extreme models at each point on the horizontal axis, or the most extreme models at each point on the vertical axis. We would like either an approach that is consistent across the different output types, or a way to specify which kind of trimming to do, with validation that the combination of output type and trimming type is supported.
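As a toy illustration of exterior trimming at a single grid point (the helper is hypothetical; the same operation would be applied at each quantile level or each CDF evaluation point, depending on the trimming direction):

```r
# Hypothetical helper: drop the n_trim most extreme component values at one
# grid point before averaging (n_trim / 2 from each tail).
trim_and_average <- function(values, n_trim = 2) {
  stopifnot(n_trim %% 2 == 0, n_trim < length(values))
  sorted <- sort(values)
  mean(sorted[(n_trim / 2 + 1):(length(sorted) - n_trim / 2)])
}

# e.g., five models' values at one quantile level, trimming one from each tail:
trim_and_average(c(10, 12, 13, 15, 40), n_trim = 2)  # mean of 12, 13, 15
```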

From Emily: here's the paper where trimming is proposed, and my paper on LOP vs. Vincenti:

elray1 commented 1 year ago

Noting that after discussion with Emily today, we decided to build around distfromq for quantile interpolation, possibly adding functionality to distfromq for simpler, faster methods if the spline-based method is too slow.
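For reference, the quantile-to-samples round trip with distfromq would look roughly like this (a sketch assuming distfromq's `make_r_fn(ps, qs)` interface; any estimation details would be forwarded via `...` from `linear_pool`):

```r
# assumes distfromq is installed (https://github.com/reichlab/distfromq)
ps <- c(0.1, 0.25, 0.5, 0.75, 0.9)  # quantile levels from one component model
qs <- c(2.3, 3.1, 4.0, 5.2, 6.9)    # corresponding quantile values

# estimate the distribution underlying the quantiles and draw samples from it;
# pooling draws across component models and re-extracting quantiles yields the
# linear pool's quantiles
r_fn <- distfromq::make_r_fn(ps = ps, qs = qs)
draws <- r_fn(1e4)
quantile(draws, probs = ps)
```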

elray1 commented 11 months ago

I'm going to record some ongoing questions about how to handle samples in the linear pool ensemble function.

First, note that in the simplest case where we have (1) equal weights for all models, (2) the same number of samples from each component model, (3) no limit on the number of samples the ensemble is allowed to produce, and (4) no desire to enforce any particular (or consistent) dependence structure on the ensemble samples, the ensembling operation is straightforward: we simply collect all of the samples from the component models and update the sample index (specified by the output_type_id) to be distinct for samples from different component models. However, the operation is more challenging if any of these simplifying conditions is not satisfied. I'll describe considerations for each in the points below (a sketch of the pooling logic follows the list):

  1. unequally-weighted models: If the models have unequal weights, I see two ways forward:

    i. Take a weighted sample from the component model samples. The problem with this is that it introduces extra Monte Carlo noise/variability. If we do this, should it just be a random sample, or is there a more systematic way to do it? (See the sketch after this list.)
    ii. Keep track of the model weights along with the samples, so that we have a weighted sample. The problem with this is that it requires extra data structure, but in a statistical sense it seems like the right thing to do. Estimates of quantities derived from the samples would need to be adjusted in the natural way to account for the sample weights, but the resulting estimators would have lower variance than ones based on a weighted resample as in point i, since we would have avoided injecting extra Monte Carlo variability.
  2. differing number of samples from component models: If the component models have equal weight but provide different numbers of samples, we effectively need to give the samples from those models different weight, and then we're back at the considerations under point 1 above.

  3. limit on the number of samples the ensemble is allowed to produce: Suppose that a hub enforces a limit of 1,000 samples per model, and an ensemble combines predictions from 10 models. If each component model submits 1,000 samples, the naive aggregation strategy will result in an ensemble with 10,000 samples. A hub may decide that this is OK for ensemble models and allow that submission anyways, but I could also see a hub wanting to avoid very large file sizes. In this case, the only way forward I can think of is to take a subsample of the samples provided by each component model. Should this just be done at random, or is there a more systematic way forward?

  4. desire to enforce a consistent dependence structure: Suppose we want our ensemble's samples to represent "trajectories", i.e. we want the samples to be draws from a joint distribution across forecast horizon within each combination of other task id variables like location. If each component model produces samples that capture at least that level of dependence, we are good to go, but if some of the component models provided samples from a separate marginal distribution at each horizon, there will be a problem. As a simple solution, we could just throw those samples out. As another alternative, we could potentially apply some approach like a Schaake shuffle to try to recover the desired dependence structure.
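To make the naive pooling and option 1.i concrete, here is a hedged sketch; the helper name and the `n_output_samples` cap (relevant to point 3) are hypothetical, and it assumes equal per-model sample counts when applying model weights:

```r
# Hypothetical sketch, not the proposed API. `samples` is assumed to have
# columns model_id, output_type_id (the sample index), and value; `weights`
# maps model_id to a normalized weight.
pool_samples <- function(samples, weights = NULL, n_output_samples = NULL) {
  # naive pooling: keep every draw, re-indexing so that sample ids from
  # different component models remain distinct
  samples$output_type_id <- paste(samples$model_id, samples$output_type_id,
                                  sep = "-")
  if (is.null(weights) && is.null(n_output_samples)) {
    return(samples)
  }
  # option 1.i (and point 3): weighted resample with replacement, which also
  # caps the ensemble's sample count at the cost of extra Monte Carlo noise
  if (is.null(n_output_samples)) {
    n_output_samples <- nrow(samples)
  }
  row_w <- if (is.null(weights)) {
    rep(1 / nrow(samples), nrow(samples))
  } else {
    # assumes equal per-model sample counts; otherwise divide each model's
    # weight by its number of samples
    weights$weight[match(samples$model_id, weights$model_id)]
  }
  keep <- sample(nrow(samples), size = n_output_samples, replace = TRUE,
                 prob = row_w)
  resampled <- samples[keep, ]
  resampled$output_type_id <- seq_len(nrow(resampled))
  resampled
}
```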

My basic question is: how many of these considerations should we address in this function? My first vote is that 4 seems out of scope for this function, but we might want to do something about 1, 2, and 3?

sbfnk commented 11 months ago
  1. The approach we took a while ago was to take a random weighted sample (your (i)). Your point on extra noise is well made, but to me it seems a small price to pay given that we'll end up with something that we can just use like any other set of predictive samples. By the way, feel free to re-use any of the code in that function if at all useful (not that it needs saying given that it's under the MIT license).
  2. Do we really always want to weigh them differently? Another option would be resampling with replacement, or sampling from each model to get the minimum number of samples available across models (sketched below).
  3. To me this is a problem that's outside the scope of the function you describe - the natural thing is to throw all the samples together. If any additional subsampling is to be done for storage reasons, that seems an external task.
  4. I agree with you that if the samples represent different distributions, it shouldn't be for this function to resolve this.
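A quick sketch of the equal-size subsampling idea from Seb's point 2 (hypothetical helper; assumes a `samples` data frame with a model_id column):

```r
# Hypothetical: draw an equal-size subsample (without replacement) from each
# component model so that no model is over-represented in the pool
downsample_to_min <- function(samples) {
  n_min <- min(table(samples$model_id))
  keep <- unlist(lapply(split(seq_len(nrow(samples)), samples$model_id),
                        function(rows) rows[sample.int(length(rows), n_min)]))
  samples[keep, ]
}
```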
elray1 commented 11 months ago

I'm on board with all of Seb's suggestions here (noting for 2 that I think of resampling with replacement as a strategy for dealing with weighting as in 1.i.)

eahowerton commented 11 months ago

I agree with everything that has been stated so far. A few additional comments about how I've seen samples used/presented (most relevant to (3)).

elray1 commented 11 months ago

good points, Emily. Some quick thoughts, essentially agreeing with you:

elray1 commented 11 months ago

So that we can soon merge PR Infectious-Disease-Modeling-Hubs/hubUtils#26 with some good working functionality, I'm proposing that we move two pieces of functionality we've discussed for this function out to separate issues: handling of samples, and trimming. I have filed Infectious-Disease-Modeling-Hubs/hubUtils#27 and Infectious-Disease-Modeling-Hubs/hubUtils#28 for these.