elray1 commented 1 year ago

We pretty much settled on the desired functionality in the discussion on issue #20, but I'm splitting the implementation into a separate issue.

elray1 commented 1 year ago

More detailed ideas about handling samples. See comments here and below where we have talked about this before.

Previous decision: This function will not do any validations related to potentially different dependence structures represented by the component models. If a hub cares about enforcing that models use the same dependence structure for their samples, this will be specified in the hub's config, and will be checked at the time that model outputs are submitted to the hub, so we don't need to do validations related to this in this function.

There are two cases.

Case 1: collect component samples

If all three of the following conditions are satisfied, this function simply collects the samples from the component models and updates the sample indices so that they are distinct across component models:

- equal weights for all models,
- the same number of samples from each component model,
- no limit on the number of samples the ensemble is allowed to produce.

Case 2: do some sampling

However, if any of those conditions are not satisfied, we have to do something else:

Draw a sample of the specified size from the collection of all component model samples. We should try to do this in a way that minimizes the amount of extra Monte Carlo variability that is introduced by sampling:
- Ensure that each model is represented in the output according to its model weight. That is, for each component model we should get a number of samples that is (approximately) equal to the model weight times the desired number of ensemble samples. If there is a remainder, that can be distributed among the models at random.
- To get samples from a component model, two steps:
  - If the number of samples we want to get from a model (target_n_component_samples) is larger than the number of samples that model provided (provided_n_component_samples), duplicate/replicate each of the samples from that model floor(target_n_component_samples / provided_n_component_samples) times. For example, if we want to get 25 samples from model A and model A provided 10 samples, floor(target_n_component_samples / provided_n_component_samples) = 2 and after this step we will have 2 copies of each of the 10 samples provided by that model.
  - Sample without replacement for the remainder. In this example, there will be 5 more samples to obtain for this model, so we choose 5 distinct samples provided by that model at random, without replacement.
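
The two steps above could be sketched roughly as follows. This is only an illustration, not the package's implementation: `allocate_samples` and `draw_from_model` are hypothetical helper names, `target_n` and `provided_n` stand in for target_n_component_samples and provided_n_component_samples, and normalizing the weights to sum to 1 and giving each randomly chosen model at most one leftover sample are assumptions.

```r
# Hypothetical helper: number of ensemble samples each component model contributes.
allocate_samples <- function(weights, n) {
  weights <- weights / sum(weights)  # assumption: normalize weights to sum to 1
  target <- floor(weights * n)
  remainder <- n - sum(target)
  if (remainder > 0) {
    # Distribute the leftover samples at random, at most one per model
    lucky <- sample.int(length(weights), size = remainder)
    target[lucky] <- target[lucky] + 1
  }
  target
}

# Hypothetical helper: choose which of a model's samples to use.
draw_from_model <- function(sample_ids, target_n) {
  provided_n <- length(sample_ids)
  n_copies <- floor(target_n / provided_n)
  n_extra <- target_n - n_copies * provided_n
  c(
    # Step 1: replicate every provided sample n_copies times
    rep(sample_ids, times = n_copies),
    # Step 2: fill the remainder with distinct samples, without replacement
    sample_ids[sample.int(provided_n, size = n_extra)]
  )
}

# Model A provided 10 samples and we want 25 from it
picked <- draw_from_model(sample_ids = 1:10, target_n = 25)
table(picked)  # five ids appear 3 times, the other five appear 2 times

# Three models with unequal weights, 7 ensemble samples in total
target <- allocate_samples(c(A = 0.2, B = 0.3, C = 0.5), n = 7)
sum(target)  # 7
```

Indexing with `sample.int()` rather than calling `sample()` on the ids directly avoids the surprise where `sample(x)` with a single number `x` draws from `1:x`.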

Notes about a new function argument related to desired ensemble sample size

The sampling step in case 2 requires the user to specify how many ensemble samples they want in the output. So we need to add an argument to linear_pool allowing the user to specify this, e.g. n. But we want to allow for n = NULL, to say that if possible, the function should just collect the component model samples as in Case 1 above. Let's set n = NULL as the default, but then throw an error if we end up in case 2 and the user did not provide an integer value of n.

We can have 3 separate validations related to this:

- n must either be NULL or coercible to an integer.
- If weights are provided, n must be an integer. Error text: "Component model weights were provided, so a number of ensemble samples n must be provided."
- If component models provided differing numbers of samples within any group defined by a combination of task ids, n must be an integer. Error text: "Component models provided differing numbers of samples within at least one forecast task id group, so a number of ensemble samples n must be provided."
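
A rough sketch of how these three checks might fit together is below. `validate_n` is a hypothetical helper, not an existing hubEnsembles function; the `weights_provided` flag and the layout of `sample_counts` (one row per task id group and model, with columns `task_group` and `n_samples`) are assumptions made for illustration, as are the wording of the first error message and the choice to reject non-whole numbers such as 2.5.

```r
# Hypothetical validation helper for the proposed `n` argument.
validate_n <- function(n, weights_provided, sample_counts) {
  if (!is.null(n)) {
    # Check 1: n must be a single value coercible to an integer
    n_num <- suppressWarnings(as.numeric(n))
    if (length(n) != 1 || is.na(n_num) || n_num != round(n_num)) {
      stop("n must be NULL or coercible to an integer.")
    }
    return(as.integer(n_num))
  }
  # From here on n is NULL, so we must be able to stay in Case 1
  if (weights_provided) {
    stop("Component model weights were provided, so a number of ensemble ",
         "samples n must be provided.")
  }
  counts_differ <- tapply(
    sample_counts$n_samples, sample_counts$task_group,
    function(x) length(unique(x)) > 1
  )
  if (any(counts_differ)) {
    stop("Component models provided differing numbers of samples within at ",
         "least one forecast task id group, so a number of ensemble samples ",
         "n must be provided.")
  }
  NULL
}

counts_equal <- data.frame(
  task_group = c("g1", "g1", "g2", "g2"),
  model_id = c("A", "B", "A", "B"),
  n_samples = c(100, 100, 100, 100)
)
counts_unequal <- transform(counts_equal, n_samples = c(100, 50, 100, 100))

validate_n("10", weights_provided = FALSE, counts_equal)  # returns 10L
validate_n(NULL, weights_provided = FALSE, counts_equal)  # returns NULL (Case 1)
# validate_n(NULL, weights_provided = TRUE, counts_equal)     # error
# validate_n(NULL, weights_provided = FALSE, counts_unequal)  # error
```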

elray1 commented 1 year ago

Couple of additional thoughts about this:

- We don't really have support for samples built into the schemas or hubUtils yet. That means that right now, there’s no formal way to tell whether the output_type_id for samples should be an integer or a character. One option for this could be to just convert it to an integer, and then check the data type of the output_type_id column and convert to that. Or we could for now provide an argument to the linear_pool function to say what data type to use for sample output ids. Or maybe we should just hold off on building this functionality until we have support for it in the other tools?
- One thing we need to do is ensure that the output type ids are distinct for samples from different component models. Probably the simplest way to do that would be to just paste the input/component model_id together with the output_type_id provided by that model, and then do the as.integer(factor(…)) trick to convert these to distinct integers.
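
To make the re-indexing idea concrete, here is a minimal sketch. `reindex_sample_ids` is a hypothetical helper, the `"_"` separator is an arbitrary choice, and casting back to character when the input column was character is just one of the options floated above, not a settled decision.

```r
# Hypothetical helper: make sample ids distinct across component models.
reindex_sample_ids <- function(model_out_tbl) {
  old_id <- model_out_tbl$output_type_id
  # One key per (model, sample) pair; rows of the same sample share a key
  key <- paste(model_out_tbl$model_id, old_id, sep = "_")
  new_id <- as.integer(factor(key))
  # Assumption: keep the data type of the original output_type_id column
  if (is.character(old_id)) {
    new_id <- as.character(new_id)
  }
  model_out_tbl$output_type_id <- new_id
  model_out_tbl
}

component_samples <- data.frame(
  model_id = rep(c("A", "B"), each = 4),
  horizon = rep(1:2, times = 4),
  output_type = "sample",
  output_type_id = rep(c(1L, 1L, 2L, 2L), times = 2),
  value = c(3, 4, 5, 6, 2, 3, 7, 8)
)

reindexed <- reindex_sample_ids(component_samples)
# Models A and B both used ids 1 and 2; after re-indexing there are 4 distinct
# ids, and the two horizons of each original sample still share one id.
length(unique(reindexed$output_type_id))
```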

elray1 commented 4 months ago

I'm closing this issue in favor of #109 and #110

hubverse-org / hubEnsembles

handle sample types in `linear_pool` #27
