hubverse-org / hubEnsembles

Ensemble methods for combining hub model outputs.
https://hubverse-org.github.io/hubEnsembles/

`simple_ensemble` quantile crossing due to `weightedMedian` calculation issue #122

Closed lshandross closed 1 month ago

lshandross commented 1 month ago

When calculating a weighted median ensemble using simple_ensemble(), quantile crossing may occur because of a documented issue in matrixStats::weightedMedian(), which simple_ensemble() calls internally.

The discussion on that issue indicates that when x contains duplicate values, an unstable sort may swap the weights attached to the duplicates. With interpolate=TRUE, which is the default behavior for weightedMedian(), the swapped weights change the interpolation and produce an incorrect weighted median.

This is demonstrated by the following data, which should produce a weighted median of zero but actually produces a value of 0.0078:

| id | x | w |
|----|--------|------|
| 1 | -0.103 | 0.08 |
| 2 | -0.89 | 0.14 |
| 3 | 0 | 0.22 |
| 4 | 0 | 0.12 |
| 5 | 0.039 | 0.28 |
| 6 | 0.055 | 0.16 |

library(matrixStats)

id <- 1:6
x <- c(-0.103, -0.89, 0, 0, 0.039, 0.055)
w <- c(0.08, 0.14, 0.22, 0.12, 0.28, 0.16)

# produces incorrect value of 0.0078
weightedMedian(x, w, ties = NULL) # default behavior

# each of these produces the correct value of 0
weightedMedian(x, w, ties = NULL, interpolate = FALSE)
weightedMedian(x, w, ties = "weighted"); weightedMedian(x, w, ties = "mean")
weightedMedian(x, w, ties = "max"); weightedMedian(x, w, ties = "min")
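
To illustrate how swapped weights on the tied x values could lead to a value of about 0.0078, here is a minimal base R sketch. It assumes an interpolation scheme in which each sorted point sits at the midpoint of its cumulative-weight interval; that scheme is an assumption chosen because it reproduces the reported number, not something taken from the matrixStats source. The helper name `interp_wmedian_sorted` is hypothetical.

```r
# Hypothetical helper, not part of matrixStats: interpolated weighted median
# for inputs that are ALREADY sorted by x. Each point is placed at the
# midpoint of its cumulative-weight interval (an assumed scheme), and the
# median is read off by linear interpolation at cumulative weight 0.5.
interp_wmedian_sorted <- function(x_sorted, w_sorted) {
  w_norm <- w_sorted / sum(w_sorted)
  mid <- cumsum(w_norm) - w_norm / 2
  approx(mid, x_sorted, xout = 0.5)$y
}

x_sorted <- c(-0.89, -0.103, 0, 0, 0.039, 0.055)

# Weights in the order a stable sort would give (0.22 stays ahead of 0.12)
w_stable <- c(0.14, 0.08, 0.22, 0.12, 0.28, 0.16)
# Same weights with the two entries for the tied x = 0 swapped
w_flipped <- c(0.14, 0.08, 0.12, 0.22, 0.28, 0.16)

res_stable <- interp_wmedian_sorted(x_sorted, w_stable)    # approximately 0
res_flipped <- interp_wmedian_sorted(x_sorted, w_flipped)  # approximately 0.0078
```

Under this assumed scheme, the swap moves the second zero from cumulative position 0.50 to 0.45, so the interpolation runs one fifth of the way from 0 toward 0.039.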

The following reproducible example shows the same issue occurring in simple_ensemble(), using the data above:

library(hubEnsembles)

# one median (0.5 quantile) prediction per model
toy_outputs <- data.frame(
    model_id = letters[1:6],
    output_type = rep("quantile", 6),
    output_type_id = rep(0.5, 6),
    value = c(-0.103, -0.89, 0, 0, 0.039, 0.055)
)
# one weight per model, matched on model_id
weights <- data.frame(
    model_id = letters[1:6],
    weight = c(0.08, 0.14, 0.22, 0.12, 0.28, 0.16)
)

simple_ensemble(toy_outputs, weights, agg_fun = "median")
# A tibble: 1 x 4
# model_id      output_type  output_type_id    value
# <chr>         <chr>                 <dbl>    <dbl>
# hub-ensemble  quantile                0.5  0.00780

My proposal is to fix this issue in simple_ensemble() by adding interpolate = FALSE to agg_args whenever the user asks for a weighted median.
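
A rough sketch of that logic is given below. It is not the hubEnsembles implementation; the helper name `with_median_defaults` is hypothetical, and leaving a user-supplied `interpolate` value untouched is an assumed design choice.

```r
# Hypothetical helper (not hubEnsembles code): add interpolate = FALSE to the
# aggregation arguments when a weighted median is requested, unless the
# caller has already set interpolate explicitly.
with_median_defaults <- function(agg_fun, weights, agg_args = list()) {
  is_weighted_median <- identical(agg_fun, "median") && !is.null(weights)
  if (is_weighted_median && !("interpolate" %in% names(agg_args))) {
    agg_args$interpolate <- FALSE
  }
  agg_args
}

demo_weights <- data.frame(model_id = letters[1:2], weight = c(0.4, 0.6))

args_weighted_median <- with_median_defaults("median", demo_weights)
args_weighted_mean <- with_median_defaults("mean", demo_weights)
args_unweighted <- with_median_defaults("median", NULL)
args_user_set <- with_median_defaults("median", demo_weights,
                                      list(interpolate = TRUE))
```

Only the weighted median case gains the extra argument; weighted means and unweighted medians pass through unchanged.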

elray1 commented 1 month ago

Thanks for investigating, Li! I support setting interpolate = FALSE, which will also result in a more conventionally-understandable weighted median calculation.
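
For reference, one common definition of a weighted median without interpolation is the smallest x whose cumulative normalized weight reaches 0.5. The base R sketch below uses that definition; the helper name `wmedian_no_interp` is hypothetical, and matrixStats may resolve exact ties at 0.5 differently depending on its `ties` argument.

```r
# Hypothetical helper: weighted median without interpolation, taken as the
# smallest x whose cumulative normalized weight reaches 0.5.
wmedian_no_interp <- function(x, w) {
  ord <- order(x)
  cum_w <- cumsum(w[ord]) / sum(w)
  x[ord][which(cum_w >= 0.5)[1]]
}

x <- c(-0.103, -0.89, 0, 0, 0.039, 0.055)
w <- c(0.08, 0.14, 0.22, 0.12, 0.28, 0.16)

res_original <- wmedian_no_interp(x, w)
# Swap the weights of the two tied x = 0 entries
res_swapped <- wmedian_no_interp(x, w[c(1, 2, 4, 3, 5, 6)])
```

Both calls return 0 for this data, because the cumulative weight first reaches 0.5 at the second of the two zeros regardless of which tied weight comes first.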