hubverse-org / hubEnsemblesManuscript

https://htmlpreview.github.io/?https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsemblesManuscript/blob/master/analysis/paper/hubEnsembles_manuscript.html

clarify comments about differences between take-aways for MAE and WIS #62

Closed elray1 closed 2 months ago

elray1 commented 3 months ago

in section 5 we write "All of the ensemble models tend to have similar MAE values during the entire evaluation time period, with slight divergence in MAE values for certain weeks at the four week ahead horizon (@fig-mae-vs-forecast-date). However, the models show greater differences for the other two metrics, WIS and coverage, particularly during times of rapid change in the observed incident hospitalizations (@fig-wis-vs-forecast-date and @fig-cov95-vs-forecast-date)."

I don't see big qualitative differences in what shows up in the results for MAE and WIS though.

lshandross commented 3 months ago

While the results are similar across all three metrics, I wanted to get across that the median ensemble is better than the other models by a greater margin for WIS and coverage than for MAE. This can be seen by comparing figures 7-9, and it is also supported by the overall results shown in table 10.

elray1 commented 2 months ago

After a little more thought, I agree with the statement that the median ensemble is better than the linear pools by a greater magnitude for WIS and coverage than for MAE -- but I don't see the same thing for the median vs mean comparison, which I think is the comparison that drew my eye when I first looked at the results. In more detail (probably too much detail to include in the write up), here is what I see. (I'll put some supporting tables for the median vs mean statements below.)

median vs mean:

median vs linear pools:

If you agree with these statements, maybe we can sum up this slightly more nuanced take on the situation in ~2-3 sentences?

Supporting tables

Here are the results from table 10 in matrix form:

# mean WIS and MAE for each model, copied from table 10
results <- data.frame(
  wis = c(18.158, 19.745, 19.747, 20.18, 22.876),
  mae = c(27.36, 27.932, 27.933, 29.582, 29.315),
  row.names = c("median", "lp-normal", "lp-lognormal", "mean", "baseline")
) |>
  as.matrix()
results
                wis    mae
median       18.158 27.360
lp-normal    19.745 27.932
lp-lognormal 19.747 27.933
mean         20.180 29.582
baseline     22.876 29.315

We can reproduce the rel. wis and rel. mae values in table 10 with the baseline as the reference as follows:

sweep(results, 2, results["baseline", ], `/`)
                   wis       mae
median       0.7937576 0.9333106
lp-normal    0.8631317 0.9528228
lp-lognormal 0.8632191 0.9528569
mean         0.8821472 1.0091080
baseline     1.0000000 1.0000000

However, this makes a direct comparison of the relative performance of ensemble methods a little difficult (e.g., to directly compare median and mean, we still need to look at something like the ratio of rel. wis values for those models). Here are two other ways of processing that table.

  1. What if we look at degradation of performance of other methods relative to the median? First, using division:
sweep(results, 2, results["median", ], `/`)
                  wis      mae
median       1.000000 1.000000
lp-normal    1.087399 1.020906
lp-lognormal 1.087510 1.020943
mean         1.111356 1.081213
baseline     1.259830 1.071455
  2. The same comparison, but taking differences of mean scores rather than ratios (taking ratios seems to be an established method for comparing scores, but I'm not sure if there's a formal or theoretical reason for that):
sweep(results, 2, results["median", ], `-`)
               wis   mae
median       0.000 0.000
lp-normal    1.587 0.572
lp-lognormal 1.589 0.573
mean         2.022 2.222
baseline     4.718 1.955

In these last two tables, the difference in outcomes for median vs mean seems fairly similar for WIS and MAE. If we take ratios, the magnitude of difference seems larger according to WIS than it does according to MAE, but it's the other way around if we take differences. In both cases, I'm not sure whether the "difference in differences" is meaningful (Is 1.111 meaningfully different from 1.081? Is 2.022 meaningfully different from 2.222?).
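
Coming back to the parenthetical above about needing something like the ratio of rel. wis values to compare two models head to head: since the baseline reference cancels, that ratio is just the direct ratio of the mean scores. A quick sketch reusing the results matrix defined above:

rel_results <- sweep(results, 2, results["baseline", ], `/`)
# median vs mean head to head: roughly 0.90 for WIS and 0.92 for MAE,
# i.e. the reciprocals of the 1.111 and 1.081 entries in the ratio table above
rel_results["median", ] / rel_results["mean", ]
# identical to the direct ratio of mean scores, since the baseline cancels
results["median", ] / results["mean", ]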

lshandross commented 2 months ago

Hi Evan, thanks for your thoughtful investigation into this. I fully agree with your analysis and like the idea of revising the original paragraph to include a 2-3 sentence summary of the more nuanced take you've laid out here.

lshandross commented 2 months ago

Here's what I've come up with as a summary of the results you wrote up above (without any references to figures, but I'll add them later in the manuscript):

Generally, the median model can be seen to have the best scores for every metric. It outperforms the mean ensemble by a similar amount for both MAE and WIS, particularly around local times of change. The median ensemble also has better coverage rates in the tails of the distribution (95% intervals) and similar coverage in the center (50% intervals). The median model also outperforms the linear pools for most weeks, with the greatest differences in magnitude being for WIS and coverage rates. This seems to indicate that the linear pools' estimates are usually too conservative, with their high coverage rates being penalized by WIS. While during the 2022-2023 season there are several localized times when the linear pools showcase better one-week-ahead forecasts than the median ensemble, these localized instances are characterized by similar MAE values and poor median ensemble coverage rates. In these instances, the wide intervals from the linear pools were useful in capturing the eventually-observed hospitalizations, usually during times of rapid change.
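
As a side note on the "penalized by WIS" point, here's a tiny illustrative sketch (not code from the manuscript analysis) of the interval score that WIS is built from. Even when two central prediction intervals both cover the observation, the wider one scores worse because of the width term:

# interval score for a single central (1 - alpha) prediction interval [l, u]
# and observation y: the interval width plus penalties for missing low or high
interval_score <- function(l, u, y, alpha) {
  (u - l) +
    (2 / alpha) * (l - y) * (y < l) +
    (2 / alpha) * (y - u) * (y > u)
}

# both 95% intervals cover y = 100, but the wider one pays a larger width penalty
interval_score(l = 90, u = 110, y = 100, alpha = 0.05)   # 20
interval_score(l = 50, u = 150, y = 100, alpha = 0.05)   # 100

WIS averages this kind of score over several interval levels (plus the absolute error of the median), so consistently wide intervals raise WIS even when they achieve high coverage.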

What do you think, @elray1?

elray1 commented 2 months ago

Thanks for putting this together, Li; I think it looks very good. I've made some minor suggestions in the text below (my suggested changes are incorporated directly into the paragraph):

Averaging across all time points, the median model can be seen to have the best scores for every metric. It outperforms the mean ensemble by a similar amount for both MAE and WIS, particularly around local times of change. The median ensemble also has better coverage rates than the mean ensemble in the tails of the distribution (95% intervals) and similar coverage in the center (50% intervals). The median model also outperforms the linear pools for most weeks, with the greatest differences in scores being for WIS and coverage rates. This seems to indicate that the linear pools' estimates are usually too conservative, with their wide intervals and higher-than-nominal coverage rates being penalized by WIS. However, during the 2022-2023 season there are several localized times when the linear pools showcased better one-week-ahead forecasts than the median ensemble. These localized instances are characterized by similar MAE values for the two methods and poor median ensemble coverage rates. In these instances, the wide intervals from the linear pools were useful in capturing the eventually-observed hospitalizations, usually during times of rapid change.