signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

evaluate initial model performance #5

Closed: vpnagraj closed this issue 3 years ago

vpnagraj commented 3 years ago

in the long run we need to establish a way to rigorously assess model performance relative to submissions by other participants in the COVID-19 Forecast Hub.

in the short term, we need to put together an evaluation framework (which may provide the starting point for the above?) to compare candidate models for our initial implementation.

as part of this task we'll probably want to set a horizon back in time (say, n weeks prior to the current week) and evaluate our n-week-ahead forecasts against "ground truth" data for the target(s) in those weeks. we'll also need to pull forecast data from other participants, filter to the same historical horizon, and compute the absolute difference between those forecasts and the ground truth data. from there, we can rank our model(s) against each other and against the field.
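for concreteness, a very rough sketch of the comparison step (the tibbles and column names below are purely illustrative, not actual focustools objects):

```r
library(dplyr)

## hypothetical inputs:
##   our_forecast: point forecasts issued n weeks ago (columns: target, week, point)
##   truth:        observed values for the same targets (columns: target, week, observed)
scored <- our_forecast %>%
  inner_join(truth, by = c("target", "week")) %>%
  mutate(abs_error = abs(point - observed))

## summarize error by target across the evaluated weeks
scored %>%
  group_by(target) %>%
  summarise(mae = mean(abs_error), .groups = "drop")
```

the same join/score step could be repeated for every other team's submission to build the ranking.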

no doubt there are other ways to approach this. open to suggestions ...

stephenturner commented 3 years ago

There's a bit on evaluating model accuracy in the FPP3 book, https://otexts.com/fpp3/evaluation.html

With an example of the accuracy() function for evaluating time series forecasts at http://fable.tidyverts.org/articles/fable.html.

I've used MAPE in some MOOC I did on forecasting inventory/pricing. MASE was new to me. https://en.wikipedia.org/wiki/Mean_absolute_scaled_error

But yeah, agreed with your general approach; it's similar to what we've been doing visually in our "scratch" scripts: going back in time four weeks, forecasting, and comparing to actual truth data. "Comparing" heretofore has just been visual. We could formalize this.
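A minimal sketch of what formalizing that might look like with fable's accuracy(), assuming a hypothetical weekly tsibble `usa_weekly` with a yearweek index `epiweek` and an incident case column `icases` (names made up, not our actual objects):

```r
library(fable)
library(dplyr)

## hold out the last four weeks for testing
train <- usa_weekly %>% filter(epiweek <= max(epiweek) - 4)

## fit a couple of candidate models on the training window
fit <- train %>%
  model(
    arima = ARIMA(icases),
    ets   = ETS(icases)
  )

## forecast the held-out horizon and score point accuracy
## (accuracy() returns RMSE, MAE, MAPE, MASE, etc. by default)
fc <- fit %>% forecast(h = "4 weeks")
accuracy(fc, usa_weekly)
```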

vpnagraj commented 3 years ago

@chulmelowe i'm bumping this your direction.

some musings below.

a couple of ways to think about accuracy:

  1. we can look at training/testing data to evaluate individual model performance
  2. we can look back and compute forecasts for horizons that have already occurred ... then compare the forecast results to "ground truth" and/or other forecasts

the first approach will help us pick a modeling strategy for submission. the second will help us communicate how well that strategy performs.

for the first point, we can look at accuracy metrics like RMSE (some exploratory code here: https://github.com/signaturescience/focustools/blob/master/scratch/fable-accuracy-scratch.R#L53-L88)

for the second point, i outlined one idea above:

> as part of this task we'll probably want to set a horizon back in time (say, n weeks prior to the current week) and evaluate our n-week-ahead forecasts against "ground truth" data for the target(s) in those weeks. we'll also need to pull forecast data from other participants, filter to the same historical horizon, and compute the absolute difference between those forecasts and the ground truth data. from there, we can rank our model(s) against each other and against the field.

lots of other ideas to explore with overall forecast accuracy ... there's literature out there related to probabilistic forecast scoring:

Reich, N. G., Osthus, D., Ray, E. L., Yamana, T. K., Biggerstaff, M., Johansson, M. A., Rosenfeld, R., & Shaman, J. (2019). Reply to Bracher: Scoring probabilistic forecasts to maximize public health interpretability. Proceedings of the National Academy of Sciences of the United States of America, 116(42), 20811–20812. https://doi.org/10.1073/pnas.1912694116

Bracher, J. (2019). On the multibin logarithmic score used in the FluSight competitions. Proceedings of the National Academy of Sciences of the United States of America, 116(42), 20809–20810. https://doi.org/10.1073/pnas.1912147116
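for reference, a bare-bones, hand-rolled sketch of the interval score / weighted interval score (WIS) idea from that line of work ... not pulled from any package, so treat it as illustrative only:

```r
## interval score for a central (1 - alpha) prediction interval
interval_score <- function(y, lower, upper, alpha) {
  (upper - lower) +
    (2 / alpha) * pmax(lower - y, 0) +
    (2 / alpha) * pmax(y - upper, 0)
}

## weighted interval score from a median and a set of central intervals
## (lower, upper, alpha are equal-length vectors, one entry per interval)
wis <- function(y, med, lower, upper, alpha) {
  K <- length(alpha)
  (0.5 * abs(y - med) +
     sum((alpha / 2) * interval_score(y, lower, upper, alpha))) / (K + 0.5)
}

## toy example: observed 100, median forecast 90, 50% and 90% intervals
wis(y = 100, med = 90, lower = c(80, 60), upper = c(110, 130), alpha = c(0.5, 0.1))
```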

the weekly C19FH reports (example: https://covid19forecasthub.org/reports/2021-01-05-weekly-report.html) evaluate the absolute difference between predicted and observed values. we could do something similar for each submitted forecast (that overlaps with our targets) and rank performers on absolute accuracy?
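something like this, roughly (the tibble, column names, and team label below are made up):

```r
library(dplyr)

## hypothetical input: one row per team x target x horizon,
## with the submitted point forecast and the eventually observed value
rankings <- all_submissions %>%
  mutate(abs_diff = abs(point - observed)) %>%
  group_by(target, horizon) %>%
  mutate(rank = rank(abs_diff), n_teams = n()) %>%
  ungroup()

## where do we land in the field?
rankings %>%
  filter(team == "our-team") %>%
  select(target, horizon, rank, n_teams)
```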

like i said ... lots of ways to slice this!

vpnagraj commented 3 years ago

@chulmelowe not trying to rush this by any means, but my curiosity got the better of me and i started poking around with some code for ranking our model vs the other teams:

https://github.com/signaturescience/focustools/commit/9a5509c715bca87a83ff4a7a4ba92565ea660791

interesting results for our rank relative to all submitters in terms of absolute difference between observed and predicted:

| horizon | target  | rank         |
|--------:|---------|--------------|
|       1 | cdeaths | 35 out of 45 |
|       1 | icases  | 22 out of 36 |
|       1 | ideaths | 40 out of 45 |
|       2 | cdeaths | 10 out of 44 |
|       2 | icases  | 22 out of 35 |
|       2 | ideaths |  7 out of 44 |
|       3 | cdeaths |  3 out of 44 |
|       3 | icases  | 19 out of 34 |
|       3 | ideaths |  1 out of 43 |
|       4 | cdeaths | 15 out of 43 |
|       4 | icases  |  6 out of 32 |
|       4 | ideaths | 27 out of 42 |

a visual of the same:

[image: plot of our rank by target and horizon]

a few impressions:

- our performance / ranking will vary week-to-week as we continue to submit forecasts
- is there a way to incorporate quantile estimates as opposed to just point estimates?
- is underestimating "worse" than overestimating? vice versa? or do we just consider absolute difference as we do here?

lots more to consider. let's keep this thread active as we chew through the best way to capture our performance relative to other forecasts.

stephenturner commented 3 years ago

Slick. First place for ideaths at week 3 👍. Not bad for most of the others. This gets at what the IRAD committee wanted to see, so let's show them in the next report. I look at this viz and I get it, but I think we could do better. I can't think of how at the moment, but we can keep working on that.

> our performance / ranking will vary week-to-week as we continue to submit forecasts

Maybe overthinking here... if we do this kind of thing in a sliding window, going backward one week at a time, and we find that we consistently over- or under-estimate a particular target at a particular horizon, might we consider tracking that consistent over/under-performance, taking a moving average of that offset, and applying the resulting "penalty" to our final modeled results?
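Something like this, maybe (all object and column names below are hypothetical):

```r
library(dplyr)
library(zoo)

## past_forecasts: our prior point forecasts joined to observed truth
## (columns: forecast_week, target, horizon, point, observed)
k <- 4  # width of the sliding window, in submission weeks

offsets <- past_forecasts %>%
  mutate(error = point - observed) %>%          # positive = we overestimated
  group_by(target, horizon) %>%
  arrange(forecast_week, .by_group = TRUE) %>%
  mutate(offset = zoo::rollmeanr(error, k = k, fill = NA)) %>%
  ungroup()

## a "penalized" forecast would then subtract the most recent offset
## from the new point forecast for that target/horizon
```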

> is there a way to incorporate quantile estimates as opposed to just point estimates?

Surely there's a way: lessons from FluSight, or the quantile score section in the FPP3 book.

> is underestimating "worse" than overestimating? vice versa? or do we just consider absolute difference as we do here?

🤷 ... I could make a case either way, but not strongly enough to justify moving away from absolute difference. Squaring the difference would keep everything positive and would penalize larger differences more, but it wouldn't change rankings unless we did something like summing over weeks.
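If we ever did want the asymmetric version, the quantile (pinball) loss is the standard way to make it explicit. A toy illustration, not tied to our code:

```r
## pinball loss for quantile level tau: under-prediction costs tau per unit,
## over-prediction costs (1 - tau) per unit
pinball <- function(y, q_hat, tau) {
  ifelse(y >= q_hat, tau * (y - q_hat), (1 - tau) * (q_hat - y))
}

## at the 0.9 quantile, underestimating by 10 costs more than overestimating by 10
pinball(y = 100, q_hat = 90,  tau = 0.9)   # 9
pinball(y = 100, q_hat = 110, tau = 0.9)   # 1
```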

vpnagraj commented 3 years ago

i think we're good to close out this issue. after all, we're no longer evaluating initial model performance.

we can manage more specific tasks (e.g., incorporating the WIS) elsewhere as needed.