Reconciliation of forecasts in stretched crossvalidation

henningsway commented 3 years ago

I've recently been working regularly with fable and the package has been a joy to work with, thank you!

I currently would like to use tsibble::stretch_tsibble (is there an alternative? I'm wondering about the "questioning" lifecycle tag) to evaluate reconciled forecasts. It seems to be working for sliding windows, but for stretched tsibbles I run into an error.

Please find a reproducible example below:

library(tidyverse)
library(tsibble)
library(fable)
#> Loading required package: fabletools

tourism_hts <- tourism %>%
  aggregate_key(State / Region, Trips = sum(Trips))

# reconciliation with sliding window - works
fc_slided <- tourism_hts %>% 
  filter(State == "Tasmania") %>% 
  slide_tsibble(.step = 8, .size = 60) %>%
  model(ets = ETS(Trips)) %>%
  reconcile(ets_rec = min_trace(ets)) %>%
  forecast(h = 4)

# reconciliation with stretching window - doesn't work
fc_stretched <- tourism_hts %>% 
  filter(State == "Tasmania") %>% 
  stretch_tsibble(.step = 8, .init = 60) %>%
  model(ets = ETS(Trips)) %>%
  reconcile(ets_rec = min_trace(ets)) %>%
  forecast(h = 4)
#> Error: Problem with `mutate()` input `ets_rec`.
#> x error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Join columns must be present in data.
#> x Problem with `date`.
#> i Input `ets_rec` is `(function (object, ...) ...`.

Created on 2021-02-03 by the reprex package (v0.3.0)

mitchelloharawild commented 3 years ago

This specific error was fixed in 683e8a9550105a85b087d6d0f999e081ceb53fcc; however, reconciling cross-validated forecasts is not yet possible.

This is because the key variable used to identify the cross validation fold becomes part of the hierarchy. As there is no <aggregated> value for these folds (which is appropriate), this produces 'disjoint' hierarchies (where each branch - or fold - should be reconciled separately).

The relevant issue for this is here: #106

claudiolaas commented 3 years ago

Hi Mitchell, I am working with @henningsway on this and we thought of a workaround: iterate over the chunks that stretch_tsibble or slide_tsibble generate and run the model-reconcile-accuracy steps on each chunk individually. Then average the error metrics over all chunks to get the overall error metric.

However, not all error metrics averaged out correctly. For example, ME, MAPE and CRPS averaged to the correct overall value, but MASE and RMSSE did not. We suppose that one would have to pool the residuals of all chunks and then calculate the overall error metrics, instead of calculating the error metrics for each chunk and then averaging them.

Or in other words, how exactly do the forecasts of a stretched tsibble get combined to arrive at one overall accuracy measure?
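A minimal sketch of the per-chunk workaround described above, assuming folds are built by filtering on hypothetical cutoff quarters rather than with stretch_tsibble, so that no fold identifier enters the key. The names tas_hts, cutoffs, fold_accuracy and fold are made up for illustration, and the cutoffs are chosen to mimic .init = 60, .step = 8 on the quarterly tourism data.

```r
library(dplyr)
library(purrr)
library(tsibble)
library(fable)

# Hierarchy for one state, as in the reprex above
tas_hts <- tourism %>%
  aggregate_key(State / Region, Trips = sum(Trips)) %>%
  filter(State == "Tasmania")

# Hypothetical fold end points (assumption: 60, 68 and 76 quarters of training data)
cutoffs <- yearquarter(c("2012 Q4", "2014 Q4", "2016 Q4"))

# Model, reconcile, forecast and score each fold separately
fold_accuracy <- map_dfr(seq_along(cutoffs), function(i) {
  tas_hts %>%
    filter(Quarter <= cutoffs[i]) %>%
    model(ets = ETS(Trips)) %>%
    reconcile(ets_rec = min_trace(ets)) %>%
    forecast(h = 4) %>%
    accuracy(tas_hts) %>%
    mutate(fold = i)
})

# Average the per-fold measures for each model
fold_summary <- fold_accuracy %>%
  group_by(.model) %>%
  summarise(across(c(ME, RMSE, MAE), mean), .groups = "drop")
```

Because each fold is reconciled on its own, this sidesteps the disjoint-hierarchy problem, at the cost of averaging per-fold measures by hand.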

mitchelloharawild commented 3 years ago

Could you elaborate on why you think the MASE and RMSSE error metrics are not accurate? Perhaps there is a problem or confusion about the scaling of these accuracy measures.

When forecasting a stretched tsibble, you will get separate forecasts for each fold of the tsibble. From there, you can compute a set of accuracy() measures for the forecast errors using the test set. Typically these accuracy measures would be summarised into a single value (across the folds of the stretched tsibble) using a mean or median.
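The workflow just described could look roughly like the sketch below. It assumes a single series from tsibble::tourism and uses the `by` argument of accuracy() to keep one row per fold; the object names series, cv_accuracy and cv_summary are illustrative only.

```r
library(dplyr)
library(tsibble)
library(fable)

# One series, to keep the example small
series <- tourism %>%
  filter(Region == "Adelaide",
         State == "South Australia",
         Purpose == "Business")

# Forecast each fold, then compute accuracy measures per fold
cv_accuracy <- series %>%
  stretch_tsibble(.init = 60, .step = 8) %>%
  model(ets = ETS(Trips)) %>%
  forecast(h = 4) %>%
  accuracy(series, by = c(".id", ".model"))

# Summarise across folds with a mean (a median would work the same way)
cv_summary <- cv_accuracy %>%
  group_by(.model) %>%
  summarise(across(c(RMSE, MAE, MASE), mean), .groups = "drop")
```

Passing the full `series` to accuracy() supplies both the test observations and the history used for the scaled measures.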

claudiolaas commented 3 years ago

"Typically these accuracy measures would be summarised into a single value (across the folds of the stretched tsibble) using a mean or median."

This is exactly what we tried, but it appears that by using stretch or slide some values get averaged out differently, see example below.

# just one time series
test_data <- tourism %>%
  filter(Region == "Adelaide",
         State == "South Australia",
         Purpose == "Business")

# create two non-overlapping chunks of 39 rows each
fc_slide <- test_data %>% 
  slide_tsibble(.step = 39, .size = 39) %>% 
  model(ets = ETS(Trips)) %>% 
  forecast(h = 1)  %>% 
  accuracy(test_data)#,measures = list(distribution_accuracy_measures))

# first 39 rows
fc_1 <- test_data %>%
  filter(Quarter < yearquarter("2007 Q4")) %>% 
  model(ets = ETS(Trips)) %>%
  forecast(h = 1) %>% 
  accuracy(test_data)#,measures = list(distribution_accuracy_measures))

# second 39 rows
fc_2 <- test_data %>%
  filter(Quarter >= yearquarter("2007 Q4"),
         Quarter <= yearquarter("2017 Q2")) %>% 
  model(ets = ETS(Trips)) %>%
  forecast(h = 1) %>% 
  accuracy(test_data)#,measures = list(distribution_accuracy_measures))

(fc_1$ME + fc_2$ME)/2 == fc_slide$ME # --> True

(fc_1$RMSSE + fc_2$RMSSE)/2 == fc_slide$RMSSE # --> False

mitchelloharawild commented 3 years ago

@robjhyndman I think I asked you about this before, but I couldn't find the answer. When computing scaled accuracy measures over folds of a cross-validated dataset, is it more appropriate to use the same scaling factor or a scaling factor specific to each fold?

robjhyndman commented 3 years ago

I would use the same scaling factor computed over the whole data set. Otherwise it just adds another source of variability.
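To illustrate the point about scaling, here is a small base-R sketch on made-up numbers. It is not the fabletools implementation: it assumes a simplified RMSSE with a lag-1 (non-seasonal) naive scaling factor, and the helpers scale_sq and rmsse as well as the toy folds are hypothetical.

```r
# Toy series and two folds with one forecast error each (illustrative numbers only)
y <- c(10, 12, 11, 15, 14, 18, 17, 21, 20, 24)
folds <- list(
  list(train = y[1:5], errors = 2),
  list(train = y[1:9], errors = -1)
)

# Simplified scaling factor: mean squared lag-1 difference of the training data
scale_sq <- function(train, lag = 1) mean(diff(train, lag = lag)^2)

# Simplified RMSSE for a set of forecast errors and a given scaling factor
rmsse <- function(errors, scale) sqrt(mean(errors^2) / scale)

# Fold-specific scaling: each fold is scaled by its own training data
per_fold <- sapply(folds, function(f) rmsse(f$errors, scale_sq(f$train)))

# Common scaling: every fold is scaled by the whole data set
common <- sapply(folds, function(f) rmsse(f$errors, scale_sq(y)))

# Pooling all errors before taking the square root, with the common scale
pooled <- rmsse(c(2, -1), scale_sq(y))

mean(per_fold)
mean(common)
pooled
```

The three values differ. The fold-specific scale changes with every training window, and even with a common scale the square root means that an average of per-fold RMSSE values is not the same as the RMSSE of the pooled errors. A linear measure such as ME has neither issue, which is consistent with it averaging cleanly in the example above.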

mitchelloharawild commented 3 years ago

Closing, as the scaling factor currently used in accuracy() is the more appropriate one, and reconciliation of cross-validated models will be added in #106.

tidyverts / fabletools, issue #305