Closed henningsway closed 3 years ago
This specific error was fixed in 683e8a9550105a85b087d6d0f999e081ceb53fcc; however, reconciling cross-validated forecasts is not yet possible.
This is because the key variable used to identify the cross-validation fold becomes part of the hierarchy. As there is no <aggregated> value for these folds (which is appropriate), this produces 'disjoint' hierarchies (where each branch, or fold, should be reconciled separately).
The relevant issue for this is here: #106
Hi Mitchell, I am working with @henningsway on this and we thought of a workaround: iterate over the chunks that stretch_tsibble or slide_tsibble generate and run the model, reconcile and accuracy steps on each chunk individually. Then average the error metrics over all chunks to get the overall error metric.
However, not all error metrics came out as expected. For example, ME, MAPE and CRPS averaged out to the correct overall value, but MASE and RMSSE did not. We suppose that one would have to pool the forecast errors of all chunks and then calculate the overall error metrics, instead of calculating the error metrics for each chunk and then averaging them.
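A minimal sketch of this per-fold workaround, in case it helps others following the thread. It assumes a small two-level hierarchy built from the tourism data and a hypothetical vector of fold cutoffs (`cutoffs`, not from the original post). The folds are built by filtering on time rather than with stretch_tsibble, so no fold identifier enters the key. It is an illustration under those assumptions, not a built-in feature of the package.

```r
library(fable)
library(tsibble)
library(dplyr)

# Small hierarchy (assumption): regions within one state, plus their total
hier <- tourism %>%
  filter(State == "Tasmania") %>%
  aggregate_key(Region, Trips = sum(Trips))

# Hypothetical fold cutoffs: the last training quarter of each fold
cutoffs <- yearquarter(c("2014 Q4", "2015 Q4", "2016 Q4"))

# Model, reconcile and evaluate each fold separately
fold_accuracy <- lapply(seq_along(cutoffs), function(i) {
  hier %>%
    filter(Quarter <= cutoffs[i]) %>%
    model(ets = ETS(Trips)) %>%
    reconcile(ets_adj = min_trace(ets)) %>%
    forecast(h = 1) %>%
    accuracy(hier)
})

# Per-fold RMSE of the reconciled forecasts, averaged over all series
rmse_by_fold <- sapply(fold_accuracy, function(a) {
  mean(a$RMSE[a$.model == "ets_adj"])
})
mean(rmse_by_fold)
```

Each element of `fold_accuracy` is the usual accuracy() table for one fold, so any further summarising across folds is done by hand.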
Or in other words, how exactly do the forecasts of a stretched tsibble get combined to arrive at one overall accuracy measure?
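To illustrate why some measures average across chunks and others do not, here is a small base R sketch with made-up numbers (the errors and training values are synthetic, and the scale uses non-seasonal first differences as a simplifying assumption). ME is linear in the errors, so the mean of per-chunk MEs equals the pooled ME when chunks have equal numbers of forecasts. RMSSE involves a square root and a scaling factor, so averaging per-chunk values generally gives a different number than pooling.

```r
# Synthetic one-step forecast errors, one per chunk (illustrative values only)
e <- c(3, -1)

# Synthetic training data: two consecutive chunks of one series
train <- list(c(10, 12, 11, 15), c(20, 26, 23, 31))

# Squared scale for RMSSE: mean squared first difference
# (non-seasonal naive benchmark, a simplifying assumption)
scale_sq <- function(y) mean(diff(y)^2)
rmsse <- function(err, scale) sqrt(mean(err^2) / scale)

# ME: per-chunk average versus pooled value
me_by_chunk <- sapply(e, mean)
me_pooled <- mean(e)
isTRUE(all.equal(mean(me_by_chunk), me_pooled))

# RMSSE: per-chunk values use chunk-specific scales,
# the pooled value uses one scale from the whole series
rmsse_by_chunk <- mapply(function(err, y) rmsse(err, scale_sq(y)), e, train)
rmsse_pooled <- rmsse(e, scale_sq(unlist(train)))
isTRUE(all.equal(mean(rmsse_by_chunk), rmsse_pooled))
```

The first comparison holds and the second does not, which matches the pattern described above.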
Could you elaborate on why you think the MASE and RMSSE error metrics are not accurate? Perhaps there is a problem or confusion about the scaling of these accuracy measures.
When forecasting a stretched tsibble, you will get separate forecasts for each fold of the tsibble. From there, you can compute a set of accuracy() measures for the forecast errors using the test set. Typically these accuracy measures would be summarised into a single value (across the folds of the stretched tsibble) using a mean or median.
Typically these accuracy measures would be summarised into a single value (across the folds of the stretched tsibble) using a mean or median.
This is exactly what we tried, but it appears that with stretch or slide some values are averaged differently; see the example below.
library(fable)
library(tsibble)
library(dplyr)

# just one time series
test_data <- tourism %>%
  filter(Region == "Adelaide",
         State == "South Australia",
         Purpose == "Business")

# create two non-overlapping chunks of 39 rows each
fc_slide <- test_data %>%
  slide_tsibble(.step = 39, .size = 39) %>%
  model(ets = ETS(Trips)) %>%
  forecast(h = 1) %>%
  accuracy(test_data) # , measures = list(distribution_accuracy_measures))

# first 39 rows
fc_1 <- test_data %>%
  filter(Quarter < yearquarter("2007 Q4")) %>%
  model(ets = ETS(Trips)) %>%
  forecast(h = 1) %>%
  accuracy(test_data) # , measures = list(distribution_accuracy_measures))

# second 39 rows
fc_2 <- test_data %>%
  filter(Quarter >= yearquarter("2007 Q4"),
         Quarter <= yearquarter("2017 Q2")) %>%
  model(ets = ETS(Trips)) %>%
  forecast(h = 1) %>%
  accuracy(test_data) # , measures = list(distribution_accuracy_measures))

# compare with a tolerance rather than exact floating-point equality
isTRUE(all.equal((fc_1$ME + fc_2$ME) / 2, fc_slide$ME))          # --> True
isTRUE(all.equal((fc_1$RMSSE + fc_2$RMSSE) / 2, fc_slide$RMSSE)) # --> False
@robjhyndman I think I asked you about this before, but I couldn't find the answer. When computing scaled accuracy measures over folds of a cross-validated dataset, is it more appropriate to use the same scaling factor or a scaling factor specific to each fold?
I would use the same scaling factor computed over the whole data set. Otherwise it just adds another source of variability.
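A rough base R sketch of what a common scaling factor means in practice, using made-up numbers (the series and errors are synthetic, and the scale uses non-seasonal first differences as a simplifying assumption). With one scale computed over the whole series, the mean of per-fold MASE values equals the pooled MASE, provided each fold contributes the same number of forecasts.

```r
# Whole series (synthetic) and one forecast error per fold (synthetic)
y <- c(10, 12, 11, 15, 20, 26, 23, 31)
e_folds <- list(3, -1)

# One scaling factor for all folds: mean absolute first difference
# of the whole series (non-seasonal naive benchmark, an assumption)
common_scale <- mean(abs(diff(y)))

# MASE per fold and pooled across folds, both with the common scale
mase_by_fold <- sapply(e_folds, function(err) mean(abs(err)) / common_scale)
mase_pooled <- mean(abs(unlist(e_folds))) / common_scale

isTRUE(all.equal(mean(mase_by_fold), mase_pooled))
```

RMSSE still will not average exactly across folds even with a common scale, because the square root is taken after the errors are pooled.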
Closing, as the scaling factor used in accuracy() is the more appropriate one, and reconciliation of cross-validated models will be added in #106.
I've recently been working regularly with fable and the package has been a joy to work with, thank you!
I currently would like to use tsibble::stretch_tsibble (is there an alternative? I'm wondering about the "questioning" lifecycle tag) to evaluate reconciled forecasts. It seems to be working for sliding windows, but for stretched tsibbles I run into an error. Please find a reproducible example below.
Created on 2021-02-03 by the reprex package (v0.3.0)