robjhyndman opened 2 years ago
Bumping this. I need the accuracy function for something I'm writing.
This unexpected behaviour is due to how the scaling term for MASE is computed. The `by` argument is used to group the data passed to the accuracy functions, and when `by = ".model"` is used, the 'grouped' training data for MASE doesn't make much sense (it combines the training data of each `Origin`).
It's not clear to me how best to handle this for MASE and other measures which might use information that doesn't match the indices of the forecasts (such as training data). For your expected results, you would want to pass multiple scaling terms (one for each series) to `MASE()`, but I think this verges on doing something too complicated in a function that may be surprising in other situations.
Another related consideration is what the correct MASE scaling factor is if you're grouping by a season (say, accuracy by each quarter of the year). Would the correct scaling factor for MASE be based on the training data of each quarter? If the seasonality induces substantially different scales for each quarter, I think that could be reasonable.
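To make the distinction concrete, here is a minimal base-R sketch (with made-up data; the variable names and quarterly grouping are purely illustrative, not the fabletools internals) of a pooled scaling factor versus per-quarter scaling factors:

```r
# Hypothetical training series; the non-seasonal MASE scaling term is the
# in-sample mean absolute error of the naive forecast, i.e. mean(|diff(train)|)
train <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)

# Pooled scaling factor over the whole training set
scale_all <- mean(abs(diff(train)))

# Per-quarter scaling: the same quantity computed within each quarter's
# observations (quarters recycled over the series for illustration)
quarter <- rep(1:4, length.out = length(train))
scale_by_quarter <- tapply(train, quarter, function(x) mean(abs(diff(x))))
```

If the quarters genuinely sit on different scales, `scale_by_quarter` will differ noticeably from `scale_all`, which is the situation where per-group scaling could matter.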
So if anything, I think when `accuracy()` tries to prepare the `.train` input for accuracy measures like MASE, and the input combines data from multiple keys, it should give `NA`, as the training data is not unique to a single series. :shrug:
OK. Supposing we do want to compute the MASE with a different scaling factor per series, what is the simplest way to compute it?
Compute the MASE with `accuracy()` separately for each series, then take the mean of the results.
```r
library(fpp3)
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✓ tibble    3.1.6     ✓ tsibble     1.1.1
#> ✓ dplyr     1.0.7     ✓ tsibbledata 0.4.0.9000
#> ✓ tidyr     1.1.4     ✓ feasts      0.2.2.9000
#> ✓ lubridate 1.8.0     ✓ fable       0.3.1.9000
#> ✓ ggplot2   3.3.5.9000
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> x lubridate::date()    masks base::date()
#> x dplyr::filter()      masks stats::filter()
#> x tsibble::intersect() masks base::intersect()
#> x tsibble::interval()  masks lubridate::interval()
#> x dplyr::lag()         masks stats::lag()
#> x tsibble::setdiff()   masks base::setdiff()
#> x tsibble::union()     masks base::union()
fc <- aus_arrivals %>%
  filter(year(Quarter) <= 2009) %>%
  model(
    ets = ETS(Arrivals),
    arima = ARIMA(Arrivals)
  ) %>%
  forecast(h = 11)
z <- fc %>%
  accuracy(aus_arrivals) %>%
  select(.model, Origin, MASE)
z %>%
  group_by(.model) %>%
  summarise(MASE = mean(MASE))
#> # A tibble: 2 × 2
#>   .model  MASE
#>   <chr>  <dbl>
#> 1 arima   2.08
#> 2 ets     1.75
```
Created on 2022-02-07 by the reprex package (v2.0.0)
I think an alternative approach that would work here is to reconsider how accuracy measures are specified. Instead of always reducing the output to a single summary value, it could be more valuable to expose the measure before combination.
For instance, `MAE` would be composed of a `mean` of an `absolute_error`. Then `mean` can be swapped out for `median` if you like. `MASE` would be composed of `mean` and `absolute_scaled_error`, and so on.
This would allow us to compute the `f(errors)` by series, and then combine them with `mean`/`median`/`summary_fn` grouped in accordance with the `by` argument.
An alternative `summary_fn` that could be of interest is `dist_sample`, which would allow you to look at the distribution of the errors used in computing the summarised accuracy measure.
Yes, that could be better. The problem with just averaging the results, as in your first suggestion, is that it only works for means; medians, for example, would not work.
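A quick base-R illustration of why: with equal group sizes, the mean of per-group means recovers the overall mean, but the median of per-group medians generally does not recover the overall median.

```r
# Two "series" of errors with equal length (toy numbers)
a <- c(1, 2, 9)
b <- c(3, 4, 5)

mean(c(mean(a), mean(b)))        # 4   -- equals mean(c(a, b))
median(c(median(a), median(b)))  # 3   -- does NOT equal...
median(c(a, b))                  # 3.5 -- ...the pooled median
```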
Right. I'll try to rework the `accuracy()` interface to compute accuracy measures in the two-stage approach mentioned above.
This doesn't look right:
Created on 2022-01-12 by the reprex package (v2.0.1)
The second table should be the average of the results by model in the first table.