tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/

Compute scaled errors for accuracy by series, then summarise. #342

Open robjhyndman opened 2 years ago

robjhyndman commented 2 years ago

This doesn't look right:

library(fpp3)

fc <- aus_arrivals %>%
  filter(year(Quarter) <= 2009) %>% 
  model(
    ets = ETS(Arrivals),
    arima = ARIMA(Arrivals)
  ) %>%
  forecast(h=11)
fc %>% 
  accuracy(aus_arrivals) %>%
  select(.model, Origin, MASE)
#> # A tibble: 8 × 3
#>   .model Origin  MASE
#>   <chr>  <chr>  <dbl>
#> 1 arima  Japan  2.95 
#> 2 arima  NZ     0.576
#> 3 arima  UK     2.31 
#> 4 arima  US     2.46 
#> 5 ets    Japan  2.90 
#> 6 ets    NZ     0.656
#> 7 ets    UK     1.90 
#> 8 ets    US     1.53
fc %>% 
  accuracy(aus_arrivals, by=".model") %>%
  select(.model, MASE)
#> # A tibble: 2 × 2
#>   .model  MASE
#>   <chr>  <dbl>
#> 1 arima   6.49
#> 2 ets     6.38

Created on 2022-01-12 by the reprex package (v2.0.1)

The second table should be the average of the results by model in the first table.
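As a quick sanity check (using only the rounded MASE values printed above, so the figures are approximate), the per-model averages of the by-series table are:

```r
# Averages of the by-series MASE values as printed (rounded) above.
# These are approximate because the printed values are rounded to 3 digits.
mean(c(2.95, 0.576, 2.31, 2.46))  # arima: roughly 2.07
mean(c(2.90, 0.656, 1.90, 1.53))  # ets:   roughly 1.75
```

Neither is anywhere near the 6.49 and 6.38 reported by `by = ".model"`.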

robjhyndman commented 2 years ago

Bumping this. I need the accuracy function for something I'm writing.

mitchelloharawild commented 2 years ago

This unexpected behaviour is due to how the scaling term for MASE is computed. The by argument groups the data passed to the accuracy measures, and when by = ".model" is used, the 'grouped' training data used for MASE doesn't make much sense: it concatenates the training data from every Origin.

It's not clear to me how best to handle this for MASE and other measures that use information which doesn't match the indices of the forecasts (such as training data). To get your expected results, you would need to pass multiple scaling terms (one per series) to MASE(), but I think that verges on doing something too complicated in a function that may be surprising in other situations.

A related consideration is what the correct MASE scaling factor should be when grouping by a season (say, accuracy by each quarter of the year). Should the scaling factor be based on the training data of each quarter alone? If the seasonality induces substantially different scales for each quarter, I think that could be reasonable.

So if anything, I think that when accuracy() prepares the .train input for measures like MASE and that input combines data from multiple keys, it should give NA, since the training data is not unique to a single series. :shrug:
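To make the combined-training-data problem concrete, here is a small sketch of a non-seasonal MASE scaling term (the in-sample MAE of the naive forecast) with made-up numbers. mase_scale and the training vectors are illustrative only, not fabletools internals:

```r
# Non-seasonal MASE scaling term: mean absolute first difference of the
# training data (the in-sample MAE of the naive forecast).
mase_scale <- function(train) mean(abs(diff(train)))

train_japan <- c(1000, 1100, 1050, 1200)  # made-up training values
train_nz    <- c(50, 55, 52, 60)

# Per-series scales differ by more than an order of magnitude...
mase_scale(train_japan)  # 100
mase_scale(train_nz)     # about 5.33

# ...so a scale computed on the concatenated training data (which is what
# by = ".model" effectively induces) matches neither series: the artificial
# jump at the join dominates the differences.
mase_scale(c(train_japan, train_nz))
```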

robjhyndman commented 2 years ago

OK. Supposing we do want to compute the MASE with a different scaling factor per series, what is the simplest way to compute it?

mitchelloharawild commented 2 years ago

Compute MASE with accuracy() separately for each series, then take the mean of the results.

library(fpp3)

fc <- aus_arrivals %>%
  filter(year(Quarter) <= 2009) %>% 
  model(
    ets = ETS(Arrivals),
    arima = ARIMA(Arrivals)
  ) %>%
  forecast(h=11)
z <- fc %>% 
  accuracy(aus_arrivals) %>%
  select(.model, Origin, MASE)
z %>% 
  group_by(.model) %>% 
  summarise(MASE = mean(MASE))
#> # A tibble: 2 × 2
#>   .model  MASE
#>   <chr>  <dbl>
#> 1 arima   2.08
#> 2 ets     1.75

Created on 2022-02-07 by the reprex package (v2.0.0)

mitchelloharawild commented 2 years ago

I think an alternative approach that would work here is to reconsider how accuracy measures are specified. Instead of always reducing the output to a single summary value, it could be more valuable to expose the pointwise measure before it is combined.

For instance, MAE would be composed of a mean of an absolute_error. The mean could then be swapped out for a median if you like. MASE would be composed of mean and absolute_scaled_error, and so on.

This would allow us to compute f(errors) by series, and then combine the results with mean/median/summary_fn grouped in accordance with the by argument.

An alternative summary_fn that could be of interest is dist_sample, which would allow you to look at the distribution of the errors used in computing the summarised accuracy measure.
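A rough sketch of what such a two-stage interface might look like. All names here (absolute_scaled_error, pointwise) are hypothetical, not the current fabletools API:

```r
# Stage 1: a pointwise error function that returns a vector, one value per
# forecast, rather than a single summary.
absolute_scaled_error <- function(e, train) abs(e) / mean(abs(diff(train)))

e     <- c(2, -3, 7)          # made-up forecast errors
train <- c(10, 12, 11, 13)    # made-up training data

pointwise <- absolute_scaled_error(e, train)

# Stage 2: combine with any summary function, grouped per the `by` argument.
mean(pointwise)    # the usual MASE
median(pointwise)  # swap in median with no other changes
```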

robjhyndman commented 2 years ago

Yes, that could be better. The problem with simply averaging the results, as in your first suggestion, is that it only works for means. Medians, for example, would not work.
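A tiny illustration of this point with made-up numbers: for equal-sized groups the mean of per-group means equals the overall mean, but the median of per-group medians is generally not the overall median.

```r
a <- c(1, 2, 100)  # made-up per-series errors
b <- c(3, 4, 5)

mean(c(mean(a), mean(b)))       # equals mean(c(a, b)) for equal-sized groups
median(c(median(a), median(b))) # 3
median(c(a, b))                 # 3.5 -- not recoverable from per-group medians
```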

mitchelloharawild commented 2 years ago

Right. I'll try to rework the accuracy() interface to compute accuracy measures with the two-stage approach described above.