tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/

Confusing error messages thrown by as_fable() #277

Closed wkdavis closed 3 years ago

wkdavis commented 4 years ago

Based on a Stack Overflow question, I believe the error messages in as_fable() could be more informative in some cases. In this example, the error says that the column must have type <distribution>, but that it instead has type <distribution>.

library(tsibbledata)
#> Warning: package 'tsibbledata' was built under R version 3.6.2
library(tsibble)
#> Warning: package 'tsibble' was built under R version 3.6.2
library(fable)
#> Warning: package 'fable' was built under R version 3.6.2
#> Loading required package: fabletools
#> Warning: package 'fabletools' was built under R version 3.6.2
library(fabletools)

aus <- tsibbledata::hh_budget

fit <-  fabletools::model(aus, ARIMA = ARIMA(Debt))

fc_tsibble <- fit %>% 
  fabletools::forecast(., h = 2) %>%
  as_tibble(.) %>% 
  tsibble::as_tsibble(., key = c(Country, .model), index = Year)

fc_tsibble
#> # A tsibble: 8 x 5 [1Y]
#> # Key:       Country, .model [4]
#>   Country   .model  Year        Debt .mean
#>   <chr>     <chr>  <dbl>      <dist> <dbl>
#> 1 Australia ARIMA   2017  N(215, 21)  215.
#> 2 Australia ARIMA   2018  N(221, 63)  221.
#> 3 Canada    ARIMA   2017   N(188, 7)  188.
#> 4 Canada    ARIMA   2018  N(192, 21)  192.
#> 5 Japan     ARIMA   2017 N(106, 3.8)  106.
#> 6 Japan     ARIMA   2018 N(106, 7.6)  106.
#> 7 USA       ARIMA   2017  N(109, 11)  109.
#> 8 USA       ARIMA   2018  N(110, 29)  110.

as_fable(fc_tsibble, response = ".mean", distribution = Debt)
#> Error: `fbl[[chr_dist]]` must be a vector with type <distribution>.
#> Instead, it has type <distribution>.

Created on 2020-09-28 by the reprex package (v0.3.0)

In reality, or at least in my experience, the problem is that the response argument is set to the .mean column when it should be set to the distribution column (Debt in this case).

as_fable(fc_tsibble, response = "Debt", distribution = Debt)
#> # A fable: 8 x 5 [1Y]
#> # Key:     Country, .model [4]
#>   Country   .model  Year        Debt .mean
#>   <chr>     <chr>  <dbl>      <dist> <dbl>
#> 1 Australia ARIMA   2017  N(215, 21)  215.
#> 2 Australia ARIMA   2018  N(221, 63)  221.
#> 3 Canada    ARIMA   2017   N(188, 7)  188.
#> 4 Canada    ARIMA   2018  N(192, 21)  192.
#> 5 Japan     ARIMA   2017 N(106, 3.8)  106.
#> 6 Japan     ARIMA   2018 N(106, 7.6)  106.
#> 7 USA       ARIMA   2017  N(109, 11)  109.
#> 8 USA       ARIMA   2018  N(110, 29)  110.

Created on 2020-09-28 by the reprex package (v0.3.0)

Between the error message and the documentation, I don't think it is clear why the response and distribution arguments should be set to the same column. More broadly, if the response variable must be a distribution (per the error message), what is the difference between the two arguments, and in what case would they ever be different columns?

mitchelloharawild commented 4 years ago

The response variable should match the name of the response variable for your model (not the point forecasts). You are correct that it seems a bit redundant at the moment, and may be removed in the near future. The response variable name will likely be stored as part of the distribution object, or we may require that the column name for the distributions matches the response variable name.
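One way to see what the two arguments refer to is to inspect the metadata that a fable produced by forecast() already carries. The following is a minimal sketch, assuming a fabletools version that exports response_vars() and distribution_var(); the object names fc, resp and dist_col are only illustrative.

```r
library(fable)        # also attaches fabletools
library(tsibbledata)

# Same model as in the reprex above
fit <- model(hh_budget, ARIMA = ARIMA(Debt))
fc  <- forecast(fit, h = 2)

# Name of the variable that was modelled (what `response` should match)
resp <- response_vars(fc)

# Name of the column holding the forecast distributions
# (what `distribution` should point to)
dist_col <- distribution_var(fc)

resp
dist_col
```

For a univariate model such as this one, both are expected to be "Debt", which is why the two arguments look redundant here.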

mitchelloharawild commented 3 years ago

The new behaviour is that specifying a response variable here will update the response variable in the distributions. This shouldn't be an issue anymore:

library(tsibbledata)
library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union
library(fable)
#> Loading required package: fabletools
library(fabletools)

aus <- tsibbledata::hh_budget

fit <-  fabletools::model(aus, ARIMA = ARIMA(Debt))

fc_tsibble <- fit %>% 
  fabletools::forecast(., h = 2) %>%
  as_tibble(.) %>% 
  tsibble::as_tsibble(., key = c(Country, .model), index = Year)

fc_tsibble
#> # A tsibble: 8 x 5 [1Y]
#> # Key:       Country, .model [4]
#>   Country   .model  Year        Debt .mean
#>   <chr>     <chr>  <dbl>      <dist> <dbl>
#> 1 Australia ARIMA   2017  N(215, 21)  215.
#> 2 Australia ARIMA   2018  N(221, 63)  221.
#> 3 Canada    ARIMA   2017   N(188, 7)  188.
#> 4 Canada    ARIMA   2018  N(192, 21)  192.
#> 5 Japan     ARIMA   2017 N(106, 3.8)  106.
#> 6 Japan     ARIMA   2018 N(106, 7.6)  106.
#> 7 USA       ARIMA   2017  N(109, 11)  109.
#> 8 USA       ARIMA   2018  N(110, 29)  110.

as_fable(fc_tsibble, response = ".mean", distribution = Debt)
#> # A fable: 8 x 5 [1Y]
#> # Key:     Country, .model [4]
#>   Country   .model  Year        Debt .mean
#>   <chr>     <chr>  <dbl>      <dist> <dbl>
#> 1 Australia ARIMA   2017  N(215, 21)  215.
#> 2 Australia ARIMA   2018  N(221, 63)  221.
#> 3 Canada    ARIMA   2017   N(188, 7)  188.
#> 4 Canada    ARIMA   2018  N(192, 21)  192.
#> 5 Japan     ARIMA   2017 N(106, 3.8)  106.
#> 6 Japan     ARIMA   2018 N(106, 7.6)  106.
#> 7 USA       ARIMA   2017  N(109, 11)  109.
#> 8 USA       ARIMA   2018  N(110, 29)  110.

Created on 2021-01-08 by the reprex package (v0.3.0)

wkdavis commented 3 years ago

@mitchelloharawild Thanks!

mitchelloharawild commented 3 years ago

No worries.

To clarify for future readers: the response argument has a right and a wrong value. The value passed to as_fable(response = <chr>) should match the name(s) of the response variable(s) in your data, not the name of the point forecast column.

So if you are using the above dataset, and predicting Debt (with ARIMA(Debt) in this case), then you should use as_fable(..., response = "Debt").
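The advice above can be sketched as a round trip: drop the fable class, rebuild it with as_fable(), and check the result. This is a hedged illustration rather than an official recipe; the names fc_tsibble and restored are hypothetical, and it assumes a fabletools version that exports is_fable() and response_vars().

```r
library(fable)        # also attaches fabletools
library(tsibble)
library(tsibbledata)

fit <- model(hh_budget, ARIMA = ARIMA(Debt))
fc  <- forecast(fit, h = 2)

# Drop the fable class, as in the reprex above, leaving a plain tsibble
fc_tsibble <- as_tsibble(
  as_tibble(fc),
  key = c(Country, .model),
  index = Year
)

# Rebuild the fable: `response` names the modelled variable,
# `distribution` names the column that holds the distributions
restored <- as_fable(fc_tsibble, response = "Debt", distribution = Debt)

# Check that the result is a fable and which response it records
is_fable(restored)
response_vars(restored)
```

If the model had been fitted to a differently named variable, that name would be the value to pass as response.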