tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/
90 stars 31 forks source link

response(<mdl_ts>) is unable to backtransform from parameters existing in the data #301

Open mitchelloharawild opened 3 years ago

mitchelloharawild commented 3 years ago

Should the model object also store these parameters? From #103

library(fable)
#> Loading required package: fabletools
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
uad <- as_tsibble(USAccDeaths)
uad %>% 
  bind_cols(features(uad, value, feasts::guerrero)) %>% 
  model(SNAIVE(box_cox(resp(value), lambda_guerrero))) %>% 
  response()
#> Error: Problem with `mutate()` input `response`.
#> x object 'lambda_guerrero' not found
#> ℹ Input `response` is `map(.fit, response)`.

Created on 2021-01-08 by the reprex package (v0.3.0)

ltsaprounis commented 3 years ago

Thanks @mitchelloharawild! This will be a very powerful feature when we get it right. I'm trying to compare results between forecasts where I'm using box-cox with the lambda from guererro and no transformation. A couple of related issues below:

1 - You can't have this type of transformation along with a simple/no transformation in the same mdl_df

fit <- df_train %>% 
  model(guerrero_arima = ARIMA(box_cox(Turnover, lambda = lambda_guerrero)),
       arima = ARIMA(Turnover)
        )
#> Error: A mable can only contain models with the same response variable(s).

2 - You can't evaluate accuracy on the test set and it doesn't give an error on the train-set accuracy (although the back-transform has not been applied)

library(dplyr)
library(tsibble)
library(fabletools)
library(feasts)
library(fable)
library(tsibbledata)

df <- tsibbledata::aus_retail %>%
  filter(State == "Australian Capital Territory")

TEST_H <- 6
df_train <- df %>% filter(Month <= max(df$Month)-6)

# extract the optimal lambda for box-cox using the guerrero method
lambda_g <- df_train %>% features(Turnover, features = guerrero)
df_train <- df_train %>% left_join(lambda_g, by = c("State", "Industry"))

# fit data with box cox and lambda from guererro
fit <- df_train %>% 
  model(guerrero_arima = ARIMA(box_cox(Turnover, lambda = lambda_guerrero))
        )

fcast <- fit %>% forecast(h=TEST_H) 

eval_train <- accuracy(fit, measures = list(mape=MAPE, rmse=RMSE, mase=MASE, rmsse=RMSSE))

eval_test <- fcast %>%
  accuracy(
    data = df, measures = list(mape=MAPE, rmse=RMSE, mase=MASE, rmsse=RMSSE) 
  )
#> Error: Problem with `mutate()` input `.actual`.
#> x object 'box_cox(Turnover, lambda = lambda_guerrero)' not found
#> ℹ Input `.actual` is `box_cox(Turnover, lambda = lambda_guerrero)`.

I thought this was a promising work-around but it failed miserably.

fcast <- fit %>% 
  forecast(h=TEST_H) %>% 
  left_join(lambda_g, by = c("State", "Industry")) %>%
  rename(transformed_Turnover = "box_cox(Turnover, lambda = lambda_guerrero)") %>%
  mutate(Turnover = inv_box_cox(transformed_Turnover, lambda = lambda_guerrero)) %>%
  mutate(.mean = inv_box_cox(.mean, lambda = lambda_guerrero))
#> Error: Problem with `mutate()` input `Turnover`.
#> x `vec_proxy_compare.distribution()` not supported.
#> ℹ Input `Turnover` is `inv_box_cox(transformed_Turnover, lambda = lambda_guerrero)`.
mitchelloharawild commented 3 years ago

Thanks for the additional examples. As far as I am aware, these are all rooted in the response() method - but I'll check that they work once this is fixed. Your attempt to directly transform the distribution is not yet implemented as the inv_box_cox() code will attempt to evaluate <dist> < 0 - https://github.com/mitchelloharawild/distributional/issues/16 You can create a transformed distribution manually using dist_transformed(<dist>, ...).

rando-brando commented 2 years ago

To jump on the bandwagon regarding this powerful feature. I tried to create a very simple transformation that takes the index column as a second input in order to normalize the value column by days in the period with the same errors. It is unfortunate because such as transformation would have worked if x was of class 'ts'. Maybe there is a work around in the meantime but I haven't figured it out.

period_days <- function(x, i){
  f = guess_frequency(i)
  if (f == 12) {
    x <- x /  lubridate::days_in_month(i)
  }
  else if (f == 4) 
  {
    x <- x / days_in_quarter(i)
  }
  else if (f == 1)
  {
    x <- x / days_in_year(i)
  }
  else {
    x
  }
}
inv_period_days <- function(x, i){
  f = guess_frequency(i)
  if (f == 12) {
    x <- x * lubridate::days_in_month(i)
  }
  else if (f == 4) 
  {
    x <- x * days_in_quarter(i)
  }
  else if (f == 1)
  {
    x <- x * days_in_year(i)
  }
  else {
    x
  }
}

trans_period_days <- new_transformation(transformation = period_days, inverse = inv_period_days)
rando-brando commented 2 years ago

Edit: this approach is actually not correct So fortunately I figured out a useful workaround by trying to imitate the first(lambda_guerrero) example. My first problem was that above transformation returns a ts which raises an error as it requires a vector. My second problem, as we already know, was that I could not provide a vector of arbitrary length for the second input. Using this knowledge I updated my UDFs to create the required index vector by incrementing the input of first(index). The results of my transformation functions are shown using everyone's favorite sample dataset. Based on my experiment I would wager that most issues can be resolved with a clever use of a first(key) argument that selects the required vector from a lookup table. One example I can think of where this approach might work is with differencing transformations where inv_difference requires access to the original values. Again, thanks for all your work @mitchelloharawild.

# Test case
test <- UKLungDeaths
train <- test %>% filter_index(~ '1978 Dec')
new <- new_data(train, 12)

train %>%
  model(ets = ETS(value), trans = ETS(trans_period_days(value, first(index)))) %>%
  forecast(new) %>%
  autoplot(test)

image

# Imitation of lubridate::days_in_month functionality
N_DAYS_IN_QUARTERS <- c(90, 91, 92, 92)

days_in_year <- function(x)
{
  n_days <- rep(365, length(x))
  n_days[lubridate::leap_year(x)] <- 366
  n_days
}

days_in_quarter <- function(x) 
{
  quarter_x <- lubridate::quarter(x)
  n_days <- N_DAYS_IN_QUARTERS[quarter_x]
  n_days[quarter_x == 1 & lubridate::leap_year(x)] <- 91
  n_days
}

# custom UDFs
period_days <- function(x, i){
  i <- i + 1:length(x) - 1
  if (is_yearmonth(i))
    x <- x / lubridate::days_in_month(i)
  if (is_yearquarter(i)) 
    x <- x / days_in_quarter(i)
  if (is.numeric(i)) 
   x <- x / days_in_year(i)
  as.vector(x)
}

inv_period_days <- function(x, i){
  i <- i + 1:length(x) - 1
  if (is_yearmonth(i)) 
    x <- x * lubridate::days_in_month(i)
  if (is_yearquarter(i)) 
    x <- x * days_in_quarter(i)
  if (is.numeric(i)) 
    x <- x * days_in_year(i)
  as.vector(x)
}

trans_period_days <- new_transformation(transformation = period_days, inverse = inv_period_days)
mitchelloharawild commented 2 years ago

Hi @rando-brando, I think you're having issues with fable determining which variable is the one you want to predict. If you provide a transformation that varies over time (say days in period), then the function will have two inputs of the same length.

In your case, trans_period_days(value, index) has value with the same length as index. So the code doesn't know if you want to predict value or index. If you use first(index) it now has length 1, but the results you will get from your function won't be right. You can explicitly define the response variable of a model formula using the resp() function, for instance trans_period_days(resp(value), index). I believe this will work for your purposes.

rando-brando commented 2 years ago

Thank you for the fast response! I followed your suggestion and updated the function by removing the index incrementer and adding resp(value) as shown below. I no longer get the same error but this new one instead. I also can confirm the unintended result you mentioned from first(index). The incremeter appeared to work correctly via mutate but the backtransform uses 31 (from Jan) for all values.

test <- UKLungDeaths <- as_tsibble(cbind(mdeaths, fdeaths), pivot_longer = TRUE)
train <- test %>% filter_index(~ '1978 Dec')
new <- new_data(train, 12)
train %>%
     model(trans = ETS(trans_period_days_v2(resp(value), index))) %>% 
     forecast(new) %>%
     autoplot(test)
#Error in `mutate()`:
#! Problem while computing `trans = (function (object, ...) ...`.
#Caused by error in `hessian.default()`:
#! Richardson method for hessian assumes a scalar valued function.

For reference the updated UDF transformation is provided below along with a sample output comparison from mutate.

#transformation v2
period_days_v2 <- function(x, i){
  if (is_yearmonth(i))
    x <- x / lubridate::days_in_month(i)
  if (is_yearquarter(i)) 
    x <- x / days_in_quarter(i)
  if (is.numeric(i)) 
   x <- x / days_in_year(i)
  as.vector(x)
}
inv_period_days_v2 <- function(x, i){
  if (is_yearmonth(i)) 
    x <- x * lubridate::days_in_month(i)
  if (is_yearquarter(i)) 
    x <- x * days_in_quarter(i)
  if (is.numeric(i)) 
    x <- x * days_in_year(i)
  as.vector(x)
}
trans_period_days_v2 <- new_transformation(period_days_v2, inv_period_days_v2)

# v1 vs v2 - intended result of value / days_in_month(index) appears correct
train %>% 
   group_by_key %>% 
   mutate(old = trans_period_days(value, first(index)),
               new = trans_period_days_v2(value , index))

# A tsibble: 120 x 5 [1M]
# Key:       key [2]
# Groups:    key [2]
      index key     value   old   new
      <mth> <chr>   <dbl> <dbl> <dbl>
 1 1974 Jan mdeaths  2134  68.8  68.8
 2 1974 Feb mdeaths  1863  66.5  66.5
 3 1974 Mar mdeaths  1877  60.5  60.5
 4 1974 Apr mdeaths  1877  62.6  62.6
 5 1974 May mdeaths  1492  48.1  48.1
 6 1974 Jun mdeaths  1249  41.6  41.6
 7 1974 Jul mdeaths  1280  41.3  41.3
 8 1974 Aug mdeaths  1131  36.5  36.5
 9 1974 Sep mdeaths  1209  40.3  40.3
10 1974 Oct mdeaths  1492  48.1  48.1
mitchelloharawild commented 2 years ago

Interesting that you would encounter an error like that, it indicates that there is some issue computing the mean of the back-transformed forecast distribution. This is more of an implementation issue than anything, and is fixable. Could you try plotting the median instead and see if you get an error? This can be done with: autoplot(<fable>, point_forecast = lst(median)).

Alternatively you could try writing the response as value / days_in(index) where days_in() is a function you'll need to write that computes the number of days in that period. Then the transformation function is simpler, which might work for computing the mean using the current implementation.

rando-brando commented 2 years ago

I should have been more clear. The error is occurring in forecast so I applied median there instead. This new error makes me think my UDF is 12x larger than expected which might mean you are on to something. I'll try modifying the transformation function with your suggestion and see how that goes.

train %>%
   model(trans = ETS(days_in_period_v2(resp(value), index))) %>% 
   forecast(new, point_forecast = list(median))
#Error in `mutate()`:
#! Problem while computing `trans = (function (object, ...) ...`.
#Caused by error:
#! Assigned data `point_fc` must be compatible with existing data.
#✖ Existing data has 12 rows.
#✖ Assigned data has 144 rows.
#ℹ Only vectors of size 1 are recycled.
#Run `rlang::last_error()` to see where the error occurred.
mitchelloharawild commented 2 years ago

I'll try out your code and see if any changes are needed in the software to make this work.

rando-brando commented 2 years ago

After doing a quick trace its clear that the approach you suggested with trans_period_days(resp(value), index) works correctly for the modeling step providing an identical model and parameters as a pre-transformed input. When using the resp approach the entire value vector and index vector are passed by key to x and i as expected. However, during the back-transformation it has the appearance that each value from forecast(new) is passed to x sequentially rather than the entire vector at once like before. i on the other hand receives the entire index vector on every pass. As a result the back-transform repeatedly back-transforms the forecast value for the length of new$index which it was passed. I show the print output below prior to crashing. Is this the expected behavior for the back-transform? If so it seems that the index should be passed sequentially as well. Hopefully that makes sense.

# first transformed forecast of 25.76600 is multiplied by the length of days_in_month(i) or (31, 28, 31, 30 ...)
    Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec 
798.746 721.448 798.746 772.980 798.746 772.980 798.746 798.746 772.980 798.746 772.980 798.746
   Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec 
798.746 721.448 798.746 772.980 798.746 772.980 798.746 798.746 772.980 798.746 772.980 798.746
# <crashes here as it expects a return of length 1 not 12>

# simplified transforms
period_days_v3 <- function(x, i){
  x / lubridate::days_in_month(i)
}
inv_period_days_v3 <- function(x, i){
  x * lubridate::days_in_month(i)
}
trans_period_days_v3 <- new_transformation(period_days_v3, inv_period_days_v3)
# expected result of transformed forecast without back-transform
train %>%
    group_by_key() %>%
    mutate(value = trans_period_days_v3(value, index)) %>% 
    model(trans = ETS(value)) %>%
    forecast(new)

# A fable: 24 x 5 [1M]
# Key:     key, .model [2]
   key     .model    index      value .mean
   <chr>   <chr>     <mth>     <dist> <dbl>
 1 fdeaths trans  1979 Jan N(26, 8.6)  25.8
 2 fdeaths trans  1979 Feb  N(29, 11)  29.1
 3 fdeaths trans  1979 Mar N(24, 7.8)  24.4
 4 fdeaths trans  1979 Apr N(20, 5.2)  19.9
 5 fdeaths trans  1979 May N(15, 3.2)  15.4
 6 fdeaths trans  1979 Jun N(14, 2.5)  13.6
 7 fdeaths trans  1979 Jul N(13, 2.3)  13.0
 8 fdeaths trans  1979 Aug N(12, 1.8)  11.5
 9 fdeaths trans  1979 Sep   N(12, 2)  12.0
10 fdeaths trans  1979 Oct N(14, 2.8)  14.5