tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/

generate() returns NA in rep 1 with lagged xregs #246

Closed. wkdavis closed this issue 4 years ago

wkdavis commented 4 years ago

I might be doing something wrong, but when I use lagged xregs in a model and then generate forecasts, the first n (where n is the max lag period) values are NA, but only for the first .rep.

Basic Forecast (no lags)

Works as expected.

library(tsibbledata)
library(tsibble)
library(fable)
library(dplyr)

fit <- aus_production %>%
  slice(1:(n()-4)) %>%
  model(TSLM = TSLM(Beer ~ Gas))

sim <- generate(fit, times = 3, new_data = slice_tail(aus_production, n = 4))

filter_index(sim, "2009 Q3" ~ "2009 Q4")
#> # A tsibble: 6 x 4 [1Q]
#> # Key:       .model, .rep [3]
#>   .model Quarter .rep  .sim[,1]
#>   <chr>    <qtr> <chr>    <dbl>
#> 1 TSLM   2009 Q3 1         489.
#> 2 TSLM   2009 Q4 1         479.
#> 3 TSLM   2009 Q3 2         487.
#> 4 TSLM   2009 Q4 2         477.
#> 5 TSLM   2009 Q3 3         501.
#> 6 TSLM   2009 Q4 3         467.

Forecast with lagged xregs

fit <- aus_production %>%
  slice(1:(n()-4)) %>%
  model(TSLMlag1 = TSLM(Beer ~ Gas + lag(Gas)))

sim <- generate(fit, times = 3, new_data = slice_tail(aus_production, n = 4))

filter_index(sim, "2009 Q3" ~ "2009 Q4")
#> # A tsibble: 6 x 4 [1Q]
#> # Key:       .model, .rep [3]
#>   .model   Quarter .rep  .sim[,1]
#>   <chr>      <qtr> <chr>    <dbl>
#> 1 TSLMlag1 2009 Q3 1          NA 
#> 2 TSLMlag1 2009 Q4 1         557.
#> 3 TSLMlag1 2009 Q3 2         476.
#> 4 TSLMlag1 2009 Q4 2         570.
#> 5 TSLMlag1 2009 Q3 3         471.
#> 6 TSLMlag1 2009 Q4 3         561.

This model results in an NA forecast for 2009 Q3, but only for the first rep. Initially I saw the NA and thought it was because new_data doesn't contain the lag-1 value of the xreg for its first observation (I would think the model could get it from the original training data, but that's a separate issue). However, I noticed that in all reps > 1, the value for 2009 Q3 was returned successfully.

From what I found, this generalizes to any maximum lag n: the first n values of the first .rep will be NA.

fit <- aus_production %>%
  slice(1:(n()-4)) %>%
  model(TSLMlag2 = TSLM(Beer ~ Gas + lag(Gas) + lag(Gas, 2)))

sim <- generate(fit, times = 3, new_data = slice_tail(aus_production, n = 4))

filter_index(sim, "2009 Q3" ~ "2010 Q1")
#> # A tsibble: 9 x 4 [1Q]
#> # Key:       .model, .rep [3]
#>   .model   Quarter .rep  .sim[,1]
#>   <chr>      <qtr> <chr>    <dbl>
#> 1 TSLMlag2 2009 Q3 1          NA 
#> 2 TSLMlag2 2009 Q4 1          NA 
#> 3 TSLMlag2 2010 Q1 1         485.
#> 4 TSLMlag2 2009 Q3 2         484.
#> 5 TSLMlag2 2009 Q4 2         570.
#> 6 TSLMlag2 2010 Q1 2         483.
#> 7 TSLMlag2 2009 Q3 3         482.
#> 8 TSLMlag2 2009 Q4 3         566.
#> 9 TSLMlag2 2010 Q1 3         492.

I'm not sure if this is a bug, or if I'm misunderstanding how I am supposed to supply new data to generate().
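A possible workaround, sketched under the assumption that the NA arises because lag() is evaluated on new_data alone: precompute the lagged regressor on the full series before splitting, so the hold-out rows already carry the real lagged values. The column name Gas_lag1 and the object name beer_gas are made up for this illustration and are not part of fable or fabletools.

```r
library(tsibbledata)
library(tsibble)
library(fable)
library(dplyr)

# Compute the lag once on the full series (Gas_lag1 is a hypothetical name),
# so the last four rows hold the true previous-quarter Gas values.
beer_gas <- aus_production %>%
  mutate(Gas_lag1 = dplyr::lag(Gas))

# Fit on everything except the last four quarters, using the precomputed column
# instead of calling lag() inside the model formula.
fit <- beer_gas %>%
  slice(1:(n() - 4)) %>%
  model(TSLMlag1 = TSLM(Beer ~ Gas + Gas_lag1))

# new_data now contains Gas_lag1 for every hold-out row, including 2009 Q3.
sim <- generate(fit, times = 3, new_data = slice_tail(beer_gas, n = 4))

filter_index(sim, "2009 Q3" ~ "2009 Q4")
```

Because the formula no longer calls lag(), nothing has to be recovered from the training data at simulation time. The trade-off is that the lagged columns must be maintained by hand for every lag used.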

mitchelloharawild commented 4 years ago

Thanks, I've added recall for lagged values when generate() is used.

wkdavis commented 4 years ago

Thanks!