Enhancement::Unwrapping dynamic features in forecasting dataset

Hi Nick,

I hope that you are doing fine.

I am using the periodogram function from aTSA package to get the top frequencies of my time series and build the fourier componentes with the forecast::fourier() function consequiently. I am also using time_tk::tk_augment_timeseries_signature() and time_tk::tk_augment_roll_apply() to get calendar features and moving averages.

I consider all of these features dynamic as the fourier function can be called with h as forecasting horizon to get this components for the next h steps and the same with the time_tk functions as we only need to use the tk_make_future_timeseries to generate the time series and build the calendar features consequently. With these two approaches I can build my models pretty quickly using parsnip or caret but still I need to handle the moving averages as well as the lags (which is a pain on the ass to be honest with you). Because of that I decided to give it a try to the package, as I mentioned before I think is the real deal for 2019 in time series forecasting :)

The problem that I have is that it is really complex to unwrap these features in the forecasting dataset. My functions for augmentation are as follows:

featureAugmentationTimeBased <- function (df, target, time_index, ma_windows, ma_aligment) {
  df <- df %>% tk_augment_timeseries_signature({{time_index}}) %>% 
    tk_augment_roll_apply({{target}}, .period = ma_windows, .f = AVERAGE, .align = ma_aligment) %>% 
    select_if(negate(is.factor)) %>%  
    select_if(negate(is.character)) %>% 
    drop_na() 
    return(df)
}

featureAugmentationFourier <- function (df, target, fourier_freq, fourier_components_number) {
  ts <- msts(df %>% select({{target}}), seasonal.periods = fourier_freq)
  ts <-fourier(ts, K = fourier_components_number)
  ts <- as_tibble(ts)
  joined_fourier <- bind_cols(df, ts)
  return(joined_fourier)
}

As the number of fourier components and seasonal periods are different depending on the input (and the same happens with the moving average windows) it is really tricky to hard-code the unwrapping as:

for (i in seq_along(data_forecast_list)) {
  data_forecast_list[[i]]$featureone <- 
  data_forecast_list[[i]]$featuretwo <- 
 ......
}

Do you have some suggestion for solving this issue? Could it be added as an enhancement for the next version?

BR /E

You bring up some good points. I'm on the cusp of making another jupyter notebook similar to the Spark and H2O one that focuses on ARIMA vs. the adaptive LASSO and how the feature selection/strength is comparable between methods.

This seems like a good opportunity to struggle through this problem.

I'm sitting on the v0.9.0 release of forecastML until R 4.0 gets released tomorrow and dplyr v1.0.0 gets released in early may. I've made quite a few fixes/enhancements since v0.8.0. If I can figure this out and get it unit tested by the first week of May then it'll make that CRAN release. To be continued. I'll reply here with any progress.

Hi Nick,

Kind of finding my way around as:

    for (i in seq_along(data_forecast_list)) {
      data_forecast_list[[i]] <- data_forecast_list[[i]] %>% select_if(function(x){!all(is.na(x))}) %>% 
        tk_augment_timeseries_signature(., "index") %>% select(-c(diff)) %>% 
        select_if(negate(is.factor)) %>%  
        select_if(negate(is.character))

    }

Now trying to fit the fourier features. Maybe you should check the integration with the time_tk package.

I'll keep you posted.

BR /E

Glad that it's semi-working out for now. Never used timetk. Looks like it's popular. I'll play around with it when making an applied example.

In the spirit of keeping this package as generic as possible, I probably won't take a dependency here.

The direction I'm leaning is something like create_lagged_df(..., type = "forecast", predict_future = list("my_existing_feature_name_1" = my_custom_fun_1, "my_existing_feature_name_2" = my_custom_fun_2))

where the custom function returns a vector of any length from 1:infinity and then, behind the scenes, if the forecast horizon is, say, 3 steps ahead, only the first 3 values are kept...that way the same function can work seamlessly across multiple direct horizons.

my_custom_fun_1 <- function(index, x) {
  return(timetk::some_fun(index, x))
}

This block of code should run with the GitHub version. It's a combination of the lightning example in the README and timetk. create_lagged_df(..., predict_future = NULL) is the new argument. An example function is below.

library(tidyverse)
library(recipes)
library(yardstick)
library(timetk)
library(glmnet)

data("data_seatbelts", package = "forecastML")

dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = "1 month")
data_seatbelts$date <- dates

# Add time series signature
recipe_spec_timeseries <- recipe(DriversKilled ~ ., data = data_seatbelts) %>%
  step_timeseries_signature(date)

data_seatbelts <- recipes::bake(recipes::prep(recipe_spec_timeseries), new_data = data_seatbelts)

data_seatbelts$date <- NULL

data_seatbelts[] <- lapply(data_seatbelts, as.numeric)  # For glmnet.

# The names of the new date features from timetk plus another non-date dynamic feature.
dynamic_features <- c("law", names(data_seatbelts)[-(1:4)])

data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", method = "direct",
                                           outcome_col = 4, lookback = 1:12, horizon = 1:12,
                                           dates = dates, frequency = "1 month",
                                           dynamic_features = dynamic_features)

windows <- forecastML::create_windows(data_train, window_length = 0)

model_fun <- function(data) {
  x <- as.matrix(data[, -1, drop = FALSE])
  y <- as.matrix(data[, 1, drop = FALSE])
  model <- glmnet::cv.glmnet(x, y, alpha = 1)
}

model_results <- forecastML::train_model(data_train, windows, model_name = "LASSO", model_function = model_fun)

prediction_fun <- function(model, data_features) {
  data_pred <- data.frame("y_pred" = predict(model, as.matrix(data_features)),
                          "y_pred_lower" = predict(model, as.matrix(data_features)) - 30,
                          "y_pred_upper" = predict(model, as.matrix(data_features)) + 30)
}
#------------------------------------------------------------------------------
# A custom function to predict the future values of dynamic features.
# 'data' is the input data to create_lagged_df().
# 'index' is the input date vector in create_lagged_df().
# Both input arguments are positional.
predict_future <- function(data, index) {

  # We'll set n_future to max(horizons). This doesn't have to be the case. That is,
  # if only the next few future values are known, the remaining ones can remain 'NA'.
  index_future <- index %>% timetk::tk_make_future_timeseries(n_future = 12)

  data_future <- tibble::tibble("date" = index_future)  # 'date' is the same column name used in the training data.

  recipe_future <- recipes::recipe(data_future) %>%  # Make all of the time features.
    step_timeseries_signature(date)

  data_future <- recipes::bake(recipes::prep(recipe_future), new_data = data_future)  # Recipe stuff.

  data_future$law <- 1  # A non-date-based dynamic feature whose value we know.

  names(data_future)[1] <- "index"

  # Numeric features are required for glmnet in this example.
  data_future[, -1] <- lapply(data_future[, -1], as.numeric)

  # The returned data.frame needs to have an "index" column of dates and 1 or more predicted dynamic features.
  # No need to predict all dynamic features or all future time horizons.
  return(data_future)
}
#------------------------------------------------------------------------------

data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", method = "direct",
                                              outcome_col = 4, lookback = 1:12, horizon = 1:12,
                                              dates = dates, frequency = "1 month",
                                              dynamic_features = dynamic_features,
                                              predict_future = predict_future
                                              )

data_forecasts <- predict(model_results, prediction_function = list(prediction_fun), data = data_forecast)

data_forecasts <- forecastML::combine_forecasts(data_forecasts)

plot(data_forecasts, data_actual = data_seatbelts[-(1:150), ], actual_indices = dates[-(1:150)])

test_plot

Notes before I close:

New argument create_lagged_df(..., predict_future = NULL).
The purpose is to predict the future values of dynamic regressors in forecast dataframes.
The returned data.frame from the custom function must have a date column titled "index".
For groups, a group column is not necessary. However, if the dynamic features vary by group then a column of group IDs with the same column name(s) as given in create_lagged_df(..., groups = NULL) will make sure that the group-level dynamic features get to the right place.
The returned data.frame from the custom feature prediction function can have fewer or more rows than the direct forecast horizon: fewer rows will leave NAs in the forecast dataset and extra rows will be ignored.
At present, the predict_future() function is evaluated once per direct forecast horizon. This could still be optimized for speed.

nredell / forecastML

Enhancement::Unwrapping dynamic features in forecasting dataset #30