nredell / forecastML

An R package with Python support for multi-step-ahead forecasting with machine learning and deep learning algorithms
Other
130 stars 23 forks source link

Enhancement::Unwrapping dynamic features in forecasting dataset #30

Closed edgBR closed 4 years ago

edgBR commented 4 years ago

Hi Nick,

I hope that you are doing fine.

I am using the periodogram function from aTSA package to get the top frequencies of my time series and build the fourier componentes with the forecast::fourier() function consequiently. I am also using time_tk::tk_augment_timeseries_signature() and time_tk::tk_augment_roll_apply() to get calendar features and moving averages.

I consider all of these features dynamic as the fourier function can be called with h as forecasting horizon to get this components for the next h steps and the same with the time_tk functions as we only need to use the tk_make_future_timeseries to generate the time series and build the calendar features consequently. With these two approaches I can build my models pretty quickly using parsnip or caret but still I need to handle the moving averages as well as the lags (which is a pain on the ass to be honest with you). Because of that I decided to give it a try to the package, as I mentioned before I think is the real deal for 2019 in time series forecasting :)

The problem that I have is that it is really complex to unwrap these features in the forecasting dataset. My functions for augmentation are as follows:

featureAugmentationTimeBased <- function (df, target, time_index, ma_windows, ma_aligment) {
  df <- df %>% tk_augment_timeseries_signature({{time_index}}) %>% 
    tk_augment_roll_apply({{target}}, .period = ma_windows, .f = AVERAGE, .align = ma_aligment) %>% 
    select_if(negate(is.factor)) %>%  
    select_if(negate(is.character)) %>% 
    drop_na() 
    return(df)
}

featureAugmentationFourier <- function (df, target, fourier_freq, fourier_components_number) {
  ts <- msts(df %>% select({{target}}), seasonal.periods = fourier_freq)
  ts <-fourier(ts, K = fourier_components_number)
  ts <- as_tibble(ts)
  joined_fourier <- bind_cols(df, ts)
  return(joined_fourier)
}

As the number of fourier components and seasonal periods are different depending on the input (and the same happens with the moving average windows) it is really tricky to hard-code the unwrapping as:

for (i in seq_along(data_forecast_list)) {
  data_forecast_list[[i]]$featureone <- 
  data_forecast_list[[i]]$featuretwo <- 
 ......
}

Do you have some suggestion for solving this issue? Could it be added as an enhancement for the next version?

BR /E

nredell commented 4 years ago

You bring up some good points. I'm on the cusp of making another jupyter notebook similar to the Spark and H2O one that focuses on ARIMA vs. the adaptive LASSO and how the feature selection/strength is comparable between methods.

This seems like a good opportunity to struggle through this problem.

I'm sitting on the v0.9.0 release of forecastML until R 4.0 gets released tomorrow and dplyr v1.0.0 gets released in early may. I've made quite a few fixes/enhancements since v0.8.0. If I can figure this out and get it unit tested by the first week of May then it'll make that CRAN release. To be continued. I'll reply here with any progress.

edgBR commented 4 years ago

Hi Nick,

Kind of finding my way around as:

    for (i in seq_along(data_forecast_list)) {
      data_forecast_list[[i]] <- data_forecast_list[[i]] %>% select_if(function(x){!all(is.na(x))}) %>% 
        tk_augment_timeseries_signature(., "index") %>% select(-c(diff)) %>% 
        select_if(negate(is.factor)) %>%  
        select_if(negate(is.character))

    }

Now trying to fit the fourier features. Maybe you should check the integration with the time_tk package.

I'll keep you posted.

BR /E

nredell commented 4 years ago

Glad that it's semi-working out for now. Never used timetk. Looks like it's popular. I'll play around with it when making an applied example.

In the spirit of keeping this package as generic as possible, I probably won't take a dependency here.

The direction I'm leaning is something like create_lagged_df(..., type = "forecast", predict_future = list("my_existing_feature_name_1" = my_custom_fun_1, "my_existing_feature_name_2" = my_custom_fun_2))

where the custom function returns a vector of any length from 1:infinity and then, behind the scenes, if the forecast horizon is, say, 3 steps ahead, only the first 3 values are kept...that way the same function can work seamlessly across multiple direct horizons.

my_custom_fun_1 <- function(index, x) {
  return(timetk::some_fun(index, x))
}
nredell commented 4 years ago

This block of code should run with the GitHub version. It's a combination of the lightning example in the README and timetk. create_lagged_df(..., predict_future = NULL) is the new argument. An example function is below.

library(tidyverse)
library(recipes)
library(yardstick)
library(timetk)
library(glmnet)

data("data_seatbelts", package = "forecastML")

dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = "1 month")
data_seatbelts$date <- dates

# Add time series signature
recipe_spec_timeseries <- recipe(DriversKilled ~ ., data = data_seatbelts) %>%
  step_timeseries_signature(date)

data_seatbelts <- recipes::bake(recipes::prep(recipe_spec_timeseries), new_data = data_seatbelts)

data_seatbelts$date <- NULL

data_seatbelts[] <- lapply(data_seatbelts, as.numeric)  # For glmnet.

# The names of the new date features from timetk plus another non-date dynamic feature.
dynamic_features <- c("law", names(data_seatbelts)[-(1:4)])

data_train <- forecastML::create_lagged_df(data_seatbelts, type = "train", method = "direct",
                                           outcome_col = 4, lookback = 1:12, horizon = 1:12,
                                           dates = dates, frequency = "1 month",
                                           dynamic_features = dynamic_features)

windows <- forecastML::create_windows(data_train, window_length = 0)

model_fun <- function(data) {
  x <- as.matrix(data[, -1, drop = FALSE])
  y <- as.matrix(data[, 1, drop = FALSE])
  model <- glmnet::cv.glmnet(x, y, alpha = 1)
}

model_results <- forecastML::train_model(data_train, windows, model_name = "LASSO", model_function = model_fun)

prediction_fun <- function(model, data_features) {
  data_pred <- data.frame("y_pred" = predict(model, as.matrix(data_features)),
                          "y_pred_lower" = predict(model, as.matrix(data_features)) - 30,
                          "y_pred_upper" = predict(model, as.matrix(data_features)) + 30)
}
#------------------------------------------------------------------------------
# A custom function to predict the future values of dynamic features.
# 'data' is the input data to create_lagged_df().
# 'index' is the input date vector in create_lagged_df().
# Both input arguments are positional.
predict_future <- function(data, index) {

  # We'll set n_future to max(horizons). This doesn't have to be the case. That is,
  # if only the next few future values are known, the remaining ones can remain 'NA'.
  index_future <- index %>% timetk::tk_make_future_timeseries(n_future = 12)

  data_future <- tibble::tibble("date" = index_future)  # 'date' is the same column name used in the training data.

  recipe_future <- recipes::recipe(data_future) %>%  # Make all of the time features.
    step_timeseries_signature(date)

  data_future <- recipes::bake(recipes::prep(recipe_future), new_data = data_future)  # Recipe stuff.

  data_future$law <- 1  # A non-date-based dynamic feature whose value we know.

  names(data_future)[1] <- "index"

  # Numeric features are required for glmnet in this example.
  data_future[, -1] <- lapply(data_future[, -1], as.numeric)

  # The returned data.frame needs to have an "index" column of dates and 1 or more predicted dynamic features.
  # No need to predict all dynamic features or all future time horizons.
  return(data_future)
}
#------------------------------------------------------------------------------

data_forecast <- forecastML::create_lagged_df(data_seatbelts, type = "forecast", method = "direct",
                                              outcome_col = 4, lookback = 1:12, horizon = 1:12,
                                              dates = dates, frequency = "1 month",
                                              dynamic_features = dynamic_features,
                                              predict_future = predict_future
                                              )

data_forecasts <- predict(model_results, prediction_function = list(prediction_fun), data = data_forecast)

data_forecasts <- forecastML::combine_forecasts(data_forecasts)

plot(data_forecasts, data_actual = data_seatbelts[-(1:150), ], actual_indices = dates[-(1:150)])

test_plot

nredell commented 4 years ago

Notes before I close: