Information leakage with rolling averages.

Shafi2016 commented 2 years ago

Hello @topepo, in your book "Feature Engineering and Selection: A Practical Approach for Predictive Models" you mention that, to avoid data leakage, we should first split the data into training and test sets and only then take rolling averages of the independent features. How likely is data leakage if we instead take lagged rolling averages, as shown below? Suppose we are forecasting six quarters into the future: we first take a lag of six, then compute rolling averages on that lag-6 series.

  tk_augment_slidify(
    .value   = PI_lag6,
    .period  = rolling_periods,
    .f       = mean,
    .align   = "center",
    .partial = TRUE)

library(Quandl)

# Tidymodeling
library(modeltime.ensemble)
library(modeltime)
library(tidymodels)

# Base Models

library(glmnet)
library(xgboost)

# Core Packages
library(tidyverse)
library(lubridate)
library(timetk)
df1 <- Quandl(code = "FRED/PINCOME",
              type = "raw",
              collapse = "monthly",
              order = "asc",
              end_date="2017-12-31")
df2 <- Quandl(code = "FRED/GDP",
              type = "raw",
              collapse = "monthly",
              order = "asc",
              end_date="2017-12-31")

per <- df1 %>% rename(PI = Value) %>% select(-Date)
gdp <- df2 %>% rename(GDP = Value)

data <- cbind(gdp, per)
data1 <- tk_augment_differences(
  .data = data,
  .value = GDP:PI,
  .lags = 1,
  .differences = 1,
  .log = TRUE,
  .names = "auto") %>%
  select(-GDP, -PI) %>%
  rename(GDP = GDP_lag1_diff1, PI = PI_lag1_diff1) %>%
  drop_na()

horizon    <- 6
lag_period <- 6
rolling_periods <- c(10:12)
data_pre_full <- data1 %>%
  # Add future window ----
  bind_rows(
    future_frame(.data = ., .date_var = Date, .length_out = horizon)
  ) %>%
  # Add lags ----
  tk_augment_lags(
    .value = GDP:PI,
    .lags  = lag_period
  ) %>%
  # Add rolling averages of the lag-6 series ----
  tk_augment_slidify(
    .value   = PI_lag6,
    .period  = rolling_periods,
    .f       = mean,
    .align   = "center",
    .partial = TRUE)

topepo commented 2 years ago

The code mostly uses @mdancho84's code, so he would have a better answer.

If you are going to do that, we would advise doing it outside of the tidymodels code. We deliberately prevent the testing data from including the training data. There are some grey areas but, in those cases, you would have to pre-lag the data or pre-compute moving window statistics.
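
To make "pre-lag the data or pre-compute moving window statistics" concrete, here is a minimal base R sketch. It is not code from timetk or tidymodels; the helper names `lag_vec` and `trailing_mean` and the toy series are assumptions made for illustration. It lags a series by the forecast horizon and then takes a trailing (right-aligned) mean, so each row only uses observations that are at least six steps old.

```r
# Hypothetical helpers (not part of timetk or tidymodels); base R only.

# Shift a vector forward by k positions, padding the start with NA.
lag_vec <- function(x, k) c(rep(NA_real_, k), head(x, -k))

# Trailing mean: sides = 1 makes stats::filter() use only the current
# and earlier positions of its input.
trailing_mean <- function(x, width) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 1))
}

x       <- as.numeric(1:20)  # toy series: each value equals its time index
horizon <- 6

x_lag6 <- lag_vec(x, horizon)      # row t holds x[t - 6]
feat   <- trailing_mean(x_lag6, 3) # row t averages x[t - 8], x[t - 7], x[t - 6]

feat[20]  # 13: the mean of 12, 13, 14, so the newest value used is 6 steps old
```

Because the window is applied after the lag and only looks backward, the feature for any row can be computed from data available six steps earlier, whichever side of the train/test split the row falls on.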

Shafi2016 commented 2 years ago

Thank you @topepo. Yes, the code is from the timetk R package. I have discussed this with @mdancho84, and his opinion is that taking lag-based rolling averages is fine. What confuses me is that you suggest splitting the data first and then taking the rolling averages. Goulet et al. (2021) (https://www.stevanovic.uqam.ca/GCLSS_MDTM_WP.pdf) suggest using a recursive approach to avoid data snooping. I found that if the rolling windows are about twice the horizon, that is, horizon <- 6 and rolling_periods <- c(10:12), models such as Prophet, ARIMA, and linear regression perform extremely well. In some cases R squared reaches 1, which raises doubts. Nonlinear models (XGBoost, LightGBM, LSTM, RNN), on the other hand, do not perform very well on the same data. It seems that models with linear components are affected by forward-looking bias, while nonlinear models are not.
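
One thing that can be checked directly is how far forward a centered window reaches. The base R sketch below is an illustration under stated assumptions, not a diagnosis of the results above: `lag_vec` is a hypothetical helper, the series is a toy one whose values equal their time index, and the width is fixed at 11 (an odd width, so the center is unambiguous). It compares the newest observation inside a centered window with the newest one inside a trailing window, both taken over a lag-6 series.

```r
# Hypothetical helper; base R only.
lag_vec <- function(x, k) c(rep(NA_real_, k), head(x, -k))

x      <- as.numeric(1:40)  # toy series: each value equals its time index
x_lag6 <- lag_vec(x, 6)     # row t holds x[t - 6]

t     <- 30                 # row whose feature we inspect
width <- 11
half  <- (width - 1) / 2

# Centered window: rows t - 5 .. t + 5 of the lagged series.
centered_window <- x_lag6[(t - half):(t + half)]

# Trailing (right-aligned) window: rows t - 10 .. t of the lagged series.
trailing_window <- x_lag6[(t - width + 1):t]

max(centered_window)  # 29: includes the observation from time t - 1
max(trailing_window)  # 24: nothing newer than time t - 6
```

With a six-step horizon, a forecast for time 30 would be made with data up to time 24. In this toy setup the centered window also averages in times 25 to 29, which undoes most of the lag; the trailing window does not.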

juliasilge commented 2 years ago

@Shafi2016 To clarify, are you asking if you should/can compute the lag-based or window-based statistics before splitting into training and testing, or if doing so is an example of data leakage?

Shafi2016 commented 2 years ago

Thank you @juliasilge. Yes: can we compute the lag-based rolling averages before splitting the data into training and testing? And how many rolling lags should we take? As in the example above, if the forecast horizon is 6 months and rolling_periods is about double that (10, 11, 12), then some models (Prophet, linear regression, etc.) do extremely well on the test data, which makes me doubtful. Kaggle competition winners have used lagged rolling averages.
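
One way to probe this question empirically is a perturbation check: change every observation after the forecast origin and see whether the features on the test rows move. The base R sketch below shows the idea on simulated data. The helpers `lag_vec` and `roll_mean`, the split point, and the window width of 11 are all assumptions for illustration, and the check is a sanity test rather than a proof that a feature is leak-free.

```r
# Hypothetical helpers; base R only.
lag_vec <- function(x, k) c(rep(NA_real_, k), head(x, -k))

# sides = 1: trailing window; sides = 2: centered window (odd width).
roll_mean <- function(x, width, sides) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = sides))
}

set.seed(1)
x         <- cumsum(rnorm(70))  # simulated series
n_train   <- 48                 # forecast origin: last training row
test_rows <- 49:54              # six-step test window

# Alter everything after the forecast origin.
x_perturbed <- x
x_perturbed[(n_train + 1):length(x)] <- x[(n_train + 1):length(x)] + 100

# Lag by the horizon, then take a rolling mean of width 11.
make_feature <- function(x, sides) roll_mean(lag_vec(x, 6), 11, sides)

trailing_unchanged <- isTRUE(all.equal(
  make_feature(x, 1)[test_rows], make_feature(x_perturbed, 1)[test_rows]
))
centered_unchanged <- isTRUE(all.equal(
  make_feature(x, 2)[test_rows], make_feature(x_perturbed, 2)[test_rows]
))

trailing_unchanged  # TRUE: test-row features ignore post-origin data
centered_unchanged  # FALSE: test-row features depend on post-origin data
```

If a feature on the test rows is unaffected by the perturbation, computing it before or after the split gives the same values on those rows; if it changes, the feature is drawing on information that would not be available when the forecast is made.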

juliasilge commented 2 years ago

Like @topepo mentioned above, there are some gray areas here. This is more of a discussion about modeling practice and choices, rather than a bug report or feature request about rsample as software, so we invite you to post on RStudio Community to gather folks' experiences, opinions, and input in this area.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / rsample

Information leakage with rolling averages. #269