topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Feature: allow burn-in in createTimeSlices #1300

Open BPJandree opened 2 years ago

BPJandree commented 2 years ago

Thanks for making one of the best R packages ever!

I'd like to suggest a minor feature for the function createTimeSlices inside https://github.com/topepo/caret/blob/master/pkg/caret/R/createDataPartition.R

There are some validation test statistics whose proofs require the training and test samples to be separated by a small burn-in sample to avoid dependence between the two (mostly to address residual dependence when the model is not correctly specified). For instance, Proposition 3 of Chapter 4 in Andree, B. P. J. (2020), Theory and Application of Dynamic Spatial Time Series Models, Rozenberg Publishers and the Tinbergen Institute, proposes a Diebold-Mariano statistic that tests the significance of log-likelihood differences on a validation sample with a small burn-in.
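To make the idea concrete, here is a minimal base-R sketch of a single split with a burn-in gap. The numbers are illustrative only and the variable names are my own, not part of caret:

```r
# Illustrative values only: one train/test split with a burn-in gap.
initialWindow <- 5   # size of the training window
horizon <- 3         # observations that follow the training window
burnin <- 1          # observations dropped between train and test

train_idx  <- seq_len(initialWindow)            # training indices
burnin_idx <- initialWindow + seq_len(burnin)   # discarded indices
test_idx   <- seq(initialWindow + burnin + 1,   # remaining test indices
                  initialWindow + horizon)

# The burn-in indices belong to neither sample.
stopifnot(length(intersect(burnin_idx, c(train_idx, test_idx))) == 0)
```

The burn-in is taken out of the horizon, so the test sample shrinks from `horizon` to `horizon - burnin` observations.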

Below is a simple modification of the time slices function that would make such things easier to execute.

      createTimeSlices <- function (y, initialWindow, horizon = 1, fixedWindow = TRUE,
          skip = 0, burnin = 0)
      {
          # The burn-in must leave at least one observation in each test slice.
          stopifnot(burnin < horizon)
          # Last training index of each slice.
          stops <- seq(initialWindow, (length(y) - horizon), by = skip + 1)
          if (fixedWindow) {
              # Rolling window of constant size.
              starts <- stops - initialWindow + 1
          }
          else {
              # Expanding window that always starts at the first observation.
              starts <- rep(1, length(stops))
          }
          train <- mapply(seq, starts, stops, SIMPLIFY = FALSE)
          # Skip the first `burnin` observations after each training window.
          test <- mapply(seq, stops + 1 + burnin, stops + horizon, SIMPLIFY = FALSE)
          nums <- gsub(" ", "0", format(stops))
          names(train) <- paste("Training", nums, sep = "")
          names(test) <- paste("Testing", nums, sep = "")
          out <- list(train = train, test = test)
          out
      }

Here I'm using it with a single observation as burn-in:

      > createTimeSlices(1:10, 5, 3, TRUE, 0, 1)
      $train
      $train$Training5
      [1] 1 2 3 4 5

      $train$Training6
      [1] 2 3 4 5 6

      $train$Training7
      [1] 3 4 5 6 7

      $test
      $test$Testing5
      [1] 7 8

      $test$Testing6
      [1] 8 9

      $test$Testing7
      [1]  9 10
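Until something like this is available, the same effect can probably be had by trimming the test slices that the current function returns. The sketch below assumes a helper I'm calling `dropBurnin` (hypothetical, not part of caret) and builds the input list by hand so that it runs without caret loaded:

```r
# Hypothetical helper (not part of caret): drop the first `burnin` indices
# from each test slice, e.g. from createTimeSlices(...)$test.
dropBurnin <- function(test, burnin = 0) {
  # Each slice must keep at least one observation after trimming.
  stopifnot(burnin >= 0, all(lengths(test) > burnin))
  # Logical indexing also behaves correctly when burnin is 0.
  lapply(test, function(idx) idx[seq_along(idx) > burnin])
}

# Test slices in the shape returned by createTimeSlices(1:10, 5, 3)$test.
test <- list(Testing5 = 6:8, Testing6 = 7:9, Testing7 = 8:10)
trimmed <- dropBurnin(test, burnin = 1)

# With caret installed, the trimmed slices could be passed on through the
# index and indexOut arguments of trainControl(), for example:
# ctrl <- caret::trainControl(index = slices$train, indexOut = trimmed)
```

Here `slices` stands for the full result of `createTimeSlices()`. The trimmed list should match the `$test` element shown above.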

I added a simple check that throws an error when the burn-in would discard the entire validation sample, i.e. when burnin >= horizon:

      > createTimeSlices(1:10, 5, 3, TRUE, 0, 10)
      Error in createTimeSlices(1:10, 5, 3, TRUE, 0, 10) :
        burnin < horizon is not TRUE

Kind regards, Bo