topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Timeslice with longitudinal data #1291

Open geraldine28 opened 2 years ago

geraldine28 commented 2 years ago

I am new to caret and have a beginner's question about the 'timeslice' resampling method, which caret's 'train' function accepts through 'trainControl'.

My original data is a balanced panel with 22 years and 37,442 unique cross-sectional observations. Here is a small example data set with the same structure:

dat <- data.frame( id = sort( rep( c( "A", "B", "C" ), 22 )), 
                    t = rep( 2000:2021, 3 ), 
                    y = round( runif( 66, 10, 200 ), 0 ), 
                   x1 = rnorm( 66 ), 
                   x2 = rbinom( 66, 3, 0.3 ))

I tried to use 'train' to run a simple random forest model on the data with a fixed time window of 5 years and a horizon of 2 years:

library( caret )
library( ranger )

model <- train( 
             y ~ ., 
             tuneLength = 5, 
             data = dat, 
             method = "ranger", 
             trControl = trainControl(
                              method = "timeslice", 
                              initialWindow = 5,
                              horizon = 2,
                              allowParallel = TRUE,
                              verboseIter = TRUE, 
                              seeds = NULL
             ),
             metric = "RMSE"
          )

However, this gives the following error:

Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) +  : 
  cannot take a sample larger than the population when 'replace = FALSE'

I presume this error occurs because the data is not a single time series but a longitudinal data set. So my question is: how can longitudinal data be handled with 'timeslice'?
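
One possible reading of the error, based only on the call quoted in the message: `sample.int()` is asked to draw `num_rs * nrow(trainInfo$loop)` (plus a truncated term) distinct integers out of 1,000,000, which fails once that product exceeds one million. The sketch below works through the arithmetic under several assumptions that are not confirmed in this thread: that `num_rs` is the number of resamples, that `timeslice` treats every row of the stacked panel as one time point, that each of the 37,442 observations appears in all 22 years, and that the tuning grid has at least two rows (`grid_rows` is a hypothetical value).

```r
# Figures taken from the question.
n_units        <- 37442
n_years        <- 22
initial_window <- 5
horizon        <- 2

# Rows in the stacked (long-format) balanced panel.
n_rows <- n_units * n_years

# ASSUMPTION: with a row-based rolling window and no skipping, a slice can
# start at every row where a full training window plus horizon still fits.
n_slices <- n_rows - initial_window - horizon + 1

# ASSUMPTION: hypothetical number of tuning-grid rows; the real value
# depends on the model and on tuneLength.
grid_rows <- 2

# Size of the draw without the truncated extra term from the error message.
seeds_needed <- n_slices * grid_rows

n_rows                # 823724
n_slices              # 823718
seeds_needed > 1e6    # TRUE: more draws than the population of 1,000,000
```

Under these assumptions the 66-row example data would yield only 60 slices, so the error would appear on the full data set rather than on the toy example.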

jsacerot commented 1 year ago

You can define your target variable (y) and predictors (x) separately.

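# Keep dat as a data frame: a matrix would coerce every column to
# character and does not support `$`. Encode id as a factor instead.
dat$id <- factor(dat$id)
drop.col <- -3   # position of the y column in dat

# The x/y interface takes no `data` argument.
model <- train( 
             y = dat$y, 
             x = dat[, drop.col],
             tuneLength = 5, 
             method = "ranger", 
             trControl = trainControl(
                              method = "timeslice", 
                              initialWindow = 5,
                              horizon = 2,
                              allowParallel = TRUE,
                              verboseIter = TRUE, 
                              seeds = NULL
             ),
             metric = "RMSE"
          )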

NOTE: id needs to be encoded as a factor, e.g. with factor().
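
For panel data, a possible alternative is to build the slices by year rather than by row and hand them to `trainControl` through its `index` and `indexOut` arguments, so that every training window contains all units for 5 consecutive years and every test window contains all units for the following 2 years. This is a minimal sketch on the toy data from the question, not a method confirmed by the package authors in this thread. The names `initial_window`, `stops`, `index` and `index_out` are illustrative, and the seed is added only to make the toy data reproducible.

```r
# Toy panel with the same layout as the example in the question.
set.seed(1)
dat <- data.frame(
  id = factor(rep(c("A", "B", "C"), each = 22)),
  t  = rep(2000:2021, times = 3),
  y  = round(runif(66, 10, 200)),
  x1 = rnorm(66),
  x2 = rbinom(66, 3, 0.3)
)

initial_window <- 5  # years in each training window
horizon        <- 2  # years in each test window

years <- sort(unique(dat$t))
# Position of the last training year of each slice; the horizon must
# still fit after it.
stops <- seq(initial_window, length(years) - horizon)

# Rows of all units observed in the training years of each slice.
index <- lapply(stops, function(s) {
  which(dat$t %in% years[(s - initial_window + 1):s])
})
# Rows of all units observed in the `horizon` years that follow.
index_out <- lapply(stops, function(s) {
  which(dat$t %in% years[(s + 1):(s + horizon)])
})
names(index) <- names(index_out) <- paste0("Slice", seq_along(stops))

length(index)  # number of year-based slices

# Pass the custom slices to caret, if it is installed.
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(index = index, indexOut = index_out,
                              verboseIter = TRUE)
}
```

The resulting `ctrl` object could then be supplied as `trControl` in the `train` call. Because the number of slices now depends on the number of years rather than the number of rows, it stays small even for the full panel.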