topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

GLMNET predict when test data features are in different order #1343

Open RGMurphy opened 1 year ago

RGMurphy commented 1 year ago

The problem

Other caret issues (https://github.com/topepo/caret/issues/968) have highlighted that predict() with glmnet models requires every feature used in training to be present in the test data. However, I have also noticed decreased prediction performance when the columns of the test data are in a different order from the training features. The training and test data come from the same larger dataset, but feature selection performed on the training data before the glmnet model was trained changed the order of the training features. I imagine we see this because glmnet matches features by column position, not column name? If so, is it a requirement that the features in the test data be in the same order as the training features when using glmnet in caret?
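To make the suspected mechanism concrete, here is a minimal base-R sketch. It is not glmnet's or caret's actual code; `beta`, `by_position`, and `by_name` are hypothetical names used only to show how applying coefficients by column position gives a different answer once the columns are shuffled, while matching by name does not.

```r
# Hypothetical illustration only: coefficients stored in training column order
beta <- c(a = 2, b = -1, c = 0.5)

train_order <- data.frame(a = 1, b = 2, c = 3)
shuffled    <- train_order[, c("c", "a", "b")]  # same values, different column order

# Apply coefficients by position: column names are ignored
by_position <- function(newdata, beta) {
  as.numeric(as.matrix(newdata) %*% beta)
}

# Apply coefficients by name: columns are reordered to match names(beta) first
by_name <- function(newdata, beta) {
  as.numeric(as.matrix(newdata[, names(beta), drop = FALSE]) %*% beta)
}

by_position(train_order, beta)  # 2*1 - 1*2 + 0.5*3 = 1.5
by_position(shuffled, beta)     # 2*3 - 1*1 + 0.5*2 = 6
by_name(shuffled, beta)         # 1.5 again
```

If glmnet's prediction step behaves like `by_position`, that would explain why the same rows give different predictions depending on column order.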

It seems that similar issues raised for tidymodels (https://github.com/tidymodels/parsnip/issues/273) and glmnet (https://stackoverflow.com/questions/18420586/can-i-do-predict-glmnet-on-test-data-with-different-number-of-predictor-variable) have answered my questions. This issue has been rectified in tidymodels and should be rectified in caret too, as caret is still a popular package for ML in R.

Reproducible example


library(caret)
set.seed(1)
dat <- twoClassSim(100, noiseVars = 5)

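# feature selection that changes order of training features
# note: `1:ncol(dat)-1` parses as `(1:ncol(dat)) - 1`, i.e. 0:(ncol(dat) - 1);
# the last column (Class) is still excluded, but a sampled 0 is silently dropped when indexing
set.seed(1)
featureSelection <- sample(1:ncol(dat)-1, 10)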
X <- dat[,featureSelection]
y <- dat[["Class"]]

# fit glmnet model
set.seed(1)
glm_fit <- train(
  X, y, 
  metric='ROC',
  method = 'glmnet',
  preProcess =  c("center", "scale"),
  trControl=trainControl(
    method="cv", 
    number=5,
    classProbs=TRUE, 
    summaryFunction=twoClassSummary,
    savePredictions="final")
)

# predictions with the full original dataset
table(predict(glm_fit, dat) == y) # FALSE 44 TRUE 56

# predictions with feature selection applied to the full original dataset
table(predict(glm_fit, dat[,featureSelection]) == y) # FALSE 28 TRUE 72

# predictions when feature selection applied but in a different order
table(predict(glm_fit, dat[,sort(featureSelection)]) == y) # FALSE 44 TRUE 56

# same feature selection, but with columns kept in their original (sorted) order
X <- dat[,sort(featureSelection)]

# fit glmnet model
set.seed(1)
glm_fit <- train(
  X, y, 
  metric='ROC',
  method = 'glmnet',
  preProcess =  c("center", "scale"),
  trControl=trainControl(
    method="cv", 
    number=5,
    classProbs=TRUE, 
    summaryFunction=twoClassSummary,
    savePredictions="final")
)

# predictions with the full original dataset
table(predict(glm_fit, dat) == y) # FALSE 28 TRUE 72

# predictions with feature selection applied to the full original dataset
table(predict(glm_fit, dat[,featureSelection]) == y) # FALSE 28 TRUE 72

# predictions when feature selection applied but in a different order
table(predict(glm_fit, dat[,sort(featureSelection)]) == y) # FALSE 28 TRUE 72
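
Until this is handled inside caret, one possible workaround is to reorder the new data by column name before calling predict(). The sketch below uses a hypothetical helper, `align_columns()`, which is not part of caret or glmnet; it assumes the training column names are still available (for example from the data frame that was passed to train()).

```r
# Hypothetical helper (not part of caret): select and reorder the columns of
# `newdata` so they match the training feature names, in training order.
align_columns <- function(newdata, train_cols) {
  missing_cols <- setdiff(train_cols, colnames(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata is missing training features: ",
         paste(missing_cols, collapse = ", "))
  }
  newdata[, train_cols, drop = FALSE]
}

# Small self-contained check with made-up column names
train_cols <- c("x3", "x1", "x2")
newdata <- data.frame(x1 = 1:2, x2 = 3:4, x3 = 5:6, extra = 7:8)
aligned <- align_columns(newdata, train_cols)
colnames(aligned)  # columns now follow train_cols; `extra` is dropped
```

In the example above, this would be called as predict(glm_fit, align_columns(dat, colnames(X))), where X is the data frame that was passed to train() for that fit. I have not verified that this reproduces the 72/100 result in every case.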