Other issues in caret (https://github.com/topepo/caret/issues/968) have highlighted that predict() on glmnet models requires all features used in training to be present in the testing data. However, I have noticed decreased performance on models trained after a change in the order of the training features. The training and testing data come from the same larger dataset, but feature selection that changed the column order was applied to the training data before the glmnet model was fit. I suspect this happens because glmnet selects features by column position, not column name. If so, is it a requirement that the testing data features appear in the same order as the training features when using glmnet in caret?
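If that is right, the failure mode can be shown without caret at all: any position-based application of coefficients silently misaligns when columns are shuffled, while reordering by name restores the result. A minimal base-R sketch (all names here are illustrative, not from caret or glmnet):

```r
# Standalone sketch: position-based coefficient matching breaks under
# column reordering; aligning columns by name fixes it.
set.seed(1)
train_x <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))
beta <- c(a = 1, b = 2, c = 3)  # stand-in for fitted coefficients

# Reference linear predictor on the training column order:
ref <- as.matrix(train_x) %*% beta

# Same data, shuffled columns: %*% pairs coefficients by position,
# so each coefficient lands on the wrong feature.
test_x <- train_x[, c("c", "a", "b")]
wrong <- as.matrix(test_x) %*% beta

# Realigning the test columns to the training names restores the result:
fixed <- as.matrix(test_x[, colnames(train_x)]) %*% beta

isTRUE(all.equal(as.numeric(fixed), as.numeric(ref)))  # TRUE
isTRUE(all.equal(as.numeric(wrong), as.numeric(ref)))  # FALSE
```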
library(caret)
set.seed(1)
dat <- twoClassSim(100, noiseVars = 5)
# feature selection that changes order of training features
set.seed(1)
featureSelection <- sample(1:(ncol(dat) - 1), 10)  # sample 10 predictor columns, excluding Class
X <- dat[,featureSelection]
y <- dat[["Class"]]
# fit glmnet model
set.seed(1)
glm_fit <- train(
  X, y,
  metric = "ROC",
  method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = "final"
  )
)
# predictions with the full original dataset
table(predict(glm_fit, dat) == y) # FALSE 44 TRUE 56
# predictions with feature selection applied to the full original dataset
table(predict(glm_fit, dat[,featureSelection]) == y) # FALSE 28 TRUE 72
# predictions when feature selection applied but in a different order
table(predict(glm_fit, dat[,sort(featureSelection)]) == y) # FALSE 44 TRUE 56
# feature selection in a similar order of the training features
X <- dat[,sort(featureSelection)]
# fit glmnet model
set.seed(1)
glm_fit <- train(
  X, y,
  metric = "ROC",
  method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = "final"
  )
)
# predictions with the full original dataset
table(predict(glm_fit, dat) == y) # FALSE 28 TRUE 72
# predictions with feature selection applied to the full original dataset
table(predict(glm_fit, dat[,featureSelection]) == y) # FALSE 28 TRUE 72
# predictions when feature selection applied but in a different order
table(predict(glm_fit, dat[,sort(featureSelection)]) == y) # FALSE 28 TRUE 72
Similar issues identified in tidymodels (https://github.com/tidymodels/parsnip/issues/273) and in glmnet itself (https://stackoverflow.com/questions/18420586/can-i-do-predict-glmnet-on-test-data-with-different-number-of-predictor-variable) appear to answer my question. This has been fixed in tidymodels, and it should be fixed in caret as well, since caret is still a popular framework for machine learning in R.
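In the meantime, a user-side guard can enforce name-based alignment before any predict() call. A minimal sketch (predict_by_name is a hypothetical helper of my own, not a caret or glmnet API):

```r
# Hypothetical helper: fail loudly if any training feature is missing from
# newdata, then reorder newdata's columns to the training order by name
# before delegating to predict().
predict_by_name <- function(fit, newdata, train_cols) {
  missing_cols <- setdiff(train_cols, colnames(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata is missing features: ", paste(missing_cols, collapse = ", "))
  }
  predict(fit, newdata[, train_cols, drop = FALSE])
}
```

For the example above, the call would be predict_by_name(glm_fit, dat, colnames(X)). With caret's default returnData = TRUE, the fit also keeps a copy of the training data in glm_fit$trainingData (outcome in the .outcome column), so the training column order can be recovered as setdiff(colnames(glm_fit$trainingData), ".outcome").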