topepo / caret

caret (Classification And Regression Training): an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Handling of Weird Variable Names Not Discussed in Book #1298

Open DarioS opened 2 years ago

DarioS commented 2 years ago

I would like to see a paragraph in the book, perhaps in Chapter 18, about how weird (syntactically invalid) variable names should be handled. For example:

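library(caret)
library(mlbench)
data("Sonar")
colnames(Sonar)[1:60] <- paste("(Var", 1:60) # Create strange-looking column names.
# Assumed setup, not shown in the original report (values follow the caret book's model-training chapter).
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
training <- Sonar[ inTraining, ]
gbmFit1 <- train(Class ~ ., data = training, method = "gbm", trControl = fitControl)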
head(gbmFit1$finalModel$var.names)
    "`(Var 1`" "`(Var 2`" "`(Var 3`" "`(Var 4`" "`(Var 5`" "`(Var 6`"

# Error: the backtick-quoted names stored in the final model do not match the column names in Sonar.
predict(gbmFit1$finalModel, Sonar)
    Using 150 trees...
    Error in object$var.levels[[i]] : subscript out of bounds

In a real bioinformatics data set, I have names like Cer(d16:1/20:0) and Sph(d18:2).
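
One possible workaround, offered only as a sketch and not as something the caret book or this issue prescribes, is to sanitize the column names before training and keep a lookup table so results can be reported under the original names. The example uses base R's `make.names()`; the toy data frame `dat` and the variables `orig_names`, `safe_names`, and `name_map` are hypothetical names introduced for illustration.

```r
# Hypothetical lipid-style column names, modelled on the ones mentioned above.
orig_names <- c("Cer(d16:1/20:0)", "Sph(d18:2)", "(Var 1")

# make.names() (base R) turns each string into a syntactically valid name;
# unique = TRUE guards against two names collapsing to the same string.
safe_names <- make.names(orig_names, unique = TRUE)

# Lookup table: sanitized name -> original name.
name_map <- setNames(orig_names, safe_names)

# Toy data frame standing in for the real data set (4 rows, 3 columns).
dat <- as.data.frame(matrix(rnorm(12), nrow = 4))
colnames(dat) <- safe_names   # rename before any model fitting

# After modelling, translate column names back to the original labels.
unname(name_map[colnames(dat)])
```

Keeping the mapping separate from the data means the model only ever sees valid names, while plots and tables can still show the original identifiers.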

DarioS commented 2 years ago

For comparison, mlr3 refuses to create a classification task unless the column names are syntactically valid.

> library(mlr3)
> colnames(iris) <- gsub("\\.", ' ', colnames(iris))
> head(iris)
  Sepal Length Sepal Width Petal Length Petal Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> task <- as_task_classif(iris, "Species", id = "irises")
Error in .__Task__initialize(self = self, private = private, super = super,  : 
  Assertion on 'column names' failed: Must have names according to R's variable naming conventions, but element 1 does not comply.

It is a take-no-prisoners approach.
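
A similar up-front check can be sketched in base R for use before any modelling call. This is a minimal illustration, assuming that "valid" means a name is left unchanged by `make.names()`; the helper `assert_syntactic_names()` is hypothetical and is not part of caret or mlr3.

```r
# Hypothetical helper: stop early if any column name is not syntactically valid.
assert_syntactic_names <- function(df) {
  nms <- colnames(df)
  # A name is treated as valid if make.names() leaves it unchanged.
  bad <- nms[nms != make.names(nms)]
  if (length(bad) > 0) {
    stop("Non-syntactic column names: ",
         paste0("'", bad, "'", collapse = ", "))
  }
  invisible(df)
}

# Usage: reproduce the renamed iris columns from the mlr3 example above.
iris2 <- iris
colnames(iris2) <- gsub("\\.", " ", colnames(iris2))
result <- tryCatch(
  assert_syntactic_names(iris2),
  error = function(e) conditionMessage(e)
)
print(result)
```

Unlike the mlr3 assertion, which reports only the first offending element, this sketch lists every offending name in one message, which may be more convenient when many columns need renaming.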