caret rpart doesn't work as rpart::rpart() #1057

Closed

fahadshery commented 5 years ago


I successfully created an rpart model by:

library(caret)  # createDataPartition(), downSample(), train()
library(rpart)  # rpart()

inTraining2 <- createDataPartition(complaints_4_trees$COMPLAINT_TYPE_SIMPLIFIED,
                                   p = 0.8, list = FALSE, times = 1)
train2 <- complaints_4_trees[inTraining2, ]
test2 <- complaints_4_trees[-inTraining2, ]

down_tree_new <- downSample(train2, train2$COMPLAINT_TYPE_SIMPLIFIED)

fitTree3 <- rpart(COMPLAINT_TYPE_SIMPLIFIED ~ ., data = down_tree_new,
                 method = "class")


fitTree3_predicted <- predict(fitTree3, test2, type = "class")


I want to do the same using train(), but I am running into various problems. Here is how I am trying to build the rpart model with train():

## caret was not happy with the factor levels, so I simplified them further:

down_tree_new <- down_tree_new %>%
  mutate(COMPLAINT_TYPE_SIMPLIFIED = fct_recode(COMPLAINT_TYPE_SIMPLIFIED,
                                                "C4"   = "Non ELC/HLC/MP (C4)",
                                                "Exec" = "Exec Level"))

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                   classProbs = TRUE, summaryFunction = multiClassSummary)

caret_down_fit1 <- train(COMPLAINT_TYPE_SIMPLIFIED ~ ., data = down_tree_new,
                          method = "rpart",
                          na.action = na.pass,
                          trControl = ctrl)


pred <- predict(caret_down_fit1$finalModel, newdata = test2)

This gives the following error:

Error in eval(predvars, data, env) : object 'TOT_CONTCT_FOR_COMPLNT_28Dbin1' not found

However, this error goes away if I do:

pred <- predict(caret_down_fit1, newdata = test2)

Then it doesn't predict all the rows in test2:

confusionMatrix(pred, test2$COMPLAINT_TYPE_SIMPLIFIED)

Gives the following error:

Error in confusionMatrix.default(pred, test2$COMPLAINT_TYPE_SIMPLIFIED) : The data contain levels not found in the data.

Here is the data (I couldn't upload the .RData file, so I saved it in .txt format):

I am new to ML so apologies in advance if I am doing something stupid :)

topepo commented 5 years ago

No worries!

Some things that might not have been obvious that I would try:

I think that this second point is the issue. train.formula made your model using dummy variables (like the column TOT_CONTCT_FOR_COMPLNT_28Dbin1) but the data frame test2 only has a column TOT_CONTCT_FOR_COMPLNT_28D. Try using the predict function without specifying the finalModel element.
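
To make the dummy-variable point concrete, here is a minimal sketch on made-up data (the data frame toy and its columns y, bin, and x are hypothetical stand-ins, not the complaints data). It assumes caret and rpart are installed. It fits rpart through the formula interface of train(), then compares predicting through the train object with predicting through finalModel:

```r
library(caret)  # assumes caret and rpart are installed

set.seed(1)

# Hypothetical toy data: one factor predictor (bin) and one numeric predictor (x)
toy <- data.frame(
  y   = factor(rep(c("a", "b"), each = 50)),
  bin = factor(sample(c("0", "1"), 100, replace = TRUE)),
  x   = c(rnorm(50, mean = 0), rnorm(50, mean = 2))
)

fit <- train(y ~ ., data = toy, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# Predictor names as the underlying rpart fit sees them; with the formula
# interface the factor `bin` should show up as a dummy column such as `bin1`
print(attr(fit$finalModel$terms, "term.labels"))

# Predicting through the train object applies the same dummy-variable
# expansion to newdata before calling rpart
pred <- predict(fit, newdata = toy)

# Predicting through finalModel skips that expansion, so rpart looks for the
# dummy column in newdata and should fail; try() keeps the script running
res <- try(predict(fit$finalModel, newdata = toy), silent = TRUE)
print(inherits(res, "try-error"))
```

If the last line prints TRUE, that is the same kind of failure as the 'TOT_CONTCT_FOR_COMPLNT_28Dbin1' not found error above: the finalModel object only knows the expanded column names, while predict() on the train object handles the conversion itself.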