topepo / caret

caret (Classification And Regression Training): an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Error "Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures." #1124

Closed fernandafalves closed 3 years ago

fernandafalves commented 4 years ago

I'm having a problem using k-fold cross-validation with the random forest method in the caret package. Initially, one of the outputs was the error "Error in randomForest.default(x, y, mtry = param$mtry, ...) : Need at least two classes to do classification." However, I already had two classes for the classification, which are "Normal" and "Failure". When I posted this question at https://datascience.stackexchange.com/questions/69660/recommendations-for-statistical-models-given-my-dataset/69686#69686 one of the recommendations was to use stratified k-fold cross-validation, given that my dataset has many more "Normal" cases than "Failure" cases. However, after implementing that method, the message "Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures." appears.

Could someone help me?

The R script:

library(caret)
library(randomForest)

data_failures <- read.csv('OUTPUT.csv', header = TRUE, sep = ",", stringsAsFactors = TRUE)

folds <- 10  # number of cross-validation folds (was undefined in the original script)

# createFolds() stratifies by class, so each fold keeps the class proportions
cvIndex <- createFolds(factor(data_failures$Period_12), k = folds, returnTrain = TRUE)

tc <- trainControl(index = cvIndex, method = 'cv', number = folds)

model <- train(Period_12 ~ ., data = data_failures, method = "rf", trControl = tc)

print(model)

The output:

Random Forest

112 samples
 11 predictor
  2 classes: 'Failure', 'Normal'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 101, 101, 100, 100, 101, 101, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
   2    0.9750000   0.00000000
   6    0.9750000   0.00000000
  11    0.9666667  -0.03030303

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

A sample of the data:

    Period_1 Period_2 Period_3 Period_4 Period_5 Period_6 Period_7 Period_8
1     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
2     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
3     Normal   Normal  Failure   Normal   Normal   Normal   Normal   Normal
4     Normal  Failure   Normal   Normal   Normal   Normal   Normal  Failure
5     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
6     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
7     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
8     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
9     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
10    Normal  Failure   Normal   Normal   Normal   Normal   Normal   Normal
PierreAmiel commented 3 years ago

Hello, did you find a solution to this problem or a way to work around it? I am currently encountering the same error message when using train() with method = "AdaBoost.M1". Thanks in advance for your answer.

topepo commented 3 years ago

In these cases, there is no error (just a warning).

There were missing values in resampled performance measures

This is almost always because some tuning parameter combination produced predictions that are constant for all samples. train() tries to compute the R^2 and, since it needs a non-zero variance, it produces an NA for that statistic.
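A minimal illustration of that failure mode in plain R, with hypothetical numbers: the R^2 that caret reports by default is based on the correlation between predictions and observations, and cor() is undefined when one of the vectors has zero variance.

```r
obs  <- c(1.2, 3.4, 2.1, 5.6, 4.3)  # hypothetical observed values
pred <- rep(3.0, length(obs))       # a model that predicts the same value everywhere

# cor() needs non-zero variance in both vectors; here sd(pred) == 0,
# so the correlation (and therefore R^2) comes back as NA with a warning
r2 <- suppressWarnings(cor(obs, pred)^2)
print(r2)  # NA
```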

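For the classification case in the original report, the missing values can also come from the resamples themselves: with only a handful of "Failure" rows, even a stratified split can produce held-out folds that contain a single class, and statistics such as Kappa cannot be estimated there. A plain-R sketch of the arithmetic, using hypothetical class counts that mirror the 112-row dataset:

```r
set.seed(42)
# Hypothetical imbalance similar to the reported data: 105 Normal, 7 Failure
y <- factor(c(rep("Normal", 105), rep("Failure", 7)))
folds <- 10

# What stratified fold assignment does: deal out fold ids within each class
fold_id <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  fold_id[idx] <- sample(rep_len(seq_len(folds), length(idx)))
}

# With only 7 Failure cases spread over 10 folds, at least 3 held-out
# folds contain no Failure at all, so per-fold statistics come back NA
table(fold_id[y == "Failure"])
```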
Rajeev-Bhattarai commented 1 year ago

I am also having the same problem when doing regression with XGBoost. I am trying to compare its accuracy with other models such as RF and SVM. RF and SVM produce R^2 values while XGBoost does not. Is it considered scientific to leave R^2 as NA, or is there a way around it? In addition, I am somewhat confused to see some models working while others aren't. Thank you in advance.