topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

SVM with classProbs=TRUE leads to different results #1172

Closed statistics88 closed 3 years ago

statistics88 commented 4 years ago

Minimal, reproducible example:

Minimal dataset:

## Data link : https://www.kaggle.com/uciml/pima-indians-diabetes-database

diabetes <- read.csv("C:/Users/Downloads/228_482_bundle_archive/diabetes.csv")

The results with and without classProbs=TRUE are not the same.

Minimal, runnable code:

library(caret)

set.seed(123)
fitControl1 <- trainControl(method = "LOOCV", savePredictions = TRUE, search = "random")
diabetes$Outcome <- factor(diabetes$Outcome)
set.seed(123)
modelFitlassocvintm1 <- train((Outcome) ~ Pregnancies+BloodPressure+Glucose +
                                BMI+DiabetesPedigreeFunction +Age
                              , data=diabetes, 
                              method = "svmRadialSigma", 
                              trControl = fitControl1,
                              preProcess = c("center", "scale"),
                              tuneGrid=expand.grid(
                                .sigma=0.004930389,
                                .C=9.63979626))

set.seed(123)
fitControl2 <- trainControl(method = "LOOCV", savePredictions = TRUE, classProbs = TRUE)

set.seed(123)
modelFitlassocvintm2 <- train(make.names(Outcome) ~ Pregnancies+BloodPressure+Glucose +
                                BMI+DiabetesPedigreeFunction +Age
                              , data=diabetes, 
                              method = "svmRadialSigma", 
                              trControl = fitControl2,
                              preProcess = c("center", "scale"),
                              tuneGrid=expand.grid(
                                .sigma=0.004930389,
                                .C=9.63979626))

table(modelFitlassocvintm2$pred$X1 >0.5,modelFitlassocvintm1$pred$pred)

          0   1
  FALSE 560   0
  TRUE    8 200

subs1=cbind(modelFitlassocvintm2$pred$X1,modelFitlassocvintm2$pred$pred,modelFitlassocvintm1$pred$pred)
subset(subs1,subs1[,2]!=subs1[,3])
          [,1] [,2] [,3]
[1,] 0.5078631    2    1
[2,] 0.5056252    2    1
[3,] 0.5113336    2    1
[4,] 0.5048708    2    1
[5,] 0.5033003    2    1
[6,] 0.5014327    2    1
[7,] 0.5111975    2    1
[8,] 0.5136453    2    1

The predicted classes differ for 8 rows, all of which have a predicted probability close to 0.5 (between roughly 0.501 and 0.514).

khushibalyan220101 commented 4 years ago

Can I contribute here?

statistics88 commented 4 years ago

> Can I contribute here?

Sorry for the delay. Yes, you can.

topepo commented 3 years ago

SVM models do not naturally produce class probabilities. The decision values are used (pairwise by class) in a secondary model to convert them to class probabilities (see Platt scaling). The kernlab package does this internally, and the results can vary between runs (setting the seed has no effect 😞).
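To make the idea of a secondary model concrete, here is a minimal sketch of Platt-style scaling: a logistic regression fitted on the SVM decision values. It is an illustration only, not kernlab's internal procedure (per its documentation, kernlab fits the probability model on decision values from a 3-fold cross-validation of the training data). The two-class iris subset, the sigma and C values, and the names dat, dec, y, platt and p are assumptions chosen for the example.

```r
library(kernlab)

# Illustrative data only: two classes of iris, not the diabetes data above.
dat <- droplevels(subset(iris, Species != "setosa"))

# Plain SVM with no probability model; sigma and C are arbitrary here.
fit <- ksvm(Species ~ ., data = dat, kernel = "rbfdot",
            kpar = list(sigma = 0.05), C = 1)

# Decision values for each row (an n x 1 matrix for a two-class problem).
dec <- as.numeric(predict(fit, dat, type = "decision"))

# Platt-style step: logistic regression of the class indicator on the decision value.
y <- as.integer(dat$Species == levels(dat$Species)[2])
platt <- glm(y ~ dec, family = binomial(), data = data.frame(y = y, dec = dec))

# Estimated probability of the second class for each row.
p <- predict(platt, type = "response")
summary(p)
```

Because the secondary model is fitted separately from the SVM, its 0.5 cut-off does not have to coincide exactly with a decision value of zero.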

When class probabilities are requested, we compute the class probabilities then derive the predicted class (see getModelInfo("svmRadialSigma")[[1]]$predict). This may differ from the hard class predictions that you get when no probabilities are requested.
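The two prediction routes described above can be compared directly in kernlab, outside of caret. The following is a rough sketch under stated assumptions: it uses a two-class subset of iris rather than the diabetes data, arbitrary sigma and C values, and illustrative names (dat, hard, prob, from_prob, disagree). The number of disagreements may well be zero on this data and can change from run to run.

```r
library(kernlab)

# Illustrative data only: two classes of iris, not the diabetes data above.
dat <- droplevels(subset(iris, Species != "setosa"))

# prob.model = TRUE asks kernlab to fit the secondary probability model
# in addition to the SVM itself.
fit <- ksvm(Species ~ ., data = dat, kernel = "rbfdot",
            kpar = list(sigma = 0.05), C = 1, prob.model = TRUE)

# Route 1: hard class predictions taken directly from the SVM.
hard <- predict(fit, dat, type = "response")

# Route 2: class probabilities first, then the class with the largest probability.
prob <- predict(fit, dat, type = "probabilities")
from_prob <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                    levels = levels(hard))

# Rows where the two routes disagree, with their probabilities.
disagree <- which(hard != from_prob)
cbind(prob, hard = as.integer(hard),
      from_prob = as.integer(from_prob))[disagree, , drop = FALSE]
```

If any rows are listed, checking whether their probabilities sit near 0.5 is a quick way to see whether the pattern in the original report is being reproduced.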

tl;dr It happens in kernlab and we have no control over it.