topepo / C5.0

An R package for fitting Quinlan's C5.0 classification model
https://topepo.github.io/C5.0/

Prediction on training data does not match with summary output #51

Open zanocom opened 1 year ago

zanocom commented 1 year ago

Hi all, I get a strange result when I compare the output of summary() for a model with the output of predict() on the same training set.

The same units are not classified in the same way, so I end up with two different confusion matrices. The issue arises with trials > 1 and at least 3 variables in the training set. I would expect the same results, but maybe I have misunderstood the inner workings of the algorithm.

I use R 4.1.1 and package C50 0.1.8

This code reproduces the issue using the credit_data dataset:

##################################################################################

library(modeldata)
data(credit_data)

vars <- c("Home", "Seniority", "Job")

# a simple split
set.seed(2411)
in_train <- sample(1:nrow(credit_data), size = 3000)
train_data_example <- credit_data[ in_train,]
test_data_example  <- credit_data[-in_train,]

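library(C50)
library(tibble)     # provides tibble(), which is not attached by yardstick
library(yardstick)

# Boosted model with 10 trials; the seed is set through C5.0Control()
tree_mod <- C5.0(x = train_data_example[, vars],
                 y = train_data_example$Status,
                 trials = 10,
                 control = C5.0Control(seed = 65))

# Training-set confusion matrix as reported by C5.0
summary(tree_mod)

# Training-set confusion matrix rebuilt from predict()
prediction_df_train <- tibble(value = train_data_example$Status,
                              predict = predict(tree_mod, newdata = train_data_example[, vars]))

conf_mat(prediction_df_train, truth = value, estimate = predict)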

The confusion matrix in summary(tree_mod) is different from the confusion matrix built from predict().

##################################################################################
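
Since the mismatch only seems to show up with trials > 1, one possible way to narrow it down is to predict on the training set with a single boosting iteration and with all retained iterations, and compare each table with the one printed by summary(). The sketch below is only a diagnostic idea, not a fix. It assumes the `trials` argument of predict() and the `trials` element of the fitted object behave as described in the C50 documentation; the names `fit`, `train_x`, `train_y`, `pred_single` and `pred_boost` are introduced here for illustration.

```r
library(C50)
library(modeldata)
data(credit_data)

# Same predictors and split as in the reproduction above
vars <- c("Home", "Seniority", "Job")
set.seed(2411)
in_train <- sample(seq_len(nrow(credit_data)), size = 3000)
train_x <- credit_data[in_train, vars]
train_y <- credit_data$Status[in_train]

fit <- C5.0(x = train_x, y = train_y, trials = 10,
            control = C5.0Control(seed = 65))

# Requested vs. actually retained boosting iterations
# (C5.0 may stop boosting early, so "Actual" can be smaller than 10)
print(fit$trials)

# Class predictions from the first tree only and from all retained iterations
pred_single <- predict(fit, newdata = train_x, trials = 1)
pred_boost  <- predict(fit, newdata = train_x, trials = fit$trials["Actual"])

# Confusion tables to compare by eye with the one printed by summary(fit)
print(table(truth = train_y, single_tree = pred_single))
print(table(truth = train_y, boosted = pred_boost))
```

If the single-tree table agrees with the summary output for trial 0 while the boosted table does not agree with the boosted summary, that would suggest the difference comes from how the boosted votes are combined at prediction time rather than from the individual trees.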

Thank you, Massimo