topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

The ordering of the train() y factor levels adversely impacts XGBoost performance. #1143

Open JustGitting opened 4 years ago

JustGitting commented 4 years ago

I am working on a tricky classification problem that has unbalanced classes (10 to 1 ratio), with around 200,000 instances and approximately 10 features. I managed to get reasonable performance using XGBoost's xgb.train(), but wanted to do a systematic parameter grid search with CV. So I opted to use Caret to do the heavy lifting.

However, I was getting very poor results using the same data and even selecting the exact same parameters. I was stumped for a few days as I could not see why there was a major discrepancy.

I realised it was due to how the original numerical target values are converted to factors for caret. xgb.train() accepts a numerical y via the xgb.DMatrix() function and can be trained for regression or classification by setting the appropriate objective parameter, whereas caret::train() trains the model for classification by accepting y as a factor.

The problem arises due to the ordering of the y factor levels, e.g. levels 'pos', 'neg' vs levels 'neg', 'pos'. A sketch of the two calling conventions is below.
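To make the contrast concrete, here is a minimal sketch of the two calling conventions. The feature matrix X and 0/1 label vector y01 are placeholders, and the parameter values are illustrative only.

library(xgboost)
library(caret)

# xgboost: y is a numeric 0/1 vector and the objective decides the task.
dtrain <- xgb.DMatrix(data = X, label = y01)
fit_xgb <- xgb.train(params = list(objective = "binary:logistic"),
                     data = dtrain, nrounds = 25)

# caret: y must be a factor, and the order of its levels matters.
y_fac <- factor(y01, levels = c(1, 0), labels = c("positive", "negative"))
fit_caret <- train(x = as.data.frame(X), y = y_fac,
                   method = "xgbTree", objective = "binary:logistic")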

Below is an example using the yeast4 data set from the imbalance package, as there needs to be a sufficient number of instances and a large enough class imbalance for the problem to become apparent.

library(caret)
library(dplyr)
library(xgboost)
library(ggplot2)
library(imbalance) # for yeast4 dataset.

summary(yeast4$Class)
# negative positive 
# 1433       51

pos_class_str <- 'positive' 
neg_class_str <- 'negative'

str(yeast4$Class)
# Factor w/ 2 levels "negative","positive": 1 1 1 1 1 1 1 1 1 1 ...

Re-level the Class factor and add a Class_revLvls column that has the levels reversed.

data <- yeast4 %>%
  mutate( Class_int = if_else(Class == pos_class_str, 1L, 0L),
          Class = factor(x = Class_int, levels = c('1', '0'), labels = c( pos_class_str, neg_class_str)), # Reorder Class factor.
          Class_revLvls = factor(x = Class_int, levels = c('0', '1'), labels = c( neg_class_str, pos_class_str)),
          Erl = if_else(Erl == '1', 1L, 0L))

# Levels of the two targets are reversed to each other.
str(data$Class)
# Factor w/ 2 levels "positive","negative": 2 2 2 2 2 2 2 2 2 2 ...
str(data$Class_revLvls)
# Factor w/ 2 levels "negative","positive": 1 1 1 1 1 1 1 1 1 1 ...

Create the training and test sets.


set.seed(100)

# Split data for classification problems with Y as a factor.
msk <- caTools::sample.split(data$Class, SplitRatio=0.8)

# use output of sample.split to create train and test subsets
data_train <- data[msk,]
data_test <- data[!msk,]

# Calculate scale_pos_weight for xgboost to better handle unbalanced data.
class_count <- table(data_train$Class)
scale_pos_weight <- class_count[neg_class_str] / class_count[pos_class_str]
# 27.95122 
train.YClass <- data_train$Class
train.Y <- data_train$Class_int
train.X <- data_train %>% select(-Class, -Class_int, -Class_revLvls)

test.YClass <- data_test$Class
test.Y <- data_test$Class_int
test.X <- data_test %>% select(-Class, -Class_int, -Class_revLvls)

n_folds <- 10 #cross-validation

After a grid search, I found the following parameters for XGBoost to produce good results (YMMV). We also specify the caret training control.

# Final parameters
xgb.grid <- expand.grid(nrounds = 25, 
                        max_depth = 5,
                        eta = 0.01,
                        gamma = 0,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 0.75
)

# 10-fold CV
ctrl <- trainControl( method = "cv",
                      number = n_folds,
                      summaryFunction = twoClassSummary,
                      classProbs = T,
                      savePredictions = 'final', 
                      returnResamp = 'final', 
                      verboseIter = T,
                      allowParallel = F # disable parallel processing for reproducibility.
)

First we'll train using y with Factor w/ 2 levels "positive","negative" ordering.

set.seed(seed = 1234)
caret_train <- train( x = train.X,
                      y = train.YClass,
                      preProcess = NULL,
                      method="xgbTree",
                      objective = 'binary:logistic',
                      eval_metric = "auc",
                      scale_pos_weight = scale_pos_weight,
                      metric='ROC',
                      nthread = 1,
                      maximize = T,
                      trControl = ctrl,
                      tuneGrid = xgb.grid )

Next we'll train using y with the level order reversed, i.e. Factor w/ 2 levels "negative","positive".

set.seed(seed = 1234)
caret_train_revLvls <- train( x = train.X,
                              y = data_train$Class_revLvls,
                              preProcess = NULL,
                              method="xgbTree",
                              objective = 'binary:logistic',
                              eval_metric = "auc",
                              scale_pos_weight = scale_pos_weight,
                              metric='ROC',
                              nthread = 1,
                              maximize = T,
                              # metric = "AUC", # using prSummary.
                              trControl = ctrl,
                              tuneGrid = xgb.grid)

Below is the performance comparison between the two runs.

caret_train$results %>% knitr::kable()
| nrounds| max_depth|  eta| gamma| colsample_bytree| min_child_weight| subsample|       ROC|  Sens|      Spec|     ROCSD|    SensSD|    SpecSD|
|-------:|---------:|----:|-----:|----------------:|----------------:|---------:|---------:|-----:|---------:|---------:|---------:|---------:|
|      25|         5| 0.01|     0|                1|                1|      0.75| 0.9082031| 0.505| 0.9197483| 0.0379994| 0.2692067| 0.0321794|

caret_train_revLvls$results %>% knitr::kable()
| nrounds| max_depth|  eta| gamma| colsample_bytree| min_child_weight| subsample|       ROC| Sens| Spec|     ROCSD| SensSD| SpecSD|
|-------:|---------:|----:|-----:|----------------:|----------------:|---------:|---------:|----:|----:|---------:|------:|------:|
|      25|         5| 0.01|     0|                1|                1|      0.75| 0.5324518|    1|    0| 0.0542748|      0|      0|

It is clear that the Sensitivity and Specificity values are swapped compared to the original caret-trained XGBoost model. In addition, the SDs are zero, i.e. there is no variation in performance across the 10 folds.

Next we can plot the Sensitivity vs Specificity over a range of decision thresholds to further illustrate the problem.

alphas <- seq(0.1,1,by=0.01)
test_pred <- cbind( pred = predict.train(caret_train, newdata = data_test, type = 'raw'),
                    obs = test.YClass,
                    predict.train(caret_train, newdata = data_test, type = 'prob'))

confusionmatrix_list <- lapply(alphas, function(x) caret::confusionMatrix( factor(if_else(test_pred[,pos_class_str] > x, 1, 0),
                                                                                      levels = c(1, 0), labels = c(pos_class_str, neg_class_str)),
                                                                               reference = data_test$Class,
                                                                               positive = pos_class_str))

test_pred_revLvls <- cbind( pred = predict.train(caret_train_revLvls, newdata = data_test, type = 'raw'),
                    obs = data_test$Class_revLvls,
                    predict.train(caret_train_revLvls, newdata = data_test, type = 'prob'))
confusionmatrix_revLvls <- lapply(alphas, function(x) caret::confusionMatrix( factor(if_else(test_pred_revLvls[,pos_class_str] > x, 1, 0),
                                                                                     levels = c(0, 1), labels = c(neg_class_str, pos_class_str)),
                                                                           reference = data_test$Class_revLvls,
                                                                           positive = pos_class_str))

head(test_pred) %>% knitr::kable()
# |pred     |obs      |  positive|  negative|
# |:--------|:--------|---------:|---------:|
# |negative |negative | 0.3897207| 0.6102793|
# |negative |negative | 0.3897207| 0.6102793|
# |negative |negative | 0.3897207| 0.6102793|
# |negative |negative | 0.3897207| 0.6102793|
# |negative |negative | 0.3897207| 0.6102793|
# |negative |negative | 0.3897207| 0.6102793|

head(test_pred_revLvls) %>% knitr::kable()
# |pred     |obs      |  negative|  positive|
# |:--------|:--------|---------:|---------:|
# |negative |negative | 0.6106911| 0.3893089|
# |negative |negative | 0.6106911| 0.3893089|
# |negative |negative | 0.6106911| 0.3893089|
# |negative |negative | 0.6106911| 0.3893089|
# |negative |negative | 0.6106911| 0.3893089|
# |negative |negative | 0.6106911| 0.3893089|

names(confusionmatrix_list) <- alphas
names(confusionmatrix_revLvls) <- alphas
confusionmatrix_summary <- data.frame()

# Extract CM stats for each decision threshold used.
for( alpha_str in names(confusionmatrix_list)){
  alpha <- as.numeric(alpha_str)

  cm <- confusionmatrix_list[[alpha_str]]
  cm_revLvls <- confusionmatrix_revLvls[[alpha_str]]

  confusionmatrix_summary <- rbind(confusionmatrix_summary, 
                                   cbind(alpha, model = 'Original', as.data.frame(t(cm$overall)), as.data.frame(t(cm$byClass))),
                                   cbind(alpha, model = 'Rev Lvls', as.data.frame(t(cm_revLvls$overall)), as.data.frame(t(cm_revLvls$byClass))))

}

g_sens_vs_spec <- confusionmatrix_summary %>% 
  ggplot(aes(group = model, colour = model)) +
  geom_point(aes(x = alpha, y = Sensitivity, shape = 'Sensitivity'), size = 5) +
  geom_line(aes(x = alpha, y = Sensitivity)) +
  geom_point(aes(x = alpha, y = Specificity, shape = 'Specificity'), size = 5) +
  geom_line(aes(x = alpha, y = Specificity)) +
  labs(title = paste0('Sensitivity vs Specificity'), x = 'alpha (decision threshold)', y = 'Score', colour = 'Model', shape = 'Metric' )

g_sens_vs_spec


The plot shows the original caret-trained XGBoost model with Sens/Spec forming the usual crossover pattern. In contrast, the Sens/Spec plot for the caret_train_revLvls model has a hard crossover, which is not expected.

I guess this problem occurs because caret converts the y factor to numerical values assuming the first level is the positive class. However, it only becomes apparent on unbalanced data sets. A sketch of the suspected conversion is below.
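The following is only an illustration of that guess, not caret's actual internals: if the first factor level is mapped to 1 and the second to 0, reversing the levels flips which class the model treats as positive.

# Suspected mapping (an assumption, shown only to illustrate why level order matters):
# first level -> 1, second level -> 0.
as_label <- function(f) ifelse(f == levels(f)[1], 1L, 0L)

table(as_label(data$Class),         data$Class)         # "positive" coded as 1
table(as_label(data$Class_revLvls), data$Class_revLvls) # "negative" coded as 1 -- flipped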

It would be great if someone could verify this behaviour, or confirm that I'm imagining it.

A simple workaround is to ensure the first level of the y factor is the positive class, as illustrated in the above example and in the sketch below.
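For example, the re-ordering can be done explicitly before calling train(); a minimal sketch, assuming the 'positive'/'negative' labels used above:

# Workaround: make the positive (minority) class the first factor level
# before passing y to caret::train().
y <- relevel(data$Class_revLvls, ref = 'positive')
levels(y)
# [1] "positive" "negative"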

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Gentoo/Linux

Matrix products: default
BLAS:   /usr/lib64/blas/openblas/libblas.so.3
LAPACK: /usr/lib64/libopenblas_haswellp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=C               LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] imbalance_1.0.2.1 xgboost_1.0.0.2   dplyr_0.8.5       caret_6.0-86      ggplot2_3.3.0     lattice_0.20-41  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0     xfun_0.13            purrr_0.3.4          reshape2_1.4.4       splines_3.6.3        colorspace_1.4-1     vctrs_0.3.0          generics_0.0.2      
 [9] stats4_3.6.3         survival_3.1-12      prodlim_2019.11.13   rlang_0.4.6          ModelMetrics_1.2.2.2 e1071_1.7-3          pillar_1.4.4         glue_1.4.0          
[17] withr_2.2.0          smotefamily_1.3.1    foreach_1.5.0        lifecycle_0.2.0      plyr_1.8.6           lava_1.6.7           stringr_1.4.0        timeDate_3043.102   
[25] munsell_0.5.0        gtable_0.3.0         recipes_0.1.12       caTools_1.18.0       codetools_0.2-16     labeling_0.3         knitr_1.28           class_7.3-17        
[33] highr_0.8            Rcpp_1.0.4.6         scales_1.1.1         ipred_0.9-9          farver_2.0.3         digest_0.6.25        packrat_0.5.0        stringi_1.4.6       
[41] grid_3.6.3           tools_3.6.3          bitops_1.0-6         magrittr_1.5         tibble_3.0.1         crayon_1.3.4         pkgconfig_2.0.3      MASS_7.3-51.6       
[49] ellipsis_0.3.0       Matrix_1.2-18        data.table_1.12.8    pROC_1.16.2          lubridate_1.7.8      gower_0.2.1          assertthat_0.2.1     rstudioapi_0.11     
[57] iterators_1.0.12     R6_2.4.1             rpart_4.1-15         nnet_7.3-14          nlme_3.1-147         compiler_3.6.3      
davidhaan commented 2 years ago

While I did not try to reproduce your bug, I have run into problems using xgboost and caret. Instead, I wrote my own wrappers around XGBoost and avoid caret altogether.

JustGitting commented 2 years ago

Thank you @davidhaan

Do you mind sharing your wrapper or any tips?

efh0888 commented 2 years ago

@JustGitting I've just run into what I think is the same issue. I'm working on a credit risk model (similarly imbalanced at about 10:1) and found the cross-validated AUROC provided by caret to be substantially worse than the test set performance (on which I calculate AUROC myself and know the calculation to be reliable).

I suppose the best way for me to test if it's the same problem is to fit the XGB model again w/ the tuned parameters on the response w/ the levels reversed?

Update: I did the test as described, and sure enough, once the levels of the response were reversed, the ROC is now in line w/ what I'm seeing in the test set. Great news!
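For anyone wanting to run the same check, a minimal sketch of the reversal test, reusing the objects defined earlier in this thread (train.X, train.YClass, ctrl, xgb.grid):

# Refit with the response levels reversed and compare the resampled metrics.
y_flipped <- factor(train.YClass, levels = rev(levels(train.YClass)))

set.seed(1234)
refit <- train(x = train.X, y = y_flipped,
               method = "xgbTree", metric = "ROC",
               trControl = ctrl, tuneGrid = xgb.grid)
refit$results[, c("ROC", "Sens", "Spec")]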