topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Locale ruins decimal in tuneGrid for xgbTree #1251

Open g1o opened 2 years ago

g1o commented 2 years ago

I ran into a very strange problem. Training an xgbTree model worked in a clean session, but when I ran it again some time later, it failed.

It turns out that something had changed my locale, so the eta parameter was read as "0,1" (my locale's format) instead of "0.1". With eta = 1, which has no decimal part, it worked. Setting the LC_NUMERIC locale to 'C' (Sys.setlocale("LC_NUMERIC", 'C')) solved the problem, because R then uses a dot as the decimal separator.
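A quick way to check for this condition before training is to ask R which decimal separator the current locale uses. The following is a minimal sketch in base R; `numeric_locale_is_safe` is a hypothetical helper name, not something provided by caret or xgboost.

```r
# Hypothetical helper (not part of caret or xgboost): returns TRUE when the
# current numeric locale uses "." as its decimal separator.
numeric_locale_is_safe <- function() {
  decimal_point <- unname(Sys.localeconv()["decimal_point"])
  identical(decimal_point, ".")
}

# Warn early instead of discovering the problem through failed model fits.
if (!numeric_locale_is_safe()) {
  warning(
    "LC_NUMERIC is '", Sys.getlocale("LC_NUMERIC"),
    "'; decimals may be rendered with a comma."
  )
}
```

Calling this at the top of a training script gives an early signal if some other package or function has changed the locale during the session.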

### CODE

library(caret)
set.seed(1)
dat <- twoClassSim(100)

# Switch to a locale that uses a comma as the decimal separator.
# (Requires pt_BR.UTF-8 to be installed on the system.)
Sys.setlocale("LC_NUMERIC", "pt_BR.UTF-8")

# Tuning grid; eta is the only parameter with a decimal part.
egrid <-
  expand.grid(
    nrounds = c(100, 200, 500),
    max_depth = c(4, 10),
    colsample_bytree = 1,
    eta = 1 / 10,
    gamma = 1,
    min_child_weight = 1,
    subsample = 1
  )

control <-
  trainControl(
    method = "cv",
    number = 2,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = FALSE,
    preProcOptions = NULL
  )

xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  )

Something is wrong; all the ROC metric values are missing:
      ROC           Sens          Spec    
 Min.   : NA   Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA   Max.   : NA  
 NA's   :6     NA's   :6     NA's   :6    
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'

2: model fit failed for Fold1: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'

3: model fit failed for Fold2: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'

4: model fit failed for Fold2: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'

5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

### FIX

# Restore the C numeric locale so decimals are written with a dot.
Sys.setlocale("LC_NUMERIC", "C")

xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  ) # no warnings now
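If the locale keeps getting changed during a session, the same workaround can be applied only around the training call. This is a minimal sketch in base R, assuming it is acceptable to switch LC_NUMERIC temporarily; `with_c_numeric` is a hypothetical helper name and is not part of caret.

```r
# Hypothetical helper: evaluate an expression with LC_NUMERIC set to "C",
# then restore whatever numeric locale was active before.
with_c_numeric <- function(expr) {
  old <- Sys.getlocale("LC_NUMERIC")
  # R warns when LC_NUMERIC is set to anything other than "C", so the
  # restore step is wrapped in suppressWarnings().
  on.exit(suppressWarnings(Sys.setlocale("LC_NUMERIC", old)), add = TRUE)
  Sys.setlocale("LC_NUMERIC", "C")
  force(expr) # lazy evaluation: expr runs here, under the "C" locale
}

# Usage with the objects defined in this report (commented out so that the
# sketch runs without caret installed):
# xgbt_test <- with_c_numeric(
#   train(Class ~ ., data = dat, metric = "ROC", method = "xgbTree",
#         trControl = control, tuneGrid = egrid, nthread = 1)
# )
```

Restoring the previous locale keeps the side effect local, but leaving LC_NUMERIC at "C" for the whole session, as in the fix above, is the simpler choice when nothing else depends on the other setting.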

### Session Info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /mnt/DATABASES/anaconda3/envs/giovannimc/lib/libmkl_rt.so.1

locale:
 [1] LC_CTYPE=pt_BR.UTF-8          LC_NUMERIC=pt_BR.UTF-8
 [3] LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=pt_BR.UTF-8          LC_NAME=pt_BR.UTF-8
 [9] LC_ADDRESS=pt_BR.UTF-8        LC_TELEPHONE=pt_BR.UTF-8
[11] LC_MEASUREMENT=pt_BR.UTF-8    LC_IDENTIFICATION=pt_BR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pROC_1.16.2                 GeneEssentiality_1.0.1.1000
[3] PRROC_1.3.1                 caret_6.0-86
[5] ggplot2_3.3.2               lattice_0.20-38

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           pillar_1.4.6         compiler_3.6.1
 [4] gower_0.2.2          plyr_1.8.6           iterators_1.0.12
 [7] class_7.3-15         tools_3.6.1          rpart_4.1-15
[10] ipred_0.9-9          lubridate_1.7.9      lifecycle_0.2.0
[13] tibble_3.0.3         nlme_3.1-139         gtable_0.3.0
[16] pkgconfig_2.0.3      rlang_0.4.7          Matrix_1.2-17
[19] foreach_1.5.0        prodlim_2019.11.13   e1071_1.7-3
[22] ranger_0.12.1        stringr_1.4.0        withr_2.2.0
[25] dplyr_1.0.0          generics_0.0.2       vctrs_0.3.2
[28] recipes_0.1.13       xgboost_1.1.1.1      stats4_3.6.1
[31] grid_3.6.1           nnet_7.3-12          tidyselect_1.1.0
[34] data.table_1.13.0    glue_1.4.1           R6_2.4.1
[37] survival_2.44-1.1    lava_1.6.7           reshape2_1.4.4
[40] purrr_0.3.4          magrittr_1.5         ModelMetrics_1.2.2.2
[43] scales_1.1.1         codetools_0.2-16     ellipsis_0.3.1
[46] MASS_7.3-51.3        splines_3.6.1        randomForest_4.6-14
[49] timeDate_3043.102    colorspace_1.4-1     stringi_1.4.6
[52] munsell_0.5.0        crayon_1.3.4

topepo commented 2 years ago

Sorry. That must have taken forever to figure out.

For caret, we just pass the data off to xgboost (there is no parsing on our side). In your first example, just before the model is fit, the data are in the proper format (stored as numeric but printed as "0,1"):

Browse[2]> tuneValue
  eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
1 0,1         4     1                1                1         1     500
Browse[2]> str(tuneValue)
'data.frame':   1 obs. of  7 variables:
 $ eta             : num 0,1
 $ max_depth       : num 4
 $ gamma           : num 1
 $ colsample_bytree: num 1
 $ min_child_weight: num 1
 $ subsample       : num 1
 $ nrounds         : num 500

I hate to pass you off to someone else, but I think that this has to be fixed by xgboost.
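The distinction between the stored value and its printed form can be checked directly. The sketch below uses a hypothetical one-row data frame, `tune_value`, that mirrors part of the `tuneValue` output above. It only illustrates that the number itself is intact; it makes no claim about how xgboost converts parameters internally.

```r
# Hypothetical one-row grid mirroring part of the `tuneValue` shown above.
tune_value <- data.frame(eta = 1 / 10, max_depth = 4, nrounds = 500)

# The column type and the arithmetic do not depend on how the value prints.
stopifnot(is.numeric(tune_value$eta))
eta_is_intact <- isTRUE(all.equal(tune_value$eta * 10, 1))

# The text rendering is the part that can vary with LC_NUMERIC, so inspect
# it explicitly rather than relying on what print() shows.
eta_as_text <- format(tune_value$eta)
uses_comma <- grepl(",", eta_as_text, fixed = TRUE)
```

If `uses_comma` is TRUE while `eta_is_intact` is also TRUE, the value is fine and only its conversion to text is affected, which is consistent with the parse error reported by xgboost.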