topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

hdrda gives me problems #922

Closed crj32 closed 5 years ago

crj32 commented 6 years ago

Dear Caret maintainers

The following does not happen with other models, e.g. random forest. I want a ROC curve from HDRDA and cannot get one, although simple accuracy works OK. This function seems a bit dodgy compared with all the others I use, which work very well.

Thanks.

Error in `[.data.frame`(data, , lvls[1]) : undefined columns selected

Minimal, reproducible example:

Minimal dataset:

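library(caret)
# 1000 x 20 matrix of exponential draws; transposed below to 20 samples x 1000 features
random <- matrix(rexp(20000, rate = 0.1), ncol = 20)
# 20 alternating class labels
response <- rep(c('a', 'b'), 10)
data <- as.data.frame(t(random))
colnames(data) <- paste0('A', seq_len(ncol(data)))
data$outcome <- response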

Minimal, runnable code:

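# Leave-one-out CV with class probabilities, so that ROC can be computed
ctrl <- trainControl(method = 'LOOCV',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)
# Formula interface
txfit4 <- train(outcome ~ ., data = data,
                method = "hdrda",
                trControl = ctrl, metric = "ROC")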

# Same fit through the x/y interface
traindata <- data[, 1:1000]
trainclasses <- data[, 1001]

txfit5 <- train(traindata, trainclasses,
                method = "hdrda",
                trControl = ctrl, metric = "ROC")

Session Info:

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_0.9.2     gridExtra_2.3     doParallel_1.0.11 iterators_1.0.9   foreach_1.4.4     edgeR_3.20.9     
 [7] caret_6.0-80      lattice_0.20-35   ranger_0.10.1     plotROC_2.2.0     pROC_1.12.1       ggplot2_3.0.0    
[13] limma_3.34.9     

loaded via a namespace (and not attached):
 [1] magic_1.5-8         ddalpha_1.3.3       tidyr_0.8.1         sfsmisc_1.1-2       splines_3.4.4       prodlim_2018.04.18 
 [7] Formula_1.2-3       assertthat_0.2.0    stats4_3.4.4        sparsediscrim_0.2.4 DRR_0.0.3           robustbase_0.93-0  
[13] ipred_0.9-6         pillar_1.2.3        glue_1.2.0          randomForest_4.6-14 colorspace_1.3-2    recipes_0.1.2      
[19] Matrix_1.2-14       plyr_1.8.4          psych_1.8.4         timeDate_3043.102   pkgconfig_2.0.1     CVST_0.2-2         
[25] broom_0.4.4         corpcor_1.6.9       mvtnorm_1.0-8       purrr_0.2.5         scales_0.5.0        RSpectra_0.13-1    
[31] gower_0.1.2         lava_1.6.1          Cubist_0.2.2        tibble_1.4.2        withr_2.1.2         HDclassif_2.1.0    
[37] nnet_7.3-12         lazyeval_0.2.1      mnormt_1.5-5        survival_2.42-3     magrittr_1.5        nlme_3.1-137       
[43] MASS_7.3-50         dimRed_0.1.0        foreign_0.8-70      class_7.3-14        tools_3.4.4         stringr_1.3.1      
[49] kernlab_0.9-26      munsell_0.4.3       glmnet_2.0-16       locfit_1.5-9.1      bindrcpp_0.2.2      compiler_3.4.4     
[55] e1071_1.6-8         inum_1.0-0          RcppRoll_0.3.0      rlang_0.2.1         C50_0.1.2           partykit_1.2-2     
[61] geometry_0.3-6      gtable_0.2.0        ModelMetrics_1.1.0  codetools_0.2-15    rARPACK_0.11-0      abind_1.4-5        
[67] reshape2_1.4.3      R6_2.2.2            lubridate_1.7.4     bdsmatrix_1.3-3     dplyr_0.7.5         libcoin_1.0-1      
[73] bindr_0.1.1         stringi_1.2.2       Rcpp_0.12.17        rpart_4.1-13        DEoptimR_1.0-8      tidyselect_0.2.4

hadjipantelis commented 6 years ago

The package sparsediscrim was removed from CRAN. One cannot readily check the behaviour you report because of this.

Is it paramount that you use hdrda? If you have time to spare/invest/waste, rda and rrlda are available if you want a regularised discriminant analysis; they will probably be much slower than hdrda but should provide very similar results.

I suspect that the error you see is due to hdrda not working well with single-item inputs during prediction. You could probably use method = 'cv' and define number = floor(N*0.5), where N is the number of points in your dataset. This will effectively turn the training procedure into leave-two-out cross-validation and should, in theory, take care of any problem with the output. That said, maybe using a random forest (e.g. method = 'ranger') will save you all this trouble. :)
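
For concreteness, the suggestion above could be written roughly as follows. This is only a sketch: N = 20 is taken from the reproducible example in the original report, `ctrl_l2o` is an illustrative name, and the remaining arguments simply mirror the reporter's own trainControl call.

```r
library(caret)

N <- 20  # number of samples in the reproducible example from the report

# N/2 folds over N samples leaves roughly two samples out per fold.
ctrl_l2o <- trainControl(method = "cv",
                         number = floor(N * 0.5),
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE,
                         savePredictions = TRUE)
```

The resulting object would then be passed as trControl = ctrl_l2o in place of the LOOCV control.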

crj32 commented 6 years ago

Unfortunately, the floor suggestion does not get rid of the error. It would just be nice to have hdrda working, because it should perform a bit better than HDDA and we are working with very high-dimensional data. It is OK, though, if it is too tricky to debug. Thanks anyway.

hadjipantelis commented 6 years ago

Hmm... I would wait till the package reappears on CRAN or learn why it was removed to begin with... I checked the binary manually and it seems alright. I will make a manual install over the weekend and let you know if I can at least reproduce this issue.

topepo commented 6 years ago

The issue is with the hdrda predict method. For a data frame with more than one row, you get a nice data frame back, but with a single row it returns a numeric vector. I'm pretty sure that this is new because I specifically test for this when developing the model code and have a lot of regression test cases for this.
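
The single-row behaviour described above can be illustrated in base R without sparsediscrim. The following is only a sketch: `toy_predict` is a hypothetical stand-in that mimics the reported return types (it is not the hdrda predict method), and `as_prob_frame` is a hypothetical helper showing one way the output could be normalised before a summary function selects the class-probability columns.

```r
# Hypothetical stand-in that mimics the reported behaviour: a data frame of
# class probabilities for several rows, a bare named vector for a single row.
toy_predict <- function(newdata) {
  p_a <- plogis(rowSums(newdata))
  probs <- cbind(a = p_a, b = 1 - p_a)
  if (nrow(probs) == 1) probs[1, ] else as.data.frame(probs)
}

# Hypothetical helper: always return a data frame with one row per sample
# and one column per class level.
as_prob_frame <- function(p, lvls) {
  if (is.null(dim(p))) {
    p <- matrix(p, nrow = 1, dimnames = list(NULL, names(p)))
  }
  as.data.frame(p)[, lvls, drop = FALSE]
}

x <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)
lvls <- c("a", "b")

multi  <- toy_predict(x)                     # data frame with columns a and b
single <- toy_predict(x[1, , drop = FALSE])  # named numeric vector

# Coercing the vector yields a single column with rows "a" and "b", so
# selecting the column for the first class level fails.
naive <- try(as.data.frame(single)[, lvls[1]], silent = TRUE)
inherits(naive, "try-error")

# After normalisation the same selection works.
fixed <- as_prob_frame(single, lvls)
fixed[, lvls[1]]
```

The failing selection has the same shape as the `data[, lvls[1]]` call in the reported error message, which is consistent with the explanation given here.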

It could be changed in the predict module but I'm not inclined to spend the time if the package is orphaned. That isn't an indicator of poor quality but is probably more related to some new CRAN restriction/rule.

Oddly, it was orphaned on 2018-07-20, but the last official set of checks (on 2018-08-17 for OS X) looks fine. I would guess at some arcane gcc issue (on Windows, maybe).

An issue was opened two weeks ago so you could ask the developer.

I'll file an issue for this.

topepo commented 5 years ago

I don't think that the package is being supported 😩 I'll close this but please reopen if that changes.