topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html
1.62k stars 633 forks source link

Penalized logistic regression using package penalized #865

Closed paullemmens closed 6 years ago

paullemmens commented 6 years ago

Currently the way that the penalized fit is built prohibits using the packages optional logistic regression because the fit function fixes the model parameter of penalized::penalized() at 'linear'.

getModelInfo('penalized')
## excerpt
$penalized$fit
function (x, y, wts, param, lev, last, classProbs, ...) 
{
    penalized::penalized(y, x, model = "linear", lambda1 = param$lambda1, 
        lambda2 = param$lambda2, ...)
}

With custom models (as described in the manual) this should be fixable.

penalized.fit <- function (x, y, wts, param, lev, last, classProbs, ...)
{
    penalized::penalized(y, x, model = "logistic", lambda1 = param$lambda1,
        lambda2 = param$lambda2, ...)
}
penalized.custom <- getModelInfo('penalized', regex = FALSE)[[1]]
penalized.custom$fit <- penalized.fit
penalized.custom$type <- 'Classification'
fit.control <- trainControl(method = 'repeatedcv', number = 10, repeats = 0)
fit1 <- train(y = shuti.training$isi.remitter, x = shuti.training[, preds],
              method = penalized.custom,
              trControl = fit.control)#, model = 'logistic')

This seems to run fine, but apparently the results that caret expects are not delivered:

# nonzero coefficients: 12          
# nonzero coefficients: 10          
# nonzero coefficients: 8          
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :9     NA's   :9    
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Setting row names on a tibble is deprecated.
2: In data.frame(pred = predicted, obs = y[holdoutIndex],  ... :
  row names were found from a short variable and have been discarded

The warnings in this case seem to be fairly innocuous. I've check that the error is in caret's code.

I'm aware the plr (also) implements penalized logistic regression but acts a bit wonky (on my data?) so I wanted to double check using penalized.

Am I missing something in coding the custom model?

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] caret_6.0-78       lattice_0.20-35    doParallel_1.0.11  iterators_1.0.9   
 [5] foreach_1.4.4      GGally_1.3.2       bindrcpp_0.2       readxl_1.0.0      
 [9] lemmens_0.0.0.1000 forcats_0.3.0      stringr_1.3.0      dplyr_0.7.4       
[13] purrr_0.2.4        readr_1.1.1        tidyr_0.8.0        tibble_1.4.2      
[17] ggplot2_2.2.1      tidyverse_1.2.1   

loaded via a namespace (and not attached):
 [1] nlme_3.1-131.1     lubridate_1.7.3    dimRed_0.1.0       RColorBrewer_1.1-2
 [5] progress_1.1.2     httr_1.3.1         rprojroot_1.3-2    tools_3.4.3       
 [9] backports_1.1.2    R6_2.2.2           rpart_4.1-13       lazyeval_0.2.1    
[13] questionr_0.6.2    colorspace_1.3-2   nnet_7.3-12        withr_2.1.1.9000  
[17] tidyselect_0.2.4   prettyunits_1.0.2  mnormt_1.5-5       compiler_3.4.3    
[21] cli_1.0.0          rvest_0.3.2        penalized_0.9-50   xml2_1.2.0        
[25] labeling_0.3       bookdown_0.7       scales_0.5.0.9000  sfsmisc_1.1-2     
[29] DEoptimR_1.0-8     psych_1.7.8        robustbase_0.92-8  digest_0.6.15     
[33] foreign_0.8-69     rmarkdown_1.9      stepPlr_0.93       pkgconfig_2.0.1   
[37] htmltools_0.3.6    highr_0.6          rlang_0.2.0.9000   ddalpha_1.3.1.1   
[41] rstudioapi_0.7     shiny_1.0.5        bindr_0.1          jsonlite_1.5      
[45] ModelMetrics_1.1.0 magrittr_1.5       Matrix_1.2-12      Rcpp_0.12.15      
[49] munsell_0.4.3      stringi_1.1.6      yaml_2.1.17        MASS_7.3-49       
[53] plyr_1.8.4         recipes_0.1.2      grid_3.4.1         crayon_1.3.4      
[57] miniUI_0.1.1       splines_3.4.3      haven_1.1.1        hms_0.4.1         
[61] knitr_1.20         pillar_1.2.1       stats4_3.4.3       reshape2_1.4.3    
[65] codetools_0.2-15   CVST_0.2-1         glue_1.2.0         evaluate_0.10.1   
[69] modelr_0.1.1       rmdformats_0.3.4   httpuv_1.3.6.2     cellranger_1.1.0  
[73] gtable_0.2.0       reshape_0.8.7      kernlab_0.9-25     assertthat_0.2.0  
[77] DRR_0.0.3          xfun_0.1           gower_0.1.2        prodlim_1.6.1     
[81] mime_0.5           xtable_1.8-2       broom_0.4.3        e1071_1.6-8       
[85] survival_2.41-3    class_7.3-14       timeDate_3043.102  RcppRoll_0.2.2    
[89] lava_1.6           ipred_0.9-6       
topepo commented 6 years ago

A few other parts would need to be modified to make that note. If you were interested in doing a PR:

paullemmens commented 6 years ago

I'm going to have a look at the glmnet code (which I've used successfully for both regression and classification) and build upon that to get penalized working. Will let you know the progress. Thanks for your suggestion and feedback!

hadjipantelis commented 6 years ago

I looked into this and unfortunately this is going to be very messy.

I think that as odd as it sounds a fitted object from a penalized logistic regression model does not have the original labels associated with the logistic regression. For example:

 fit_penali <- penalized(penalized = nki70[ ,c("DIAPH3", "NUSAP1")], response = nki70[, "ER"], 
                         model=  "logistic" )

returns an object of class penfit which has no mention to classnames, labels any other classification related point (the nki70 is a dataset in penalized). In addition as the penfit is an S4 class we cannot just append another attribute $classnames, it needs a formal slot. Response-wise the fitted object contains only the logistic and the linear predictor values. The package is orphaned in CRAN so we cannot probably just ask for a newer version... I really struggle to see how functions like confusionMatrix will ever work without some extensive re-writing. I will be very happy to be proven wrong here!

I have written a wrapper but it is incomplete as it cannot get class labels. @paullemmens, if you want, e-mail the original authors of the package and ask for this change (through a relevant package update).

I have done the fit but without the change mentioned the predict is going to be a problem.

fit = function(x, y, wts, param, lev, last, classProbs, ...) {  
                     if(is.factor(y)) {
                       modeltype <- "logistic"
                     } else { 
                       modeltype <- "linear"
                     } 
                     modelArgs <- list(response = y,
                                       penalized = x,
                                       lambda1 = param$lambda1,
                                       lambda2 = param$lambda2) 
                     inArgs <- list(...)
                     if(!any(names(inArgs) == "model")){
                       modelArgs$model <- modeltype
                     }
                     modelArgs <- c(modelArgs, inArgs) 
                     out <- do.call(penalized::penalized, modelArgs)
                     out
                   }
topepo commented 6 years ago

If it is orphaned, I'm not sure that we should go to the trouble, especially of there are other implementations or regularized logistic regression.

About the S4 model and class levels... yeah there are a bunch of packages that (astoundingly) don't use the class levels (must. not. start. rant). The potential solution is to encapsulate the model in a list that also contains the class levels. The predict and others can work off if the sub-element of the list. That's a hack but ¯_(ツ)_/¯

hadjipantelis commented 6 years ago

I considered this workaround too but I was uncertain if you would be OK with it as it is a bit dodgy... I will look this up in the next couple of weeks.

paullemmens commented 6 years ago

Sorry for the late reply. Agreed. Let's close/drop this one.