topepo / FES

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson
https://bookdown.org/max/FES
GNU General Public License v2.0

different output for chapter 11.2 initial ROC analysis #103

Closed by LuisCioffi 2 years ago

LuisCioffi commented 2 years ago

Hello everyone, and congratulations on such a useful and complete book.

In chapter 11.2 there is a sentence about an initial AUC screening of all the predictors in the training set:

The initial analysis of the training set showed that there were 5 predictors with an area under the ROC curve of at least 0.80 and 21 were between 0.75 and 0.80.

However, when I tried to replicate this part, I found 21 predictors with an AUC between 0.75 and 0.80 but only 2 (not 5) with an AUC of at least 0.80.

Since I can't find this part in the released code, could you please check my code and tell me if I'm doing something wrong or different from your code?

Thank you and regards,

Luis

library(caret)       # filterVarImp()
library(tidymodels)  # loads rsample (initial_split/training/testing), dplyr, recipes
library(tidyverse)   # read_csv() and the pipe-based wrangling

# Average the repeated recordings per subject, recode the outcome as a factor,
# drop the id and unused columns, and rename gender to male and class to Class.
pd_speech <-
  read_csv("pd_speech_features.csv", skip = 1) %>%
  group_by(id) %>%
  summarise_all(mean) %>%
  mutate(
    class = factor(
      ifelse(class == 1, "PD", "control"),
      levels = c("PD", "control")
    )
  ) %>%
  dplyr::select(-id, -male, -numPulses, -numPeriodsPulses, male = gender, Class = class)

# 75/25 training/test split
set.seed(825)
pd_split <- initial_split(pd_speech, prop = 0.75)

pd_tr <- training(pd_split)
pd_te <- testing(pd_split)

# Per-predictor ROC AUC screen on the training set (Class is the last column)
Initial_ROC <- filterVarImp(x = pd_tr[, -ncol(pd_tr)], y = pd_tr$Class)

Initial_ROC %>% dplyr::filter(PD >= 0.80) %>% count()              # 2 predictors with AUC >= 0.80
Initial_ROC %>% dplyr::filter(PD > 0.75 & PD < 0.80) %>% count()   # 21 predictors between 0.75 and 0.80
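
As a rough cross-check on the filterVarImp() numbers, the same screen can be done predictor by predictor with pROC. This is only a sketch, reusing the pd_tr object built above; the two functions may handle curve direction slightly differently, so treat it as a sanity check rather than an exact replication.

library(pROC)

# AUC for each predictor, treating the predictor itself as a score for Class.
# direction = "auto" lets pROC choose the curve orientation per predictor.
predictor_names <- setdiff(names(pd_tr), "Class")
auc_vals <- sapply(
  predictor_names,
  function(p) as.numeric(auc(roc(pd_tr$Class, pd_tr[[p]], quiet = TRUE)))
)

sum(auc_vals >= 0.80)                    # predictors with AUC of at least 0.80
sum(auc_vals >= 0.75 & auc_vals < 0.80)  # predictors between 0.75 and 0.80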
LuisCioffi commented 2 years ago

Closed as the difference is probably due to a different seed in the training/test split.
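
For anyone landing here later, a quick way to check that hypothesis is to rerun the split and the screen over a few seeds and see how much the counts move. A minimal sketch reusing the objects from the code above; the seed values are arbitrary.

# Repeat the split and the AUC screen for a handful of seeds and tally the bins.
# Only the spread of the counts matters here, not the specific seeds.
for (s in c(825, 1, 42, 2020, 7777)) {
  set.seed(s)
  tr_s  <- training(initial_split(pd_speech, prop = 0.75))
  roc_s <- filterVarImp(x = tr_s[, -ncol(tr_s)], y = tr_s$Class)
  cat(
    "seed", s,
    "| AUC >= 0.80:", sum(roc_s$PD >= 0.80),
    "| 0.75 <= AUC < 0.80:", sum(roc_s$PD >= 0.75 & roc_s$PD < 0.80),
    "\n"
  )
}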