mayer79 / flashlight

Machine learning explanations
https://mayer79.github.io/flashlight/
GNU General Public License v2.0

binary classification error all(predicted == 0 | predicted == 1) is not TRUE #49

Closed by verajosemanuel 3 years ago

verajosemanuel commented 3 years ago

I built a working model with an xgboost tidymodels workflow and checked that it predicts as expected. Everything looks fine: we get a probability of conversion for a set of visitors to our page. All target levels are present and were converted to TRUE/FALSE.

We get an error when plotting performance:

mydata <- customer_data %>%
  dplyr::mutate(convert = if_else(convert == "converted", TRUE, FALSE))

fl <- flashlight(
  model = xgb_fit,
  label = "conversion",
  y = "convert",
  data = mydata,
  metrics = list(AUC = AUC, f1_score = f1_score, recall = recall, logLoss = logLoss),
  predict_function = function(model, data) predict(model, data, type = "prob")$.pred_converted
)

fl

Flashlight conversion 

Model:           Yes
y:           convert
w:           No
by:          No
data dim:        53958 28
predict_fct default:     FALSE
linkinv default:     TRUE
metrics:         AUC f1_score recall logLoss
SHAP:            No

light_performance(fl) %>%
  plot(fill = "orange") +
  xlab(element_blank())

Error: Problem with `summarise()` input `..1`.
i `..1 = core_fun(cur_data())`.
x all(predicted == 0 | predicted == 1) is not TRUE
i The error occurred in group 1: label = "conversion".

We reproduced the error with the vignette example by simply recoding the iris (ir) virginica column to virginica/notvirginica.

Thanks

mayer79 commented 3 years ago

Hi. logLoss is defined only for predictions > 0 and < 1. In your case, they are all 0 or 1. Can you try to remove this metric and see if the code works?

verajosemanuel commented 3 years ago

Same error, sorry. In fact, I get a probability of conversion instead of a class.

verajosemanuel commented 3 years ago

Solved by tweaking the predict function to return 1/0 given a probability threshold:


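my_predict <- function(model, data) {
  # Class probabilities from the fitted workflow; the column name follows
  # the factor level "converted", as in the first predict function
  preds <- predict(model, data, type = "prob")
  # Turn probabilities into 1/0 using a 0.70 threshold
  as.numeric(preds$.pred_converted > 0.70)
}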
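
A minimal sketch of why thresholding makes the check pass, calling the MetricsWeighted metrics directly rather than through flashlight. The vectors `actual` and `prob` are made-up illustration data, not values from the model in this thread.

```r
library(MetricsWeighted)

# Made-up illustration data: true 0/1 labels and predicted probabilities
actual <- c(0, 0, 1, 1, 1)
prob <- c(0.10, 0.40, 0.35, 0.80, 0.95)

# Probability-based metrics accept values between 0 and 1
auc <- AUC(actual, prob)
ll <- logLoss(actual, prob)

# recall() expects 0/1 predictions, so raw probabilities raise an error;
# tryCatch() captures the message instead of stopping the script
recall_on_prob <- tryCatch(
  recall(actual, prob),
  error = function(e) conditionMessage(e)
)

# After thresholding at 0.70 the predictions are 0/1 and recall() works
pred_class <- as.numeric(prob > 0.70)
rec <- recall(actual, pred_class)
```

The threshold only changes what the class-based metrics see; AUC and logLoss can still be computed on the untouched probabilities.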

mayer79 commented 3 years ago

Sweet! But then, the first approach might have failed because recall is only defined for predictions in {0, 1}. Normally it is better to work with probabilities, so you can try out the original approach with the "right" metrics (e.g. logLoss and AUC).
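
One way this could look, as a rough sketch: one flashlight with probability predictions and probability-based metrics, and a second one with thresholded predictions and class-based metrics. The `glm` model on iris, the 0.5 threshold, and the names `prob_fun`, `fl_prob` and `fl_class` are stand-ins chosen for illustration, not the xgboost workflow from this issue.

```r
library(flashlight)
library(MetricsWeighted)

# Stand-in data: a 0/1 target derived from iris
ir <- iris
ir$virginica <- as.numeric(ir$Species == "virginica")

# Stand-in model: a simple logistic regression
fit <- glm(virginica ~ Sepal.Length, data = ir, family = binomial())

# Predict function returning probabilities
prob_fun <- function(model, data) predict(model, data, type = "response")

# Probabilities paired with metrics defined for probabilities
fl_prob <- flashlight(
  model = fit, label = "probabilities", y = "virginica", data = ir,
  metrics = list(AUC = AUC, logLoss = logLoss),
  predict_function = prob_fun
)

# 0/1 classes paired with metrics defined for classes
fl_class <- flashlight(
  model = fit, label = "classes", y = "virginica", data = ir,
  metrics = list(f1_score = f1_score, recall = recall),
  predict_function = function(model, data) as.numeric(prob_fun(model, data) > 0.5)
)

perf_prob <- light_performance(fl_prob)
perf_class <- light_performance(fl_class)
```

Keeping the two metric families in separate flashlights avoids feeding probabilities to a metric that checks for 0/1 predictions.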