topepo / C5.0

An R package for fitting Quinlan's C5.0 classification model
https://topepo.github.io/C5.0/

Predictions cause error when model was fit using x/y instead of formula #28

Closed jaredlander closed 3 years ago

jaredlander commented 4 years ago

When using the formula interface for C5.0(), everything works as expected. But when the model is fit using the x and y arguments, predict() fails with an error. A Stack Overflow question from earlier this year came to the same conclusion.

Here is some code to illustrate.

library(dplyr)
library(C50)
library(rsample)
library(recipes)

data(credit_data,package='modeldata')
credit <- tibble::as_tibble(credit_data) %>% mutate(across(where(is.factor), as.character))

set.seed(28676)
data_split <- initial_split(credit, prop=.9, strata='Status')
train <- training(data_split)
test <- testing(data_split)

rec_C50 <- recipe(Status ~ ., data=train) %>% 
    themis::step_upsample(Status) %>% 
    step_other(all_nominal(), -Status, other='misc')
prep_c50 <- rec_C50 %>% prep()

train_data <- prep_c50 %>% juice()
test_data <- prep_c50 %>% bake(new_data=test)

# this works as expected
c5_formula <- C5.0(Status ~ ., data=prep_c50 %>% juice())
preds_formula <- predict(c5_formula, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
head(preds_formula)

# this causes an error
c5_xy <- C5.0(x=prep_c50 %>% juice(all_predictors()), y=prep_c50 %>% juice(Status) %>% pull(Status))
preds_xy <- predict(c5_xy, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
Error: 
*** line 1 of `undefined.cases': bad value of `c(3, 2, 2, 2, 5, 1, 2, 5, 2, 3, 1, 1, 2, 2, 1, 3, 2, 5, 5, 4, 2, 4, 2, 2, 2, 3, 2, 2, 2, 5, 2, 1, 2, 4, 3, 3, 2, 2, 2, 5, 5, 1, 2, 3, 5, 5, 4, 1, 4, 3, 2, 2, 3, 2, 1, 2, 5, 5, 2, 2, 2, 2, 2, 5, 2, 5, 2, 3, 2, 2, 2, 5, 2, 2, 1, 2, 2, 2, 2, 3, 2, 2, 3, 5, 2, 5, 5, 2, NA, 2, 3, 2, 2, 2, 5, 2, 3, 1, 3, 6, 2, 2, 3, 5, 5, 5, 2, 2, 2, 2, 4, 3, 3, 5, 2, 2, 3, 6, 2, 2, 4, 1, 3, 2, 2, 3, 3, 5, 2, 2, 1, 2, 2, 2, 3, 2, 1, 1, 4, 2, 2, 4, 4, 3, 2, 5, 2, 2, 2, 3, 2, 5, 2, 1, 2, 2, 5, 2, 2, 1, 2, 3, 5, 2, 5, 3,' for attribute `Home'

Error limit exceeded

Interestingly, when fitting using {workflows}, predictions work for an untuned boost_tree() model and for a tuned or untuned decision_tree() model. But this error occurs when trying to tune a boost_tree() model.

To make matters worse, the {C5.0} website shows this error in the documentation for the predict() function, as seen in the image below.

[image: screenshot of the predict() documentation on the {C5.0} website showing this error]

topepo commented 3 years ago

I think that this works if you install the current dev version of Cubist. It was an issue with how lapply() works with tibbles.
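For readers unfamiliar with the tibble behaviour referred to here, the following is a minimal sketch of the general difference between base data frames and tibbles when a single column is selected with `[`. It is not the package's internal code; the toy data and the column names `Home` and `Age` are made up for illustration. It does appear consistent with the `c(3, 2, 2, ...)` string in the error message above.

```r
library(tibble)

# Toy data, invented for this illustration only
df  <- data.frame(Home = factor(c("owner", "rent", "owner")), Age = c(30, 41, 25))
tbl <- as_tibble(df)

# Single-column `[` drops to a plain vector for a base data.frame ...
class(df[, "Home"])

# ... but a tibble never drops: the result is still a one-column tibble
class(tbl[, "Home"])

# as.character() on that one-column tibble treats it as a list and deparses
# the whole column into a single string instead of one string per row
as.character(tbl[, "Home"])

# `[[` extracts the column itself in both cases, so the results agree
identical(as.character(df[["Home"]]), as.character(tbl[["Home"]]))
```

Code that loops over columns and assumes `x[, i]` is a vector will therefore behave differently when handed a tibble.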

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(C50)
library(rsample)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(credit_data,package='modeldata')
credit <- tibble::as_tibble(credit_data) %>% mutate(across(where(is.factor), as.character))

set.seed(28676)
data_split <- initial_split(credit, prop=.9, strata='Status')
train <- training(data_split)
test <- testing(data_split)

rec_C50 <- recipe(Status ~ ., data=train) %>% 
    themis::step_upsample(Status) %>% 
    step_other(all_nominal(), -Status, other='misc')
#> Warning: replacing previous import 'data.table:::=' by 'ggplot2:::=' when
#> loading 'mlr'
#> Registered S3 methods overwritten by 'themis':
#>   method                  from   
#>   bake.step_downsample    recipes
#>   bake.step_upsample      recipes
#>   prep.step_downsample    recipes
#>   prep.step_upsample      recipes
#>   tidy.step_downsample    recipes
#>   tidy.step_upsample      recipes
#>   tunable.step_downsample recipes
#>   tunable.step_upsample   recipes
prep_c50 <- rec_C50 %>% prep()

train_data <- prep_c50 %>% juice()
test_data <- prep_c50 %>% bake(new_data=test)

# this works as expected
c5_formula <- C5.0(Status ~ ., data=prep_c50 %>% juice())
preds_formula <- predict(c5_formula, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
head(preds_formula)
#> [1] good good good good bad  good
#> Levels: bad good

# this previously caused an error; it runs cleanly here
c5_xy <- C5.0(x=prep_c50 %>% juice(all_predictors()), y=prep_c50 %>% juice(Status) %>% pull(Status))
preds_xy <- predict(c5_xy, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))

Created on 2021-05-06 by the reprex package (v1.0.0.9000)
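If upgrading is not an option, one possible workaround is to pass base data frames rather than tibbles to both `C5.0()` and `predict()`. This is an assumption and is not confirmed anywhere in this thread. The sketch below uses made-up toy data (`toy`, `Home`, `Income`, `Status` are all hypothetical) rather than the credit data from the report.

```r
library(C50)
library(tibble)

# Toy data, invented for this illustration only
set.seed(1)
toy <- tibble(
  Home   = factor(sample(c("owner", "rent", "other"), 200, replace = TRUE)),
  Income = round(runif(200, 50, 200)),
  Status = factor(ifelse(Income > 120, "good", "bad"))
)

# Convert the predictors to base data frames before they reach C5.0()
x_train <- as.data.frame(toy[1:150, c("Home", "Income")])
y_train <- toy$Status[1:150]
x_new   <- as.data.frame(toy[151:200, c("Home", "Income")])

# Fit with the x/y interface and predict on the held-out rows
fit   <- C5.0(x = x_train, y = y_train)
preds <- predict(fit, newdata = x_new)
head(preds)
```

With the recipe from the original report, the equivalent change would be wrapping the `juice()` and `bake()` results in `as.data.frame()`.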