topepo / C5.0

An R package for fitting Quinlan's C5.0 classification model
https://topepo.github.io/C5.0/

Tidy rules #16

Closed (talegari closed this 3 years ago)

talegari commented 6 years ago

Hi Max,

I end up using C5 often for its speed and rules. Thanks for the package!

Although the summary function prints the rules in a handy way, it would sometimes be preferable to have them in a tidy format.

Rules displayed on calling summary function:

Rules:

Rule 1: (50, lift 2.9)
    Petal.Length <= 1.9
    ->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    ->  class versicolor  [0.960]

Rule 3: (46/1, lift 2.9)
    Petal.Width > 1.7
    ->  class virginica  [0.958]

Rule 4: (46/2, lift 2.8)
    Petal.Length > 4.9
    ->  class virginica  [0.938]

Default class: setosa

Output of the tidying function:

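| support | confidence | lift    | LHS                                                                  | RHS        | n_conditions |
|--------:|-----------:|--------:|:---------------------------------------------------------------------|:-----------|-------------:|
| 50      | 1.0000000  | 2.94231 | Petal.Length < 1.9                                                   | setosa     | 1            |
| 48      | 0.9791667  | 2.88000 | Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7    | versicolor | 3            |
| 46      | 0.9782609  | 2.87500 | Petal.Width > 1.7                                                    | virginica  | 1            |
| 46      | 0.9565217  | 2.81250 | Petal.Length > 4.9000001                                             | virginica  | 1            |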

Note that the LHS is a string that can be parsed as an R expression. Hence, it can simply be pasted into dplyr::filter.
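As a rough illustration of that point, here is a minimal sketch that takes one LHS string from the table above and uses it to subset `iris`. It uses base R (`parse` and `eval`) so that it runs without extra packages; the variable names are only for illustration and are not part of the proposed function.

```r
# One LHS string copied from the tidy output above.
lhs <- "Petal.Width > 1.7"

# Parse the string into an expression and evaluate it against the columns of iris.
keep <- eval(parse(text = lhs), envir = iris)
matched <- iris[keep, ]

nrow(matched)  # 46, the support reported for this rule

# With dplyr and rlang attached, the equivalent would be:
# dplyr::filter(iris, !!rlang::parse_expr(lhs))
```

Parsing strings into code is convenient here because the rules come from the fitted model itself; it should not be applied to untrusted input.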

Here is the code to tidy the rules and the code snippet to run an example.

source("https://gist.github.com/talegari/dde1bc3aaed88533bcf7ee137296830a/raw/9bfc1fe894428b21ad2c94dd2d83b6277470dd60/tidy_rules_C5") # defines tidy_rules()
library(magrittr)                                          # provides %>%, in case the sourced script does not attach it

dplyr::glimpse(iris)
model <- C50::C5.0(Species ~ ., data = iris, rules = TRUE) # build a C5 model
summary(model)                                             # print rules

tidy_rules(model) %>% knitr::kable()                       # tidy rules as a table

Please let me know whether this would fit better in the broom package than here. Otherwise, let me know if you are open to a PR.

Suggestions are welcome!

Regards, Srikanth KS

topepo commented 6 years ago

This looks great! A PR would be very welcome.

One thing: Cubist has the same rule format (but a different model structure in the terminal nodes/rules). It would make sense to have this work for both. Can you adapt it to work with that package? I've been putting any joint infrastructure functions in Cubist (e.g. makeDataFile). If this were put in C50, it would create circular dependencies.

talegari commented 6 years ago

Thanks Max.

Cubist output can be processed similarly. I will write a function to create a tidy dataframe for that and submit a PR there.

About the design:

  1. Should we have an S3 generic tidy_rules with tidy_rules.C50 and tidy_rules.Cubist methods?
  2. Or two different functions named tidy_rules going into both packages separately?

Please suggest.

topepo commented 6 years ago

With other functions that they share, I add them to Cubist then import from there into C50.

So try to write the functions so that they share as much common code as possible, then add those common functions and the tidy_rules generic to Cubist. Then C50 can import the class and have its own tidy_rules method.
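A minimal sketch of that layout, for illustration only. The helper `make_rule_frame`, the stand-in object, and its fields `lhs`, `rhs`, and `support` are hypothetical, and the method names assume the fitted objects carry the classes "cubist" and "C5.0". Real methods would parse the fitted model's rule text rather than read ready-made fields.

```r
# Generic: per the suggestion above, this would live in Cubist and be imported by C50.
tidy_rules <- function(object, ...) {
  UseMethod("tidy_rules")
}

# Hypothetical shared helper: builds the columns common to both packages.
make_rule_frame <- function(lhs, rhs, support) {
  data.frame(
    support      = support,
    LHS          = lhs,
    RHS          = rhs,
    n_conditions = lengths(strsplit(lhs, " & ", fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Assumed class names; each method would do its own parsing, then call the helper.
tidy_rules.cubist <- function(object, ...) {
  make_rule_frame(object$lhs, object$rhs, object$support)
}

tidy_rules.C5.0 <- function(object, ...) {
  make_rule_frame(object$lhs, object$rhs, object$support)
}

# Stand-in object (not a real fitted model) to show the dispatch.
fake_c5 <- structure(
  list(
    lhs     = c("Petal.Length > 1.9 & Petal.Length <= 4.9 & Petal.Width <= 1.7",
                "Petal.Width > 1.7"),
    rhs     = c("versicolor", "virginica"),
    support = c(48L, 46L)
  ),
  class = "C5.0"
)

tidy_out <- tidy_rules(fake_c5)
tidy_out
```

Keeping the generic and the helper in one package means the other package only needs to register its method, which avoids the circular dependency mentioned earlier.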

talegari commented 6 years ago

@topepo Please review this draft before PR submission.

  1. Please suggest changes in column names and order if necessary
  2. I have not been able to produce a case where no rules are generated. If you have an example, I will cover that edge case.

topepo commented 6 years ago

It looks really good.

Some recommendations/comments:

> cubist(x = train_pred[, 1:2], y = train_resp)$model %>% cat()
id="Cubist 2.07 GPL Edition 2018-09-01"
prec="1" globalmean="22.41485" extrap="1" insts="0" ceiling="95" floor="0"
att="outcome" mean="22.41" sd="9.284727" min="5" max="50"
att="crim" mean="3.789463" sd="8.553482" min="0.00906" max="88.9762"
att="zn" mean="11.38" sd="23.47519" min="0" max="100"
entries="1"
rules="2"
conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"
type="2" att="crim" cut="9.2322998" result=">"
coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"
conds="1" cover="350" mean="23.98" loval="8.1" hival="50" esterr="5.59"
type="2" att="crim" cut="9.2322998" result="<="
coeff="21.79" att="crim" coeff="-0.62" att="zn" coeff="0.105"
> cubist(x = train_pred[, 1:2], y = train_resp) %>% summary()

Call:
cubist.default(x = train_pred[, 1:2], y = train_resp)

Cubist [Release 2.07 GPL Edition]  Sat Sep  1 17:43:44 2018
---------------------------------

    Target attribute `outcome'

Read 404 cases (3 attributes) from undefined.data

Model:

  Rule 1: [54 cases, mean 12.27, range 5 to 27.9, est err 3.96]

    if
    crim > 9.2323
    then
    outcome = 13.25 - 0.11 crim + 0.009 zn

  Rule 2: [350 cases, mean 23.98, range 8.1 to 50, est err 5.59]

    if
    crim <= 9.2323
    then
    outcome = 21.79 - 0.62 crim + 0.105 zn

Evaluation on training data (404 cases):

    Average  |error|               5.65
    Relative |error|               0.85
    Correlation coefficient        0.50

    Attribute usage:
      Conds  Model

      100%   100%    crim
             100%    zn

Time: 0.0 secs
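For reference, the raw `$model` text shown above is a series of `key="value"` pairs on each line. A minimal sketch of pulling those pairs apart with base R regular expressions follows. The helper `parse_kv` is hypothetical and not part of Cubist, and it assumes values never contain an escaped double quote.

```r
# Two lines copied from the model text shown above.
model_lines <- c(
  'conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"',
  'type="2" att="crim" cut="9.2322998" result=">"'
)

# Hypothetical helper: split one line into a named character vector.
parse_kv <- function(line) {
  pairs <- regmatches(line, gregexpr('[A-Za-z]+="[^"]*"', line))[[1]]
  keys  <- sub('=.*$', '', pairs)       # text before the first "="
  vals  <- sub('^[^=]+="', '', pairs)   # drop the key and the opening quote
  vals  <- sub('"$', '', vals)          # drop the closing quote
  stats::setNames(vals, keys)
}

header    <- parse_kv(model_lines[1])
condition <- parse_kv(model_lines[2])

# Rebuild the condition as a parseable string and read the rule's coverage.
lhs   <- paste(condition[["att"]], condition[["result"]], condition[["cut"]])
cover <- as.integer(header[["cover"]])

lhs    # "crim > 9.2322998"
cover  # 54
```

Lines that repeat a key, such as the coefficient lines with several `att` and `coeff` entries, produce duplicate names, so they would need to be indexed by position rather than by name.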

Some testing code:

library(Cubist)
library(AmesHousing)
library(tidymodels)

ames <- make_ames()

ames2 <- 
  ames %>%
  dplyr::rename(`Gr Liv Area` = Gr_Liv_Area) %>%
  mutate(
    Overall_Qual = gsub("_", " ", as.character(Overall_Qual)),
    MS_SubClass = gsub("_", " ", as.character(MS_SubClass))
    )

cb_mod <- 
  cubist(
    x = ames2 %>% dplyr::select(-Sale_Price),
    y = log10(ames2$Sale_Price),
    committees = 3
    ) 

tr <- tidy_rules(cb_mod)

talegari commented 6 years ago

Thanks Max,

I will submit a new PR shortly. tidy_rules_spaces_handled.pdf
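The testing code above renames a column to `Gr Liv Area`, so a parseable LHS has to cope with names that contain spaces. One possible approach, sketched here as an assumption and not necessarily what the PR does, is to wrap non-syntactic names in backticks before building the rule string. The helper `quote_name` and the toy data are hypothetical.

```r
# Toy data with a column name containing spaces (check.names = FALSE keeps it as-is).
df <- data.frame(`Gr Liv Area` = c(900, 1500, 2100), check.names = FALSE)

# Hypothetical helper: backtick-quote names that are not syntactically valid in R.
quote_name <- function(name) {
  if (make.names(name) == name) name else paste0("`", name, "`")
}

lhs  <- paste(quote_name("Gr Liv Area"), ">", 1000)
rows <- df[eval(parse(text = lhs), envir = df), , drop = FALSE]

lhs         # "`Gr Liv Area` > 1000"
nrow(rows)  # 2
```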

edit: PR is here

topepo commented 3 years ago

I think that you solved this with tidyrules.