Tidy rules - Githubissues

talegari commented 6 years ago

Hi Max,

I end up using C5 often for its speed and rules. Thanks for the package!

Although the summary function prints the rules in a handy way, it is sometimes preferable to have them in a tidy format.

Rules displayed on calling summary function:

Rules:

Rule 1: (50, lift 2.9)
    Petal.Length <= 1.9
    ->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    ->  class versicolor  [0.960]

Rule 3: (46/1, lift 2.9)
    Petal.Width > 1.7
    ->  class virginica  [0.958]

Rule 4: (46/2, lift 2.8)
    Petal.Length > 4.9
    ->  class virginica  [0.938]

Default class: setosa

Output of the tidying function:

support	confidence	lift	LHS	RHS	n_conditions
50	1.0000000	2.94231	Petal.Length < 1.9	setosa	1
48	0.9791667	2.88000	Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7	versicolor	3
46	0.9782609	2.87500	Petal.Width > 1.7	virginica	1
46	0.9565217	2.81250	Petal.Length > 4.9000001	virginica	1

Note that the LHS is a string that can be parsed as an R expression, so it can be pasted directly into dplyr::filter.
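As a minimal sketch of that idea, the snippet below takes one LHS string from the table above, parses it, and uses it to subset iris. It uses base R so it runs without extra packages; the commented dplyr line is the equivalent call and assumes dplyr and rlang are installed.

```r
# One LHS string copied from the tidy output above.
lhs <- "Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7"

# Parse the string into an R expression and evaluate it with the
# columns of iris in scope; the result is one TRUE/FALSE per row.
keep <- eval(parse(text = lhs), envir = iris)
covered <- iris[keep, ]

# Class distribution among the rows that satisfy the rule.
table(covered$Species)

# dplyr equivalent (assumes dplyr and rlang are installed):
# dplyr::filter(iris, !!rlang::parse_expr(lhs))
```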

Here is the code to tidy the rules and the code snippet to run an example.

# load tidy_rules() from the gist
source("https://gist.github.com/talegari/dde1bc3aaed88533bcf7ee137296830a/raw/9bfc1fe894428b21ad2c94dd2d83b6277470dd60/tidy_rules_C5")

library(magrittr)                                          # provides %>%, in case the sourced script does not attach it

dplyr::glimpse(iris)
model <- C50::C5.0(Species ~ ., data = iris, rules = TRUE) # build a C5 model
summary(model)                                             # print rules

tidy_rules(model) %>% knitr::kable()                       # tidy rules as a table

Please let me know whether it would be better to include this in the broom package instead of here. Otherwise, let me know if you are open to a PR.

Suggestions are welcome!

Regards, Srikanth KS

topepo commented 6 years ago

This looks great! A PR would be very welcome.

One thing: Cubist has the same rule format (but a different model structure in the terminal nodes/rules). It would make sense to have this work for both. Can you adapt it to work with that package? I've been putting any joint infrastructure functions in Cubist (e.g. makeDataFile). If this were put in C50, it would create circular dependencies.

talegari commented 6 years ago

Thanks Max.

Cubist output can be processed similarly. I will write a function to create a tidy dataframe for that and submit a PR there.

About the design:

Should we have an S3 generic tidy_rules with tidy_rules.C50 and tidy_rules.Cubist methods?
Or two separate functions named tidy_rules, one in each package?

Please suggest.

topepo commented 6 years ago

With other functions that they share, I add them to Cubist and then import them from there into C50.

So try to write the function so that they share as much common code as possible then add those common functions and the tidy_rules generic to Cubist. Then C50 can import the class and have its own tidy_rules method.
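A rough sketch of that layout, not the eventual implementation: the generic and a shared helper are defined once (as they would be in Cubist), and each model class gets its own method. The helper name collapse_conditions, the list fields conditions and outcome, and the stand-in objects are hypothetical; the sketch also assumes the fitted objects carry the classes "cubist" and "C5.0".

```r
# Would live in Cubist: the generic plus code shared by both packages.
tidy_rules <- function(object, ...) UseMethod("tidy_rules")

# Hypothetical shared helper: join a rule's conditions into one string.
collapse_conditions <- function(conditions) {
  paste(conditions, collapse = " & ")
}

# Would live in Cubist: method for cubist models.
tidy_rules.cubist <- function(object, ...) {
  data.frame(LHS = collapse_conditions(object$conditions),
             RHS = object$outcome,
             stringsAsFactors = FALSE)
}

# Would live in C50, which imports the generic and helper from Cubist.
tidy_rules.C5.0 <- function(object, ...) {
  data.frame(LHS = collapse_conditions(object$conditions),
             RHS = object$outcome,
             stringsAsFactors = FALSE)
}

# Stand-in objects (real fitted models are not plain lists like these).
fake_c5 <- structure(
  list(conditions = c("Petal.Length > 1.9", "Petal.Width <= 1.7"),
       outcome = "versicolor"),
  class = "C5.0"
)
fake_cubist <- structure(
  list(conditions = "crim > 9.2323",
       outcome = "13.25 - 0.11 * crim + 0.009 * zn"),
  class = "cubist"
)

res_c5 <- tidy_rules(fake_c5)         # dispatches to tidy_rules.C5.0
res_cubist <- tidy_rules(fake_cubist) # dispatches to tidy_rules.cubist
```

Keeping the generic in one package means the other only needs an import and a method, which avoids the circular dependency mentioned earlier in the thread.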

talegari commented 6 years ago

@topepo Please review this draft before PR submission.

Please suggest changes in column names and order if necessary
I have not been able to produce a case where no rules are generated. If you have an example of that, I will cover that edge case.

topepo commented 6 years ago

It looks really good.

Some recommendations/comments:

Have the output contain a column for committee (singular) and rule to better match the output.
It looks like non-standard names need to be escaped somehow. A variable named Gr Liv Area gets translated to Gr * Liv * Area (code below). Odd factor levels seem fine though.
Don't forget to update the NEWS file with the change.
Did you consider using the model object? That's built to be parsed. Here is an example using the vignette data set:

> cubist(x = train_pred[, 1:2], y = train_resp)$model %>% cat()
id="Cubist 2.07 GPL Edition 2018-09-01"
prec="1" globalmean="22.41485" extrap="1" insts="0" ceiling="95" floor="0"
att="outcome" mean="22.41" sd="9.284727" min="5" max="50"
att="crim" mean="3.789463" sd="8.553482" min="0.00906" max="88.9762"
att="zn" mean="11.38" sd="23.47519" min="0" max="100"
entries="1"
rules="2"
conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"
type="2" att="crim" cut="9.2322998" result=">"
coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"
conds="1" cover="350" mean="23.98" loval="8.1" hival="50" esterr="5.59"
type="2" att="crim" cut="9.2322998" result="<="
coeff="21.79" att="crim" coeff="-0.62" att="zn" coeff="0.105"
> cubist(x = train_pred[, 1:2], y = train_resp) %>% summary()

Call:
cubist.default(x = train_pred[, 1:2], y = train_resp)

Cubist [Release 2.07 GPL Edition]  Sat Sep  1 17:43:44 2018
---------------------------------

    Target attribute `outcome'

Read 404 cases (3 attributes) from undefined.data

Model:

  Rule 1: [54 cases, mean 12.27, range 5 to 27.9, est err 3.96]

    if
    crim > 9.2323
    then
    outcome = 13.25 - 0.11 crim + 0.009 zn

  Rule 2: [350 cases, mean 23.98, range 8.1 to 50, est err 5.59]

    if
    crim <= 9.2323
    then
    outcome = 21.79 - 0.62 crim + 0.105 zn

Evaluation on training data (404 cases):

    Average  |error|               5.65
    Relative |error|               0.85
    Correlation coefficient        0.50

    Attribute usage:
      Conds  Model

      100%   100%    crim
             100%    zn

Time: 0.0 secs
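To illustrate why the model string is convenient to parse, here is a small base-R sketch that splits two of the lines shown above into key/value pairs and rebuilds one condition. The function name parse_model_line is made up for this example, and the sketch only handles the key="value" layout visible in this output.

```r
# Two lines copied from the model string shown above.
model_lines <- c(
  'conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"',
  'type="2" att="crim" cut="9.2322998" result=">"'
)

# Split one line into a named character vector of key = value pairs.
parse_model_line <- function(line) {
  pairs  <- regmatches(line, gregexpr('[A-Za-z]+="[^"]*"', line))[[1]]
  keys   <- sub('=.*$', "", pairs)            # text before the first "="
  values <- sub('^[A-Za-z]+="', "", pairs)    # drop the key and opening quote
  values <- sub('"$', "", values)             # drop the closing quote
  stats::setNames(values, keys)
}

parsed <- lapply(model_lines, parse_model_line)
header <- parsed[[1]]  # rule-level statistics (cover, mean, esterr, ...)
cond   <- parsed[[2]]  # one condition (att, cut, result)

# Rebuild the condition as a string that R can parse.
lhs <- paste(cond[["att"]], cond[["result"]], cond[["cut"]])
lhs
```

Coefficient lines repeat the keys coeff and att, so a fuller parser would need to pair them up by position rather than look them up by name.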

Some testing code:

library(Cubist)
library(AmesHousing)
library(tidymodels)

ames <- make_ames()

ames2 <- 
  ames %>%
  dplyr::rename(`Gr Liv Area` = Gr_Liv_Area) %>%
  mutate(
    Overall_Qual = gsub("_", " ", as.character(Overall_Qual)),
    MS_SubClass = gsub("_", " ", as.character(MS_SubClass))
    )

cb_mod <- 
  cubist(
    x = ames2 %>% dplyr::select(-Sale_Price),
    y = log10(ames2$Sale_Price),
    committees = 3
    ) 

tr <- tidy_rules(cb_mod)
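One possible way to handle names such as Gr Liv Area, sketched in base R: wrap any non-syntactic name in backticks before building the rule string, so the result still parses. The helper quote_name, the toy data frame, and the threshold 1500 are invented for illustration and are not taken from the draft under review.

```r
# Wrap a name in backticks unless it is already a syntactic R name.
quote_name <- function(x) {
  syntactic <- make.names(x) == x
  ifelse(syntactic, x, paste0("`", x, "`"))
}

vars <- c("Gr Liv Area", "Sale_Price")
quoted <- quote_name(vars)

# Build a condition string and check that it parses and evaluates
# against a toy data frame that keeps the name with spaces.
lhs <- paste(quoted[1], "> 1500")
toy <- data.frame(`Gr Liv Area` = c(1200, 1800, 2100), check.names = FALSE)
keep <- eval(parse(text = lhs), envir = toy)
```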

talegari commented 6 years ago

Thanks Max,

Changed to 'committee'.
Names with spaces have been handled. See the attached doc.
Updated the news file.
I had not noticed the model object. Parsing the model object and parsing the summary output seem almost equivalent. I would stick with parsing the summary output unless you see a compelling reason to change.

I will submit a new PR shortly. tidy_rules_spaces_handled.pdf

edit: PR is here

topepo commented 3 years ago

I think that you solved this with tidyrules.

topepo / C5.0

Tidy rules #16