Closed talegari closed 3 years ago
This looks great! A PR would be very welcome.
One thing: `cubist` has the same rule format (but a different model structure in the terminal nodes/rules). It would make sense to have this work for both. Can you adapt it to work with that package? I've been putting any joint infrastructure functions in `Cubist` (e.g. `makeDataFile`, etc.). If this were put in `C50`, it would create circular dependencies.
Thanks Max.
`Cubist` output can be processed similarly. I will write a function to create a tidy data frame for that and submit a PR there.
About the design:
- `tidy_rules` with `tidy_rules.C50` and `tidy_rules.Cubist` methods?
- `tidy_rules` going into both packages separately?
Please suggest.
With other functions that they share, I add them to `Cubist` and then import them from there into `C50`.
So try to write the functions so that they share as much common code as possible, then add those common functions and the `tidy_rules` generic to `Cubist`. Then `C50` can import the class and have its own `tidy_rules` method.
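The layout described above (one generic, shared helpers, one method per model class) could look roughly like the following S3 sketch. The classes `toy_cubist` and `toy_c50`, the helper `rules_to_df`, and the `lhs`/`rhs` list fields are all hypothetical stand-ins chosen for illustration; they are not the real model classes or the code that was eventually submitted.

```r
# Generic: in the proposed layout this would live in one package only.
tidy_rules <- function(object, ...) {
  UseMethod("tidy_rules")
}

# Shared helper (hypothetical name) that both methods reuse.
rules_to_df <- function(lhs, rhs) {
  data.frame(
    id  = seq_along(lhs),
    LHS = lhs,
    RHS = rhs,
    stringsAsFactors = FALSE
  )
}

# One method per (toy) model class; each only extracts its own pieces
# and hands them to the shared helper.
tidy_rules.toy_cubist <- function(object, ...) {
  rules_to_df(object$lhs, object$rhs)
}

tidy_rules.toy_c50 <- function(object, ...) {
  rules_to_df(object$lhs, object$rhs)
}

# Toy objects standing in for fitted models.
cb <- structure(
  list(lhs = "crim > 9.2323", rhs = "13.25 - 0.11 * crim + 0.009 * zn"),
  class = "toy_cubist"
)
c5 <- structure(
  list(lhs = c("x <= 1", "x > 1"), rhs = c("a", "b")),
  class = "toy_c50"
)

tidy_rules(cb)
tidy_rules(c5)
```

Keeping the generic and the helper in a single package means the second package only needs to import them and register its own method, which avoids the circular dependency mentioned earlier in the thread.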
@topepo Please review this draft before PR submission.
It looks really good.
Some recommendations/comments:
- `committee` (singular) and `rule` to better match the output.
- `Gr Liv Area` gets translated to `Gr * Liv * Area` (code below). Odd factor levels seem fine though.
- `model` object? That's built to be parsed. Here is an example using the vignette data set:
> cubist(x = train_pred[, 1:2], y = train_resp)$model %>% cat()
id="Cubist 2.07 GPL Edition 2018-09-01"
prec="1" globalmean="22.41485" extrap="1" insts="0" ceiling="95" floor="0"
att="outcome" mean="22.41" sd="9.284727" min="5" max="50"
att="crim" mean="3.789463" sd="8.553482" min="0.00906" max="88.9762"
att="zn" mean="11.38" sd="23.47519" min="0" max="100"
entries="1"
rules="2"
conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"
type="2" att="crim" cut="9.2322998" result=">"
coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"
conds="1" cover="350" mean="23.98" loval="8.1" hival="50" esterr="5.59"
type="2" att="crim" cut="9.2322998" result="<="
coeff="21.79" att="crim" coeff="-0.62" att="zn" coeff="0.105"
> cubist(x = train_pred[, 1:2], y = train_resp) %>% summary()
Call:
cubist.default(x = train_pred[, 1:2], y = train_resp)
Cubist [Release 2.07 GPL Edition] Sat Sep 1 17:43:44 2018
---------------------------------
Target attribute `outcome'
Read 404 cases (3 attributes) from undefined.data
Model:
Rule 1: [54 cases, mean 12.27, range 5 to 27.9, est err 3.96]
if
crim > 9.2323
then
outcome = 13.25 - 0.11 crim + 0.009 zn
Rule 2: [350 cases, mean 23.98, range 8.1 to 50, est err 5.59]
if
crim <= 9.2323
then
outcome = 21.79 - 0.62 crim + 0.105 zn
Evaluation on training data (404 cases):
Average |error| 5.65
Relative |error| 0.85
Correlation coefficient 0.50
Attribute usage:
Conds Model
100% 100% crim
100% zn
Time: 0.0 secs
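To illustrate why the `model` string is easier to parse than the `summary()` printout, here is a minimal base-R sketch that splits two lines copied from the output above into key/value pairs and rebuilds the rule condition. The helper name `parse_line` is hypothetical, and the sketch assumes values never contain an escaped double quote.

```r
# Two lines copied from the model string shown above.
model_txt <- paste(
  'conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"',
  'type="2" att="crim" cut="9.2322998" result=">"',
  sep = "\n"
)

# Split one line into a named character vector of key = value pairs.
# Keys may repeat (as on the coeff/att lines); a named vector allows that.
parse_line <- function(line) {
  m <- regmatches(line, gregexpr('[A-Za-z]+="[^"]*"', line))[[1]]
  keys <- sub('=.*$', "", m)
  vals <- sub('"$', "", sub('^[^=]*="', "", m))
  setNames(vals, keys)
}

parsed <- lapply(strsplit(model_txt, "\n", fixed = TRUE)[[1]], parse_line)

# Rebuild the condition of the rule from the second line.
cond <- parsed[[2]]
lhs <- paste(cond[["att"]], cond[["result"]], cond[["cut"]])
lhs
```

Working from the model string keeps the full precision of the cut point (`9.2322998`), whereas the printed summary rounds it to `9.2323`.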
Some testing code:
library(Cubist)
library(AmesHousing)
library(tidymodels)

ames <- make_ames()

# Introduce a column name and factor levels that contain spaces
ames2 <-
  ames %>%
  dplyr::rename(`Gr Liv Area` = Gr_Liv_Area) %>%
  mutate(
    Overall_Qual = gsub("_", " ", as.character(Overall_Qual)),
    MS_SubClass = gsub("_", " ", as.character(MS_SubClass))
  )

cb_mod <-
  cubist(
    x = ames2 %>% dplyr::select(-Sale_Price),
    y = log10(ames2$Sale_Price),
    committees = 3
  )

# tidy_rules() is the draft function under review
tr <- tidy_rules(cb_mod)
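One possible way to handle column names with spaces, so that a rule stays parseable as an R expression, is to wrap non-syntactic names in backticks before building the rule string. The sketch below uses only base R; `quote_name` is a hypothetical helper, not the fix that was actually submitted.

```r
# Wrap a name in backticks when it is not a syntactic R name.
quote_name <- function(x) {
  needs_quote <- make.names(x) != x
  ifelse(needs_quote, paste0("`", x, "`"), x)
}

# Toy data with a column name containing spaces.
df <- data.frame(`Gr Liv Area` = c(900, 1500, 2100), check.names = FALSE)

# Build the rule condition as a string, then parse and evaluate it.
lhs <- paste(quote_name("Gr Liv Area"), ">", 1000)
kept <- df[eval(parse(text = lhs), envir = df), , drop = FALSE]

lhs
kept
```

Names that are already syntactic, such as `crim`, pass through unchanged, so existing rules would not be affected.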
Thanks Max,
I will submit a new PR shortly. tidy_rules_spaces_handled.pdf
edit: PR is here
I think that you solved this with `tidyrules`.
Hi Max,
I end up using C5 often for its speed and rules. Thanks for the package!
Although the `summary` function prints the rules in a handy way, it might sometimes be preferable to have them in a tidy format.
Rules displayed on calling the `summary` function:
Output of the tidying function:
Note that the LHS is a string that can be parsed as an R expression. Hence, it can simply be pasted into `dplyr::filter`.
Here is the code to tidy the rules and the code snippet to run an example.
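As a small illustration of the point about the LHS being parseable, the sketch below evaluates each rule string against a toy data frame using base R only. The `rules` table, the `dat` values, and the helper `filter_by_rule` are made up for the example. With dplyr, the analogous call would presumably be something like `dplyr::filter(dat, !!rlang::parse_expr(lhs))`.

```r
# Toy tidy-rules table: one row per rule, condition stored as a string.
rules <- data.frame(
  rule = 1:2,
  LHS  = c("crim > 9.2323", "crim <= 9.2323"),
  stringsAsFactors = FALSE
)

# Toy data to filter.
dat <- data.frame(crim = c(0.5, 12.1, 9.0, 30.2), zn = c(18, 0, 0, 0))

# Parse the condition string and evaluate it using the data's columns.
filter_by_rule <- function(data, lhs) {
  keep <- eval(parse(text = lhs), envir = data, enclos = baseenv())
  data[keep, , drop = FALSE]
}

# Rows covered by each rule, and how many there are.
covered <- lapply(rules$LHS, function(l) filter_by_rule(dat, l))
cover <- vapply(covered, nrow, integer(1))
cover
```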
Please suggest if it might be a good idea to include this in the `broom` package instead of here. Otherwise, let me know if you are open to a PR.
Suggestions are welcome!
Regards, Srikanth KS