Makes handling output from decision trees easy. Treezy.
Decision trees are a commonly used tool in statistics and data science, but sometimes getting the information out of them can be a bit tricky, and can make other operations in a pipeline difficult.
treezy
makes it easy to:
The data structures created in treezy
- importance_table
are making
their way over to the
broomstick
package - a member
of the broom family specifically focussing on decision trees, which
gives different output to many of the (many!) packages/analyses that
broom deals with.
I am interested in feedback, so please feel free to file an
issue if you have any
problems!
# install.packages("remotes")
remotes::install_github("njtierney/treezy")
importance_table
and importance_plot
library(treezy)
library(rpart)
fit_rpart_kyp <- rpart(Kyphosis ~ ., data = kyphosis)
# default method for looking at importance
# variable importance
fit_rpart_kyp$variable.importance
#> Start Age Number
#> 8.198442 3.101801 1.521863
# with treezy
importance_table(fit_rpart_kyp)
#> # A tibble: 3 x 2
#> variable importance
#> <fct> <dbl>
#> 1 Start 8.20
#> 2 Age 3.10
#> 3 Number 1.52
importance_plot(fit_rpart_kyp)
# extend and modify
library(ggplot2)
importance_plot(fit_rpart_kyp) +
theme_bw() +
labs(title = "My Importance Scores",
subtitle = "For a CART Model")
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:ggplot2':
#>
#> margin
set.seed(131)
fit_rf_ozone <- randomForest(Ozone ~ .,
data = airquality,
mtry=3,
importance=TRUE,
na.action=na.omit)
fit_rf_ozone
#>
#> Call:
#> randomForest(formula = Ozone ~ ., data = airquality, mtry = 3, importance = TRUE, na.action = na.omit)
#> Type of random forest: regression
#> Number of trees: 500
#> No. of variables tried at each split: 3
#>
#> Mean of squared residuals: 302.2117
#> % Var explained: 72.46
## Show "importance" of variables: higher value mean more important:
# randomForest has a better importance method than rpart
importance(fit_rf_ozone)
#> %IncMSE IncNodePurity
#> Solar.R 9.7565080 10741.332
#> Wind 22.1546372 44234.960
#> Temp 43.8519213 53787.557
#> Month 2.5862801 1692.479
#> Day 0.9264646 6606.387
## use importance_table
importance_table(fit_rf_ozone)
#> # A tibble: 5 x 3
#> variable `%IncMSE` IncNodePurity
#> <fct> <dbl> <dbl>
#> 1 Solar.R 9.76 10741.
#> 2 Wind 22.2 44235.
#> 3 Temp 43.9 53788.
#> 4 Month 2.59 1692.
#> 5 Day 0.926 6606.
# now plot it
importance_plot(fit_rf_ozone)
# CART
rss(fit_rpart_kyp)
#> [1] 13
# randomForest
rss(fit_rf_ozone)
#> [1] 33545.5
# using gbm.step from the dismo package
library(gbm)
library(dismo)
# load data
data(Anguilla_train)
anguilla_train <- Anguilla_train[1:200,]
# fit model
angaus_tc_5_lr_01 <- gbm.step(data = anguilla_train,
gbm.x = 3:14,
gbm.y = 2,
family = "bernoulli",
tree.complexity = 5,
learning.rate = 0.01,
bag.fraction = 0.5)
gg_partial_plot(angaus_tc_5_lr_01,
var = c("SegSumT",
"SegTSeas"))
gbm
, tree
, ranger
,
xgboost
, and more)broom
’s augment
, tidy
, and glance
functions. For example, rpart_fit$splits
Credit for the name, “treezy”, goes to @MilesMcBain, thanks Miles!