h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add a RuleFit page to the User Guide #7878

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We should add a page on RuleFit to the Supervised algorithms section of the User Guide: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#supervised

The new content:

h2. Introduction

The RuleFit algorithm combines tree ensembles and linear models to take advantage of both methods: the accuracy of a tree ensemble and the interpretability of a linear model.

The general algorithm fits a tree ensemble to the data, builds a rule ensemble by traversing each tree, evaluates the rules on the data to build a rule feature set, and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set.
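To make the rule-building steps concrete, here is a minimal pure-Python sketch of the middle two stages: traversing a tree to collect rules and evaluating those rules on data to get binary rule features. The nested-dict tree layout, the toy rows, and the helper names (`extract_rules`, `rule_applies`, `rule_features`) are assumptions made for illustration and are not H2O's internal representation or API. Following Friedman & Popescu, every non-root node contributes one rule. The tree-fitting and LASSO steps are omitted.

```python
# Hypothetical toy tree: an internal node is a dict with a numeric split,
# a leaf is None. Illustration only, not how H2O stores its trees.
TREE = {
    "feature": "age", "threshold": 18.0,
    "left": None,
    "right": {"feature": "fare", "threshold": 20.0, "left": None, "right": None},
}


def extract_rules(node, path=()):
    """Return one rule per non-root node: the conjunction of split
    conditions on the path from the root to that node."""
    rules = []
    if path:  # the root has an empty path and yields no rule
        rules.append(path)
    if node is not None:  # internal node: descend into both branches
        feature, threshold = node["feature"], node["threshold"]
        rules += extract_rules(node["left"], path + ((feature, "<", threshold),))
        rules += extract_rules(node["right"], path + ((feature, ">=", threshold),))
    return rules


def rule_applies(rule, row):
    """True if the row satisfies every condition of the rule."""
    return all(
        row[feature] < threshold if op == "<" else row[feature] >= threshold
        for feature, op, threshold in rule
    )


def rule_features(rules, rows):
    """Build the 0/1 rule feature matrix: one row per observation,
    one column per rule."""
    return [[int(rule_applies(rule, row)) for rule in rules] for row in rows]


# Made-up observations, used only to exercise the functions above.
rows = [
    {"age": 10.0, "fare": 50.0},
    {"age": 30.0, "fare": 7.25},
    {"age": 40.0, "fare": 80.0},
]

rules = extract_rules(TREE)
features = rule_features(rules, rows)

for rule in rules:
    print(" AND ".join(f"{f} {op} {t}" for f, op, t in rule))
for feature_row in features:
    print(feature_row)
```

In the full algorithm these 0/1 columns would be joined with the original predictors before the sparse linear fit. The sketch handles only numeric splits and ignores missing values.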

h2. Defining a RuleFit Model (beta API)

h2. Interpreting a RuleFit Model

The output for the RuleFit model includes:

h2. Examples

In R:

{noformat}library(h2o)
h2o.init()

# Import the Titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)

# Set the response and predictor columns
response <- "survived"
predictors <- c("age", "sibsp", "parch", "fare", "sex", "pclass")

# Convert the response and pclass to factors
titanic[, response] <- as.factor(titanic[, response])
titanic[, "pclass"] <- as.factor(titanic[, "pclass"])

# Train a RuleFit model
rf_h2o <- h2o.rulefit(y = response, x = predictors, training_frame = titanic,
                      max_rule_length = 10, max_num_rules = 100, seed = 1234)

# Print the rule importance table
print(rf_h2o@model$rule_importance){noformat}

In Python:

{noformat}import h2o
from h2o.estimators.rulefit import H2ORuleFitEstimator

h2o.init()

# Import the Titanic dataset, reading pclass and survived as categorical columns
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv",
                     col_types={'pclass': "enum", 'survived': "enum"})

# Predictor columns
x = ["age", "sibsp", "parch", "fare", "sex", "pclass"]

# Train a RuleFit model
rf_h2o = H2ORuleFitEstimator(max_rule_length=10, max_num_rules=100, seed=1234,
                             model_type="rules_and_linear")
rf_h2o.train(training_frame=df, x=x, y="survived")

# Print the rule importance table
print(rf_h2o._model_json['output']['rule_importance']){noformat}

h2. References

Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954.

exalate-issue-sync[bot] commented 1 year ago

Zuzana Olajcová commented: Hi [~accountid:5d1185d4f46aa30c271c7cc6] , I’ve prepared the content to update docs. Can you please review it from your POV and add it to the User Guide? Thanks!

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7763
Assignee: hannah.tillman
Reporter: Erin LeDell
State: Resolved
Fix Version: 3.32.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/4928
https://github.com/h2oai/h2o-3/pull/4943