
Exploring the Whole Rashomon Set of Sparse Decision Trees (TreeFARMS)

TreeFARMS (Trees FAst RashoMon Sets) constructs Rashomon sets of decision trees. The Rashomon set is the set of all almost-optimal models. This code enumerates the Rashomon set for sparse decision trees: instead of returning a single optimal decision tree, it returns the set of all decision trees whose objective (misclassification loss plus a penalty on the number of leaves) falls below a pre-defined threshold. To learn more about TreeFARMS, please read our research paper (published at NeurIPS'22).
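In symbols, a minimal sketch of this definition (the notation below is our own shorthand, not lifted from the paper): with a per-leaf penalty λ and a threshold θ, the Rashomon set is

% Rashomon set: all trees whose regularized objective is within the threshold.
% \lambda is the per-leaf penalty; theta may be set relative to an optimal
% tree t^*, e.g. \theta = (1+\epsilon)\,\mathrm{obj}(t^*).
R_{\theta} = \bigl\{\, t \in \mathcal{T} \;:\; \mathrm{obj}(t) \le \theta \,\bigr\},
\qquad \mathrm{obj}(t) = \ell(t) + \lambda \cdot \mathrm{leaves}(t)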

Given the Rashomon set enumerated by this code, the interactive tool TimberTrek can be used to visualize and explore the set of trees.

To make TreeFARMS run faster, limit the depth of the tree and increase the regularization parameter above 0.02. Running the algorithm without a depth constraint, or with a very small regularization, will be noticeably slower.
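For instance, a configuration following this advice might look as follows. The key names come from the Configuration section below; the specific values are illustrative, not recommendations from the authors, and we assume (as in GOSDT) that a depth_budget of 0 means unconstrained depth.

# Illustrative speed-oriented settings; values are examples only.
config = {
    "regularization": 0.05,  # above the 0.02 guideline; penalizes each extra leaf
    "depth_budget": 5,       # limit tree depth (0 is assumed to mean no constraint)
}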

TreeFARMS builds on a number of innovations for scalable construction of optimal tree-based classifiers: Scalable Bayesian Rule Lists[8], CORELS[2], OSDT[4], and, most closely, GOSDT[5].

Note: this repository is built on Fast Sparse Decision Tree Optimization via Reference Ensembles. The package was renamed to allow both packages to be installed side by side. In this repository, all references to GOSDT refer to the TreeFARMS algorithm.

Table of Contents

Installation
Compilation
Configuration
Example
Structure
FAQs
License
Related Work


Installation

You may use the following commands to install TreeFARMS along with its dependencies on macOS, Ubuntu, and Windows.
You need Python 3.7 or later to use the module treefarms in your project. If you encounter the error ERROR: Could not find a version that satisfies the requirement treefarms, try updating pip to the latest version.

pip3 install attrs packaging editables pandas scikit-learn sortedcontainers gmpy2 matplotlib
pip3 install treefarms

You can find a list of available wheels on PyPI.
Please feel free to open an issue if you do not see your distribution offered.
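After installation, a quick import check confirms the wheel matched your platform; this is plain Python, not a treefarms-specific command.

python3 -c "import treefarms; print('treefarms OK')"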


Compilation

Please refer to the manual to build the C++ command-line interface and the Python extension module, and to run the experiments with example datasets on your machine.


Configuration

The configuration is a JSON object and has the following structure and default values:

{
  "regularization": 0.05,
  "rashomon": true,
  "rashomon_bound_multiplier": 0.05,

  "rashomon_bound": 0,
  "rashomon_bound_adder": 0,
  "output_accuracy_model_set": false,
  "output_covered_sets": [],
  "covered_sets_thresholds": [],
  "rashomon_model_set_suffix": "",
  "rashomon_ignore_trivial_extensions": true,
  "rashomon_trie": "",

  "depth_budget": 0,
  "uncertainty_tolerance": 0.0,
  "upperbound": 0.0,
  "precision_limit": 0,
  "stack_limit": 0,
  "tile_limit": 0,
  "time_limit": 0,
  "worker_limit": 1,
  "minimum_captured_points": 0,
  "cancellation": true,
  "look_ahead": true,
  "diagnostics": false,
  "verbose": true,
  "costs": "",
  "model": "",
  "profile": "",
  "timing": "",
  "trace": "",
  "tree": "",
  "datatset_encoding": "",
  "memory_checkpoints": []
}
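Judging by the key names, the Rashomon threshold can be specified in three ways: relative to the optimum (rashomon_bound_multiplier), as additive slack (rashomon_bound_adder), or as an absolute objective value (rashomon_bound). The helper below is hypothetical, written only to illustrate these assumed semantics; see the paper and source code for the authoritative definitions.

# Hypothetical helper illustrating how the three bound options are assumed
# to determine the effective Rashomon threshold. Semantics inferred from the
# key names, not from the source.
def rashomon_threshold(optimal_objective, config):
    if config.get("rashomon_bound", 0) > 0:
        return config["rashomon_bound"]                             # absolute threshold
    if config.get("rashomon_bound_adder", 0) > 0:
        return optimal_objective + config["rashomon_bound_adder"]   # optimum + slack
    # default: optimum scaled by (1 + multiplier)
    return optimal_objective * (1 + config.get("rashomon_bound_multiplier", 0.05))

print(rashomon_threshold(0.20, {"rashomon_bound_multiplier": 0.05}))  # approximately 0.21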

Key parameters

regularization

rashomon

rashomon_bound_multiplier

More parameters

Rashomon-specific configs

rashomon_bound

rashomon_bound_adder

output_accuracy_model_set

output_covered_sets

covered_sets_thresholds

rashomon_model_set_suffix

rashomon_ignore_trivial_extensions

rashomon_trie

Flags

balance

cancellation

look_ahead

similar_support

feature_exchange

continuous_feature_exchange

feature_transform

rule_list

non_binary

diagnostics

verbose

Tuners

uncertainty_tolerance

upperbound

Limits

depth_budget

time_limit

precision_limit

stack_limit

worker_limit

Files

costs

model

profile

timing

trace

tree


Example

Example code for running TreeFARMS with threshold guessing, lower-bound guessing, and a depth limit is provided in treefarms/example.py; a tutorial IPython notebook is available in treefarms/tutorial.ipynb. A minimal sketch of the same flow appears below.
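The following is a hedged sketch of the typical fit-and-explore flow, assuming the TREEFARMS class exported by the package and a pandas DataFrame of binarized features with the label in the last column. The dataset path and config values are placeholders, and the exact accessors should be checked against treefarms/example.py.

# Minimal sketch; placeholder path and illustrative config values.
import pandas as pd
from treefarms import TREEFARMS  # model class, following the repository's example code

# Expect binarized features with the class label in the last column.
df = pd.read_csv("your_binarized_dataset.csv")  # placeholder path
X, y = df.iloc[:, :-1], df.iloc[:, -1]

config = {
    "regularization": 0.05,             # per-leaf penalty in the objective
    "rashomon_bound_multiplier": 0.05,  # keep trees within 5% of the optimal objective
    "depth_budget": 5,                  # bound depth to keep enumeration fast
}

model = TREEFARMS(config)
model.fit(X, y)

# Index into the enumerated Rashomon set, following the repository's example.
first_tree = model[0]
print(first_tree)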


Structure

This repository contains the following directories and files:


FAQs

If you run into any issues when running TreeFARMS, consult the FAQs first.


License

This software is licensed under a 3-clause BSD license (see the LICENSE file for details).


Related Work

[1] Aglin, G.; Nijssen, S.; and Schaus, P. 2020. Learning optimal decision trees using caching branch-and-bound search. In AAAI Conference on Artificial Intelligence, volume 34, 3146–3153.

[2] Angelino, E.; Larus-Stone, N.; Alabi, D.; Seltzer, M.; and Rudin, C. 2018. Learning Certifiably Optimal Rule Lists for Categorical Data. Journal of Machine Learning Research, 18(234): 1–78.

[3] Breiman, L.; Friedman, J.; Stone, C. J.; and Olshen, R. A. 1984. Classification and Regression Trees. CRC press.

[4] Hu, X.; Rudin, C.; and Seltzer, M. 2019. Optimal sparse decision trees. In Advances in Neural Information Processing Systems, 7267–7275.

[5] Lin, J.; Zhong, C.; Hu, D.; Rudin, C.; and Seltzer, M. 2020. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning (ICML), 6150–6160.

[6] Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

[7] Verwer, S.; and Zhang, Y. 2019. Learning optimal classification trees using a binary linear program formulation. In AAAI Conference on Artificial Intelligence, volume 33, 1625–1632.

[8] Yang, H.; Rudin, C.; and Seltzer, M. 2017. Scalable Bayesian Rule Lists. In International Conference on Machine Learning (ICML), 3921–3930.