scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Adding Oblique Trees (Forest-RC) to the Cythonized Tree Module #20819

Open adam2392 opened 3 years ago

adam2392 commented 3 years ago

Describe the workflow you want to enable

I am part of the @neurodata team. We are proposing to add a cythonized module that allows for building oblique trees.

Oblique trees, also known as Forest-RC, were originally proposed by Breiman (2001) [1] and would enable a more flexible class of decision trees. Breiman called this variant Forest-RC, in contrast to what most people now call random forest, which he called Forest-RI. He noted:

"Forest-RC does exceptionally well on the synthetic data sets. Overall, it compares more favorably to Adaboost than Forest-RI."

In other words, in the paper that introduced random forests over 20 years ago, now with over 75,000 citations, Breiman presented two algorithms and argued that the one using sparse linear combinations of features worked better than the simpler one most people have used since. More recently, several authors have corroborated these original results [2-6]. Oblique trees in general give trees much greater flexibility to take advantage of structure in data.
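To make the distinction concrete, here is a minimal NumPy sketch of the two split types. The sampling scheme is illustrative, following Breiman's description of Forest-RC (a few features combined with weights drawn uniformly from [-1, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Forest-RI (classic random forest): each split thresholds a single feature.
axis_aligned_split = X[:, 2] <= 0.5

# Forest-RC (oblique): each split thresholds a sparse random linear
# combination of features, e.g. 2 of 5 features with weights from [-1, 1].
w = np.zeros(5)
chosen = rng.choice(5, size=2, replace=False)
w[chosen] = rng.uniform(-1, 1, size=2)
oblique_split = X @ w <= 0.5
```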

[1]: Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
[2]: T. M. Tomita, J. Browne, C. Shen, J. Chung, J. L. Patsolic, B. Falk, J. Yim, C. E. Priebe, R. Burns, M. Maggioni, and J. T. Vogelstein. Sparse Projection Oblique Randomer Forests. Journal of Machine Learning Research, 2020.
[3]: T. M. Tomita, M. Maggioni, and J. T. Vogelstein. ROFLMAO: Robust Oblique Forests with Linear Matrix Operations. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 498–506. Society for Industrial and Applied Mathematics, 2017.
[4]: T. Rainforth and F. Wood. Canonical Correlation Forests. arXiv, 2015.
[5]: R. Perry, A. Li, C. Huynh, T. M. Tomita, R. Mehta, J. Arroyo, J. Patsolic, B. Falk, and J. T. Vogelstein. Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks. 2019.
[6]: B. Menze, B. Kelm, D. Splitthoff, U. Köthe, and F. Hamprecht. On Oblique Random Forests. Pages 453–469, 2011. https://doi.org/10.1007/978-3-642-23783-6_29

Describe your proposed solution

The implementation of Forest-RC (oblique forests) has been ported into a scikit-learn fork (https://github.com/neurodata/scikit-learn/pull/11/). The Forest-RC Cython implementation builds cleanly on top of the existing Cython code for regular decision trees and thus does not affect the runtime or performance of Random Forests at all. Moreover, it improves upon RF on several CC-18 benchmarks (see figures below).

The .pxd files can also be included in each scikit-learn release, so that other sklearn-compatible forest modules can build upon our implementation of the Oblique tree.
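As a hedged illustration of the kind of extension this would enable, here is a sketch in which all names are hypothetical stand-ins rather than an existing scikit-learn API (a real third-party module would cimport the Cython class declared in the shipped .pxd instead of defining it inline):

```python
import numpy as np

class ObliqueSplitter:
    """Stand-in for the Cython splitter that the shipped .pxd would declare."""

    def sample_projection(self, n_features, rng):
        # Breiman-style default: a sparse combination of 2 features
        # with uniform random weights.
        w = np.zeros(n_features)
        chosen = rng.choice(n_features, size=2, replace=False)
        w[chosen] = rng.uniform(-1, 1, size=2)
        return w

class GaussianProjectionSplitter(ObliqueSplitter):
    """Hypothetical research splitter: dense Gaussian projections."""

    def sample_projection(self, n_features, rng):
        return rng.normal(size=n_features)
```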

Relevant Benchmarks

To ease scikit-learn developer concerns about incorporating such a large code change, we provide the following benchmarks and tests of Cythonized Forest-RC vs. Cythonized Forest-RI. These benchmarks were run on the OpenML CC-18 suite. max_features was set to n_features, so runtime was very high for both RF and oblique forests on e.g. Fashion-MNIST.

  1. Benchmark tests on CC-18 vs. Random Forests (see the two figures below): Forest-RC is significantly better than RF on this benchmark suite. A paired one-sided Wilcoxon signed-rank test was run; Forest-RC was significantly better than RF with a p-value of roughly 1e-20 (a minimal sketch of this test appears after the list).

[Figures: CC-18 benchmark accuracy comparisons, Forest-RC vs. Forest-RI]

  2. Overall runtime (training times on CC-18) vs. Random Forests

[Figure: CC-18 training-time comparison, Forest-RC vs. Forest-RI]

  3. Runtime dependence on the number of features and the number of samples: In addition, we benchmarked the runtime performance of oblique trees/forests using the code provided in benchmarks/bench_tree.py. The notebook is shown in this gist: https://gist.github.com/adam2392/a652aa1c88ba94f0d6aab96ccd3ade24

  4. Visualizing the intuitive differences between oblique and axis-aligned trees: Per a discussion with @thomasjpfan, who wanted to see how we can educate an incoming user on when/how oblique trees are different, we can extend the existing example in sklearn. The following notebook shows the decision surface of an oblique tree vs. an axis-aligned tree: https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115
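For item 1, here is a minimal sketch of the significance test described above, assuming paired per-dataset accuracies from the CC-18 suite (the arrays below are synthetic placeholders, not our actual results):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_datasets = 72  # OpenML CC-18 contains 72 classification datasets

# Placeholder paired accuracies, one pair per CC-18 dataset.
acc_forest_rc = rng.uniform(0.7, 1.0, size=n_datasets)
acc_forest_ri = acc_forest_rc - rng.uniform(0.0, 0.05, size=n_datasets)

# Paired one-sided test: is Forest-RC better than Forest-RI?
stat, p_value = wilcoxon(acc_forest_rc, acc_forest_ri, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.2e}")
```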

Additional context

cc: @jovo

Some action items are:

  1. Unit testing (I will probably need some direction on where tests should be added, e.g. as parametrizations, and what sort of tests the maintainers would like to see)
  2. Make the official PR so developers/maintainers can track progress
adam2392 commented 3 years ago

To clarify, we have a Cythonized codebase that generated the above results. It is roughly factored into the scikit-learn fork right now; once the devs discuss and give me the green light, I can open a PR where the diff will be more obvious.

glemaitre commented 3 years ago

I think that SPORF does not fulfil the inclusion criterion: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

As stated in the documentation, you could create a scikit-learn-contrib package that would depend on scikit-learn or if the maintenance of the full package is too much, check for inclusion in scikit-learn-extra. This would give visibility to the project.

that doesn't affect non-oblique RF in any way

When I looked at the PR, it looked to me like there was a lot of duplicated code. I thought that it could be possible to inherit more functionality from the scikit-learn code base (but I did not look closely enough to have useful feedback, though).

adam2392 commented 3 years ago

Hi @glemaitre I am Adam, part of the @neurodata group.

I think that SPORF does not fulfil the inclusion criterion: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

Thank you for your comment. We apologize for the confusion here. This issue and PR are primarily about a Forest-RC Cython implementation, with some additional functionality. I went ahead and edited the proposal to be a bit clearer.

The proposal here is specifically for Forest-RC, which was introduced by Breiman in 2001. Forest-RC complements and builds on top of the Forest-RI algorithm (called RandomForestClassifier/RandomForestRegressor in scikit-learn). Since that paper has over 75,000 citations and has been monumental in the field, many papers and experiments have demonstrated the empirical performance of variants of Forest-RC, further validating the strength of the approach.

When I looked at the PR, it looked to me like there was a lot of duplicated code. I thought that it could be possible to inherit more functionality from the scikit-learn code base (but I did not look closely enough to have useful feedback, though).

I have tidied up a test PR to make the diff easier to read: https://github.com/neurodata/scikit-learn/pull/11/files.

In the PR, the main new code/logic is in the files that handle "oblique splitting". This enables one to sample random features and combine them before performing the usual Random Forest procedure. Right now it duplicates some portions of the code, because I presume it's better not to touch the highly optimized Random Forest code.

However, I can imagine that with a slight refactoring of the underlying tree code to allow weights to be applied at each split (currently they are implicitly just unit vectors), oblique trees could take up far fewer lines of code. A sketch of that idea follows.
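A minimal sketch of the idea; the helper below is illustrative, not the actual Cython internals:

```python
import numpy as np

def go_left(X, weights, threshold):
    """Generic split test: project samples onto `weights`, then threshold."""
    return X @ weights <= threshold

X = np.random.default_rng(0).normal(size=(10, 4))

w_axis = np.array([0.0, 1.0, 0.0, 0.0])      # unit vector -> X[:, 1] <= t
w_oblique = np.array([0.7, 0.0, -0.3, 0.0])  # sparse linear combination

# The axis-aligned split is exactly the unit-vector special case.
assert np.array_equal(go_left(X, w_axis, 0.0), X[:, 1] <= 0.0)
left_mask = go_left(X, w_oblique, 0.0)
```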

WDYT?

adam2392 commented 3 years ago

Hi @glemaitre just wanted to follow up on this thread now that v1.0 has been released (congrats and awesome milestone for the package). Happy to jump on a zoom call if that's easier too!

adam2392 commented 2 years ago

Hi @glemaitre, just wanted to follow up again on this. To reiterate my points above: this algorithm ('Forest-RC') is the one proposed in Breiman's 2001 paper, and it has good empirical support from the CC-18 benchmark. I think there are some issues to sort out regarding how best to factor in this code.

I'm happy to submit a PR for an initial skim review and then possibly break the Cython portions up into smaller PRs for easier review and merges.

adam2392 commented 2 years ago

Comments from the call at the sklearn Discord OH (general) w/ @glemaitre @adrinjalali @thomasjpfan:

  1. Refactoring existing tree code
  2. Adding the feature

The goal here is to refactor the currently complicated tree code to support easier maintenance, and then to allow the inclusion of more exotic features at the tree level.

Possible order of PRs to first refactor:

^ @adam2392 to "try" the refactoring, from easier to harder

adam2392 commented 2 years ago

Hi @thomasjpfan just wanted to see if there was any initial draft that you had made that wasn't finalized, so I can kinda see where this might go. Happy to contribute thoughts too!

Also, FYI: https://github.com/neurodata/scikit-learn/pull/11 has a test draft PR demonstrating how I "naively" implemented oblique trees on top of the existing axis-aligned trees. In case there's some inspiration for cases the design doc might need to account for?

thomasjpfan commented 2 years ago

There are several other items on my TODO list before I get to the tree design. I should have a design ready to show in 2-3 weeks.

In case there's some inspiration for cases the design doc might need to account for?

Your draft PR will be something I will keep in mind when designing the trees. I want to make sure the design will be extendable for your use case.

adam2392 commented 2 years ago

Hi @thomasjpfan just wanted to check in. Or would OH be a better time to discuss?

adam2392 commented 2 years ago

Hope you had a good Thanksgiving @thomasjpfan. Just wanted to check in on the status of this?

thomasjpfan commented 2 years ago

There are a few more constraints I need to work out:

  1. Make sure that the NOCATS use case (categorical features) is covered: https://github.com/scikit-learn/scikit-learn/pull/12866
  2. New loss functions with OpenMP: https://github.com/scikit-learn/scikit-learn/pull/20567 (most likely we do not need to consider it, since the criterion's requirements are a bit different from the losses)

I have finally gotten through most of my backlog, so the tree design + prototype is on my agenda this week.

adam2392 commented 2 years ago

Sounds good, happy to discuss and help implement when this is finalized! Exciting.

thomasjpfan commented 2 years ago

To keep a written record, I also need to look at:

thomasjpfan commented 2 years ago

I am trying to think of a good roadmap for getting a tree refactor going, since the tree code is very tightly coupled. There is a lot of code to refactor to satisfy all the use cases. The refactor would make new features easier, but the first PR will take a long time to review.

SplitRecord being a struct is useful because it does not require the GIL to create. I think there are ways to work around this and make SplitRecord into a class. Even with a workaround, I think one still needs custom logic for a custom split record, which needs a custom Splitter and a custom Tree object to understand the custom record type. In the end I see it going back to what was already done in https://github.com/neurodata/scikit-learn/pull/11. There is a lot of flexibility you get from copying the existing code and making modifications where you see fit.

adam2392 commented 2 years ago

I am trying to think of a good roadmap for getting a tree refactor going, since the tree code is very tightly coupled. There is a lot of code to refactor to satisfy all the use cases. The refactor would make new features easier, but the first PR will take a long time to review.

Sounds good. Happy to help implement this!

In the end I see it going back to what was already done in neurodata#11

Is this because Cython doesn't support enough C++ features yet? I was thinking that if you can subclass SplitRecord and Splitter, then the Tree object should work out of the box. In https://github.com/neurodata/scikit-learn/pull/11, the Tree code is actually all the same except for the types of SplitRecord and Splitter.

I can come to OH to discuss more if you'll be there?

thomasjpfan commented 2 years ago

I can come to OH to discuss more if you'll be there?

I'll be attending so we can chat there. In the ObliqueTree case, it needs to know how to add a node based on the custom SplitRecord, which is what you did with _add_oblique_node. If we want to be fully generic, then the SplitRecord should be passed into _add_node and we let the tree decide how it wants to add the record. ObliqueTree also needs to have its own apply functions because it has additional weights to take into account. For the ObliqueTree, I think only the builder is generic enough to not need a rewrite, as long as the SplitRecord is passed around.
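To keep the discussion concrete, here is a rough pure-Python model of that generic design. The real code is Cython; every name here mirrors the discussion, not the actual scikit-learn internals:

```python
from dataclasses import dataclass

@dataclass
class SplitRecord:
    feature: int
    threshold: float

@dataclass
class ObliqueSplitRecord(SplitRecord):
    features: tuple = ()  # indices of the combined features
    weights: tuple = ()   # projection weights for those features

class Tree:
    def __init__(self):
        self.nodes = []

    def _add_node(self, record):
        # The axis-aligned tree only needs (feature, threshold).
        self.nodes.append((record.feature, record.threshold))

class ObliqueTree(Tree):
    def _add_node(self, record):
        # The oblique tree must also store the projection so that its own
        # apply() can recompute X @ w at prediction time.
        self.nodes.append((record.features, record.weights, record.threshold))
```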

Stepping back a little, I see ObliqueTree meets the inclusion criteria through the citations of the original Random Forest paper, but including it will increase the maintenance cost of the library. I think oblique trees need a story around "How should a user decide between the current trees and the oblique ones?" to be considered for inclusion. From the benchmarks you have shown and from the "Sparse Projection Oblique Randomer Forests" paper, it looks like it is another hyper-parameter to tune.

Since you have a working implementation for oblique trees, I think you can create a python package and release it. This would be the fastest way to get something released and get feedback from users.

adam2392 commented 2 years ago

I'll be attending so we can chat there.

Sounds good. I will be there.

In the ObliqueTree case, it needs to know how to add a node based on the custom SplitRecord, which is what you did with _add_oblique_node. If we want to be fully generic, then the SplitRecord should be passed into _add_node and we let the tree decide how it wants to add the record. ObliqueTree also needs to have its own apply functions because it has additional weights to take into account. For the ObliqueTree, I think only the builder is generic enough to not need a rewrite, as long as the SplitRecord is passed around.

Sounds like a good plan and idea to me. We are happy to work with you extensively on implementations and refactoring.

This generalization could then very easily allow researchers to extend the sklearn implementation and build their own Splitters, which could be housed in another repository. For example, one can imagine a number of "new research" Splitters that sample the feature indices and weights in different ways. The base of these would be the oblique trees that Breiman proposed as Forest-RC.

Stepping back a little, I see ObliqueTree meets the inclusion criteria through the citations of the original Random Forest paper, but including it will increase the maintenance cost of the library. I think oblique trees need a story around "How should a user decide between the current trees and the oblique ones?" to be considered for inclusion. From the benchmarks you have shown and from the "Sparse Projection Oblique Randomer Forests" paper, it looks like it is another hyper-parameter to tune.

Based on the papers listed below, we believe the general recommendation should be to use oblique trees rather than axis-aligned ones. These papers all explore different types of "oblique" splits and compare them with traditional axis-aligned trees. In general, oblique forests work better, specifically when most of the features are continuous. If there is a separating hyperplane that is not axis-aligned, then an axis-aligned tree requires many splits to approximate it, versus hypothetically a single oblique split (see the sketch below). Empirically, oblique trees also tend to be smaller. The caveat is the extra computation time, which we can run more experiments to explore. However, even on the CC-18 suite, we demonstrate that the extra computation time is not "that" much more. One could drastically speed up computation by binning the feature space.
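To make the hyperplane argument concrete, here is a small runnable illustration using only stock scikit-learn (the data and the fixed projection w = (1, 1) are constructed just for this example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # diagonal separating hyperplane

# An axis-aligned tree must staircase along the diagonal boundary.
axis_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("axis-aligned leaves:", axis_tree.get_n_leaves())  # typically dozens

# A single oblique split is exact: project onto w = (1, 1), threshold at 0.
oblique_pred = (X @ np.array([1.0, 1.0]) > 0).astype(int)
print("oblique training accuracy:", (oblique_pred == y).mean())  # 1.0
```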

We can also fix the sparsity of the oblique projections to some specific number to remove the need to tune this hyperparameter. Note that I didn't tune this hyperparameter at all in the CC-18 benchmark shown, where oblique trees outperform RF.

  1. HHCART: An oblique decision tree. Wickramarachchi DC, Robertson BL, Reale M, Price CJ, Brown J. Comput. Stat. Data Anal., 2016.
  2. Canonical Correlation Forests. Rainforth T, Wood F. Arxiv 2015.
  3. Alternating optimization of decision trees, with application to learning sparse oblique trees. Carreira-Perpinan MA, Tavallali P. Advances in Neural Information Processing Systems 31, 2018.
  4. End-to-End Learning of Deterministic Decision Trees. Hehn TM, Hamprecht FA. Pattern Recognition, 2019.
  5. A System for Induction of Oblique Decision Trees. Murthy SK, Kasif S, Salzberg S. Journal of Artificial Intelligence Research, 1994.
adam2392 commented 2 years ago

Keeping notes from the Discord discussion: the general issue is the inclusion of OF and how to tell users about RF vs. OF, i.e. how we can convince the core dev team that adding this complexity is worth it.

Lmk if I missed anything.

jovo commented 2 years ago

I suspect @fwood will have deeply insightful things to say about oblique trees/forests.

adam2392 commented 2 years ago

Update 01/11/2022

I've updated the description of the GH issue with two gist notebooks: one benchmarking the expected runtime performance as a function of n_features and n_samples, and one comparing the decision surface of an axis-aligned and oblique tree on the iris dataset.

Taken together, these demonstrate that oblique trees are not that much more computationally expensive in general. Moreover, the visualization example on the iris dataset would serve as a motivating example for users of how to think about oblique trees/splits. The goal is to demonstrate to the core devs that inclusion of oblique trees is well motivated.

  1. Runtime benchmarking dependency on number of features and number of samples: The notebook is shown in this gist: https://gist.github.com/adam2392/a652aa1c88ba94f0d6aab96ccd3ade24

  2. Visualizing the intuitive differences between oblique and axis-aligned trees: The following notebook shows the decision surface of an oblique tree vs. an axis-aligned tree: https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115. It would educate an incoming user on when/how oblique trees are different. We can extend the existing example in sklearn.

I'm pinging the people who were suggested as the core devs involved in the tree code to get a firm answer on inclusion: @NicolasHug, @amueller, @ogrisel, @glemaitre, @adrinjalali and @thomasjpfan. Lmk if I missed anything, or anyone.

Misc.

As discussed in OH on 12/6/2021, we realize that the modification of the tree code will involve i) refactoring the tree code, ii) a PR adding oblique tree functionality along with tests, and iii) updated examples discussing axis-aligned vs. oblique trees. We are happy to help refactor and improve the tree code, and we believe this series of PRs will be a great improvement to scikit-learn, paving the way for downstream scikit-tree/forest-based packages.

Moreover, I realize that there are some big projects going on for scikit-learn, so the goal of this comment is to first get on the same page regarding inclusion, given the evidence provided from i) publications, ii) the CC-18 benchmark, iii) runtime benchmarks, and iv) an intuitive visualization.

adam2392 commented 2 years ago

Wanted to ping the above, since it's been a while. In summary:

Since the main issue is the maintenance, I've done some work in making the upgrade simpler and modular. Are there any issues I haven't covered? I would like to get some developer opinions before ppl forget :p.

Improving Maintainability

The core issue is that fused types are not allowed in extension types (i.e. Cython classes). However, this can be circumvented with Tempita (see: https://github.com/scikit-learn/scikit-learn/pull/20481). Moreover, with some minor refactoring of the tree codebase, I can demonstrate that the oblique upgrade is a lot easier than previously shown.
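A toy sketch of the Tempita expansion, using the Tempita engine bundled with Cython (the template content is illustrative, not the actual tree code):

```python
from Cython import Tempita

# One template expands into one `cdef class` per dtype at build time,
# side-stepping the restriction that fused types cannot appear in
# extension-type attributes.
template = Tempita.Template("""
{{for name, c_type in dtypes}}
cdef class Splitter_{{name}}:
    cdef {{c_type}} threshold
{{endfor}}
""")

print(template.substitute(dtypes=[("float32", "float"),
                                  ("float64", "double")]))
```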

amueller commented 2 years ago

I think the citation criterion is a bit tricky because the original paper isn't usually cited for this specific algorithm, but I think this meets the criteria. If the maintenance burden isn't too crazy, I think we should try to include it.

adam2392 commented 2 years ago

For those interested, here are the built docs describing the oblique trees: https://output.circle-artifacts.com/output/job/78b93aa0-e14c-4b76-a3ba-86b87f501d88/artifacts/0/doc/modules/tree.html#oblique-trees