adam2392 opened 3 years ago
To clarify, we have a Cythonized codebase that generated these results. It is roughly factored into the scikit-learn fork right now, but once the devs discuss and give me the green light, I can open a PR where the diff will be more obvious.
I think that SPORF does not fulfil the inclusion criterion: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
As stated in the documentation, you could create a scikit-learn-contrib package that would depend on scikit-learn or if the maintenance of the full package is too much, check for inclusion in scikit-learn-extra. This would give visibility to the project.
that doesn't affect non-oblique RF in any way
When I looked at the PR, it looked to me like there was a lot of duplicated code. I thought that it could be possible to inherit more functionality from the scikit-learn code base (but I did not look closely enough to have useful feedback, though).
Hi @glemaitre I am Adam, part of the @neurodata group.
I think that SPORF does not fulfil the inclusion criterion: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
Thank you for your comment. We apologize for the confusion here. This issue and PR are primarily a `Forest-RC` Cython implementation, with some additional functionality. I went ahead and edited the proposal to be a bit more clear.

The proposal here is specifically for `Forest-RC`, which was introduced by Breiman in 2001. `Forest-RC` complements and builds on top of the `Forest-RI` algorithm (called `RandomForestClassifier`/`RandomForestRegressor` in scikit-learn). Since that paper has over 75,000 citations and has been monumental in the field, many papers and experiments have demonstrated the empirical performance of variants of `Forest-RC`, which further validates the strength of the approach.
When I looked at the PR, it looked to me like there was a lot of duplicated code. I thought that it could be possible to inherit more functionality from the scikit-learn code base (but I did not look closely enough to have useful feedback, though).
I have tidied up a test PR to make the diff easier to see: https://github.com/neurodata/scikit-learn/pull/11/files.
In the PR, the main new code/logic is in the files that handle "oblique splitting". This enables one to sample random features and combine them with weights before performing the usual Random Forest procedure. Right now it duplicates some portions of the code because I presume it's better not to touch the highly optimized Random Forest code.

However, I can imagine that with a slight refactoring of the underlying Random Forest tree code to allow weights applied to each split (currently they are implicitly just unit vectors, i.e. single features), oblique trees could take up far fewer LOC.
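For intuition, here is a minimal NumPy sketch of what a single oblique split does at a node; this is illustrative only, not the PR's Cython code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # samples reaching the current node

# Axis-aligned split: threshold one raw feature (an implicit unit-vector weight).
left_axis = X[:, 3] <= 0.5

# Oblique split: sample a few features, combine them with random weights
# (uniform on [-1, 1], as in Breiman's Forest-RC), then threshold the
# resulting 1-D projection instead of a raw feature.
features = rng.choice(X.shape[1], size=3, replace=False)
weights = rng.uniform(-1.0, 1.0, size=3)
left_oblique = (X[:, features] @ weights) <= 0.5
```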
WDYT?
Hi @glemaitre just wanted to follow up on this thread now that v1.0 has been released (congrats and awesome milestone for the package). Happy to jump on a zoom call if that's easier too!
Hi @glemaitre just wanted to follow up again on this. To reiterate my points above: this algorithm (`Forest-RC`) is the one proposed in Breiman's 2001 paper and also has good empirical support from the CC18 benchmark. I think there are some issues to sort out regarding how best to factor in this code.
I'm happy to submit a PR for initial skim review and then possibly break up the cython portions into smaller PRs for easier review and merges.
Comments from a call w/ the scikit-learn Discord OH (general), with @glemaitre @adrinjalali @thomasjpfan:
The goal here is to refactor the complicated tree code currently to support easier maintenance and then allow inclusion of more exotic features at the tree level.
Possible order of PRs for the first refactor:
^ @adam2392 to "try" the refactoring, going from easier to harder
Hi @thomasjpfan just wanted to see if there was any initial draft that you had made that wasn't finalized, so I can kinda see where this might go. Happy to contribute thoughts too!
Also, FYI: https://github.com/neurodata/scikit-learn/pull/11 has a test draft PR demonstrating how I "naively" implemented oblique trees on top of the existing axis-aligned trees. In case there's some inspiration for cases the design doc might need to account for?
There are several other items on my TODO list before I get to the tree design. I should have a design ready to show in 2-3 weeks.
In case there's some inspiration for cases the design doc might need to account for?
Your draft PR will be something I will keep in mind when designing the trees. I want to make sure the design will be extendable for your use case.
Hi @thomasjpfan just wanted to check in. Or would OH be a better time to discuss?
Hope you had a good Thanksgiving @thomasjpfan. Just wanted to check in on the status of this?
There are a few more constraints I need to work out:
I have finally gotten through most of my backlog, so the tree design + prototype is on my agenda this week.
Sounds good, happy to discuss and help implement when this is finalized! Exciting.
To keep a written record, I also need to look at:
I am trying to think of a good roadmap to get a tree refactor going, since the tree code is tightly coupled. There is a lot of code to refactor to satisfy all the use cases. The refactor would make new features easier, but the first PR will take a long time to review.
`SplitRecord` being a struct is useful because it does not require the GIL to create. I think there are ways to work around this and make `SplitRecord` into a class. Even with a workaround, I think one still needs custom logic for a custom split record, which needs a custom `Splitter` and a custom `Tree` object to understand the custom record type. In the end I see it going back to what was already done in https://github.com/neurodata/scikit-learn/pull/11. There is a lot of flexibility you get from copying the existing code and making modifications where you see fit.
I am trying to think of a good roadmap to get a tree refactor going, since the tree code is tightly coupled. There is a lot of code to refactor to satisfy all the use cases. The refactor would make new features easier, but the first PR will take a long time to review.
Sounds good. Happy to help implement this!
In the end I see it going back to what was already done in neurodata#11
Is this because Cython doesn't support enough C++ features yet? I was thinking that if you can subclass `SplitRecord` and `Splitter`, then the `Tree` object should work out of the box? In https://github.com/neurodata/scikit-learn/pull/11, the `Tree` code is actually all the same except for the types of `SplitRecord` and `Splitter`.
I can come to OH to discuss more if you'll be there?
I can come to OH to discuss more if you'll be there?
I'll be attending so we can chat there. In the `ObliqueTree` case, it needs to know how to add a node based on the custom `SplitRecord`, which is what you did with `_add_oblique_node`. If we want to be fully generic, then the `SplitRecord` should be passed into `_add_node` and we let the tree decide how it wants to add the record. `ObliqueTree` also needs to have its own `apply` functions because it has additional weights to take into account. For the `ObliqueTree`, I think only the builder is generic enough to not need a rewrite, as long as the `SplitRecord` is passed around.
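To make the generic design concrete, here is a rough pure-Python sketch; the class and method names mirror the Cython ones (`SplitRecord`, `_add_node`, `apply`), but everything else is illustrative, not scikit-learn's actual API:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SplitRecord:
    feature: int
    threshold: float


@dataclass
class ObliqueSplitRecord(SplitRecord):
    # Extra state that only oblique splits carry.
    proj_features: Optional[np.ndarray] = None
    proj_weights: Optional[np.ndarray] = None


class Tree:
    def _add_node(self, record: SplitRecord) -> None:
        # The whole record is passed in; each tree decides how to store
        # it, so no oblique-specific logic leaks into the base class.
        raise NotImplementedError

    def apply(self, X: np.ndarray) -> np.ndarray:
        # Axis-aligned traversal: compare X[:, feature] to the threshold.
        raise NotImplementedError


class ObliqueTree(Tree):
    def apply(self, X: np.ndarray) -> np.ndarray:
        # Overridden traversal: compare the projection
        # X[:, proj_features] @ proj_weights to the threshold.
        raise NotImplementedError
```

The key design choice here is that `_add_node` takes the whole record, so a new split type only extends the record and overrides `apply`, while the builder stays generic.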
Stepping back a little, I see `ObliqueTree` meets inclusion through citations of the original Random Forest paper, but including it will increase the maintenance cost of the library. I think oblique trees need a story around "How should a user decide between the current trees and the oblique ones?" to be considered for inclusion. From the benchmarks you have shown and from the "Sparse Projection Oblique Randomer Forests" paper, it looks like it is another hyperparameter to tune.
Since you have a working implementation for oblique trees, I think you can create a python package and release it. This would be the fastest way to get something released and get feedback from users.
I'll be attending so we can chat there.
Sounds good. I will be there.
In the `ObliqueTree` case, it needs to know how to add a node based on the custom `SplitRecord`, which is what you did with `_add_oblique_node`. If we want to be fully generic, then the `SplitRecord` should be passed into `_add_node` and we let the tree decide how it wants to add the record. `ObliqueTree` also needs to have its own `apply` functions because it has additional weights to take into account. For the `ObliqueTree`, I think only the builder is generic enough to not need a rewrite, as long as the `SplitRecord` is passed around.
Sounds like a good plan and idea to me. We are happy to work with you extensively on implementations and refactoring.
This generalization could then very easily allow researchers to build on the sklearn implementation with their own `Splitter`s, which could be housed in other repositories. For example, one can imagine a number of "new research" splitters that sample the feature indices and weights in different ways. The base of these would be the oblique trees that Breiman proposed as `Forest-RC`.
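As a hedged sketch of what such pluggable projection samplers could look like (the function names are hypothetical, not from the fork):

```python
import numpy as np


def sample_forest_rc(rng, n_features, n_combine=3):
    """Breiman's Forest-RC: combine a few features with coefficients
    drawn uniformly from [-1, 1]."""
    idx = rng.choice(n_features, size=n_combine, replace=False)
    return idx, rng.uniform(-1.0, 1.0, size=n_combine)


def sample_sporf(rng, n_features, density=0.1):
    """SPORF-style: a sparse projection with +/-1 weights."""
    n_nonzero = max(1, int(density * n_features))
    idx = rng.choice(n_features, size=n_nonzero, replace=False)
    return idx, rng.choice([-1.0, 1.0], size=n_nonzero)
```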
Stepping back a little, I see `ObliqueTree` meets inclusion through citations of the original Random Forest paper, but including it will increase the maintenance cost of the library. I think oblique trees need a story around "How should a user decide between the current trees and the oblique ones?" to be considered for inclusion. From the benchmarks you have shown and from the "Sparse Projection Oblique Randomer Forests" paper, it looks like it is another hyperparameter to tune.
Based on the papers attached, we believe the general recommendation should be to use oblique trees rather than axis-aligned ones. These papers all explore different types of "oblique" splits and compare them with traditional axis-aligned trees. In general, oblique forests work better, specifically when most of the features are continuous. If there is a separating hyperplane that is not axis-aligned, then an axis-aligned tree requires many splits to approximate that hyperplane, versus hypothetically one oblique split. Empirically, oblique trees also tend to be smaller. The caveat is the extra computation time, which we can run more experiments to explore. However, even on the CC18 dataset, we demonstrate that the extra computation time is not "that" much more. One could drastically speed up computation by applying binning on the feature space.
We can also fix the sparsity of the obliqueness to a specific value to remove the need to tune this hyperparameter. Note that I didn't tune this hyperparameter at all in the CC18 ML benchmark shown, where oblique trees outperform RF.
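Here is a small runnable illustration of the staircase argument, where a single oblique split is emulated by projecting the data onto the separating direction first:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a non-axis-aligned boundary

# The axis-aligned tree approximates the diagonal with a staircase of splits.
axis_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("axis-aligned leaves:", axis_tree.get_n_leaves())

# One oblique projection makes the problem separable with a single threshold.
projected = (X @ np.array([1.0, 1.0])).reshape(-1, 1)
oblique_tree = DecisionTreeClassifier(random_state=0).fit(projected, y)
print("leaves after an oblique projection:", oblique_tree.get_n_leaves())
```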
Keeping notes from the Discord discussion: the general issue is the inclusion of OF (oblique forests) and how to tell users about RF vs OF, i.e. how we can convince the core dev team that adding this complexity is worth it.
Use `make_regression` and `make_classification` (or the scikit-learn datasets module) to make datasets that are linearly connected and that also scale higher, so we can demonstrate how OF scales vs RF for training and prediction on more data (see the sketch below).

Lmk if I missed anything.
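For reference, a sketch of the scaling benchmark described above, assuming plain `RandomForestClassifier` as the baseline (an oblique forest would be timed the same way; the dataset sizes are illustrative):

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

for n_samples, n_features in [(1_000, 10), (10_000, 100), (50_000, 200)]:
    # Linearly informative data of growing size.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=min(10, n_features),
        n_redundant=0,
        random_state=0,
    )
    start = perf_counter()
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
    print(f"RF fit, n_samples={n_samples}, n_features={n_features}: "
          f"{perf_counter() - start:.2f}s")
```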
@fwood I suspect will have deeply insightful things to say about Oblique Trees/Forests
I've updated the description of the GH issue with two gist notebooks: one benchmarking the expected runtime performance as a function of `n_features` and `n_samples`, and one comparing the decision surfaces of an axis-aligned and an oblique tree on the iris dataset.
Taken together, these motivate that oblique trees are not that much more computationally expensive in general. Moreover, the visualization example with the iris dataset would serve as a motivating example for how users should think about oblique trees/splits. The goal is to demonstrate to the core devs that the inclusion of oblique trees is well motivated.
Runtime benchmarking dependency on number of features and number of samples: The notebook is shown in this gist: https://gist.github.com/adam2392/a652aa1c88ba94f0d6aab96ccd3ade24
Visualizing the intuitive differences between oblique and axis-aligned trees: The following notebook shows the decision surface of an oblique tree vs an axis-aligned tree: https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115. It would educate an incoming user on when/how oblique trees are different. We can extend the existing example in sklearn, as sketched below.
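As a starting point for such an example, here is a rough sketch; since the oblique splitter itself lives in the fork, the oblique tree is emulated here by fitting on a `GaussianRandomProjection` of the inputs inside a pipeline:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # two features so the decision surface is plottable

axis_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
oblique_like = make_pipeline(
    GaussianRandomProjection(n_components=2, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
).fit(X, y)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, clf, title in [(axes[0], axis_tree, "axis-aligned tree"),
                       (axes[1], oblique_like, "tree on an oblique projection")]:
    DecisionBoundaryDisplay.from_estimator(
        clf, X, ax=ax, response_method="predict", alpha=0.4
    )
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    ax.set_title(title)
plt.show()
```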
I'm pinging the people who were suggested as the core devs involved in the tree code, to get a firm answer on inclusion: @NicolasHug, @amueller, @ogrisel, @glemaitre, @adrinjalali and @thomasjpfan. Lmk if I missed anything, or anyone.
As discussed in OH on 12/6/2021, we realize that the modification of the tree code will involve i) refactoring the tree code, ii) a PR adding oblique tree functionality along with tests, and iii) updated examples discussing axis-aligned vs oblique trees. We are happy to help refactor and improve the tree code and believe this series of PRs will be a great improvement to scikit-learn, paving the way for downstream scikit-tree/forest-based packages.
Moreover, I realize that there are some big projects going on for scikit-learn, so the goal of this comment is to first get on the same page regarding inclusion, given the evidence provided from i) publications, ii) the CC18 benchmark, iii) runtime benchmarks, and iv) an intuitive visualization.
Wanted to ping the above, since it's been awhile. In summary:
Since the main issue is maintenance, I've done some work to make the upgrade simpler and more modular. Are there any issues I haven't covered? I would like to get some developer opinions before ppl forget :p.
The core issue is that fused types are not allowed in extension types (i.e. Cython classes). However, this can be circumvented with Tempita (see: https://github.com/scikit-learn/scikit-learn/pull/20481). Moreover, with some minor refactoring of the tree codebase, I can demonstrate that the oblique upgrade is a lot easier than previously shown.
I think the citation criterion is a bit tricky because the original paper isn't cited for this algorithm specifically, but I think this meets the criteria. If the maintenance burden isn't too crazy, I think we should try to include it.
For those interested, here is the built docs describing the oblique trees: https://output.circle-artifacts.com/output/job/78b93aa0-e14c-4b76-a3ba-86b87f501d88/artifacts/0/doc/modules/tree.html#oblique-trees
Describe the workflow you want to enable
I am part of the @neurodata team. We are proposing to add a cythonized module that allows for building oblique trees.
Oblique trees, otherwise known as `Forest-RC` and originally proposed by Breiman in 2001 [1], would enable a more flexible class of decision trees. Breiman called a version of these Forest-RC, as compared to what most people call random forest, which he called Forest-RI. He noted: "Forest-RC does exceptionally well on the synthetic data sets. Overall, it compares more favorably to Adaboost than Forest-RI."
In other words, in the paper introducing random forests over 20 years ago, now with over 75,000 citations, Breiman presented two algorithms and argued that the one using sparse linear combinations of features worked better than the naive one most people have used since. More recently, several authors have corroborated these original results [2-6]. Oblique trees in general open up a wide class of flexibility for trees to take advantage of structure in data.
[1] Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
[2] Tomita, T. M., Browne, J., Shen, C., Chung, J., Patsolic, J. L., Falk, B., Yim, J., Priebe, C. E., Burns, R., Maggioni, M., and Vogelstein, J. T. Sparse Projection Oblique Randomer Forests. Journal of Machine Learning Research, 2020.
[3] Tomita, T. M., Maggioni, M., and Vogelstein, J. T. ROFLMAO: Robust Oblique Forests with Linear Matrix Operations. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 498–506. Society for Industrial and Applied Mathematics, 2017.
[4] Rainforth, T., and Wood, F. Canonical Correlation Forests. 2015.
[5] Perry, R., Li, A., Huynh, C., Tomita, T. M., Mehta, R., Arroyo, J., Patsolic, J., Falk, B., and Vogelstein, J. Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks. 2019.
[6] Menze, B., Kelm, B., Splitthoff, D., Köthe, U., and Hamprecht, F. On Oblique Random Forests. Pages 453–469, 2011. https://doi.org/10.1007/978-3-642-23783-6_29
Describe your proposed solution
The implementation of Forest-RC (oblique forests) was ported into a scikit-learn fork (https://github.com/neurodata/scikit-learn/pull/11/). The Forest-RC Cython implementation currently builds cleanly on top of the existing Cython code for the regular decision trees and thus does not affect the runtime or performance of Random Forests at all. Moreover, it improves upon RF on certain CC18 benchmarks (see figure below).
The `.pxd` files can also be included in each scikit-learn release, so that other sklearn-compatible forest modules can build upon our implementation of the oblique tree.

Relevant Benchmarks
To ease scikit-learn developer concerns about incorporating such a large code change, we can provide the following benchmarks and tests of Cythonized Forest-RC vs Cythonized Forest-RI. These benchmarks were run on the OpenML CC-18 datasets. `max_features` was set to `n_features`, so runtime was very high for both RF and oblique forests on e.g. Fashion-MNIST.
Runtime benchmarking dependency on number of features and number of samples: In addition, we benchmarked the runtime performance of oblique trees/forests using the code provided in `benchmarks/bench_tree.py`. The notebook is shown in this gist: https://gist.github.com/adam2392/a652aa1c88ba94f0d6aab96ccd3ade24

Visualizing the intuitive differences between oblique and axis-aligned trees: Per a discussion with @thomasjpfan, he wanted to see how we can educate an incoming user on when/how oblique trees are different. We can extend the existing example in sklearn. The following notebook shows the decision surface of an oblique tree vs an axis-aligned tree: https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115
Additional context
cc: @jovo
Some action items are: