neurodata / scikit-learn

scikit-learn-tree fork: A fork that enables extensions of the Python and Cython APIs for decision trees
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

[ENH] Adding binning capabilities to decision trees #23

Open adam2392 opened 2 years ago

adam2392 commented 2 years ago

Describe the workflow you want to enable

I am part of the @neurodata team. Binning features has made gradient-boosted trees highly efficient with little loss in performance. This feature should not be limited to gradient-boosted trees; it should be available in all decision trees [1].

By including binning as a feature for decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both features and sample sizes). This would be an additional speed/accuracy tradeoff that users can opt into. The intuition behind binning for decision trees is exactly that of gradient-boosted trees: split finding only has to sweep over bin edges rather than every unique feature value.
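A minimal NumPy illustration of that intuition (not code from the PR; quantile binning in the style of sklearn's histogram GBDTs):

```python
# Minimal illustration (plain NumPy, not the PR's code) of why binning speeds
# up split finding: candidate thresholds drop from ~n_samples to max_bins.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # one continuous feature

max_bins = 255
# Quantile binning: interior quantiles become the bin edges.
edges = np.quantile(x, np.linspace(0, 1, max_bins + 1)[1:-1])
x_binned = np.searchsorted(edges, x).astype(np.uint8)

print(np.unique(x).size)         # ~100_000 candidate split thresholds
print(np.unique(x_binned).size)  # at most 255
```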

Describe your proposed solution

We propose introducing binning to the decision tree classifier and regressor.

An initial PR is proposed here: https://github.com/neurodata/scikit-learn/pull/24#pullrequestreview-1009913319. However, many of the files appear to have been copied wholesale, and it is not 100% clear that this is needed. Perhaps we can explore consolidating the _binning.py/pyx files with the current versions under ensemble/_hist_gradient_boosting/*.
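For reference, the existing binner is already importable from the histogram-GBDT module, which suggests reuse over copying (it is private API, so this is only a sketch of the idea):

```python
# Sketch of reusing the existing (private) binner rather than copying files.
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

X = np.random.default_rng(0).normal(size=(1_000, 5))
mapper = _BinMapper(n_bins=256)     # 255 real bins + 1 reserved for missing
X_binned = mapper.fit_transform(X)  # uint8 array of bin indices
print(X_binned.dtype, X_binned.max())  # uint8, at most 254 without missing values
```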

Changes to the Cython codebase

TBD

Changes to the Python API

The following two parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor:

hist_binning=False,
max_bins=255

where the default number of bins follows that of HistGradientBoosting.
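Usage would then read as follows (hist_binning and max_bins are the proposed parameters above, not part of released scikit-learn):

```python
# Hypothetical usage of the proposed parameters (not released sklearn API).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, n_features=100, random_state=0)
clf = DecisionTreeClassifier(hist_binning=True, max_bins=255)  # proposed flags
clf.fit(X, y)
```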

Additional context

These changes can also trivially be applied to Oblique Trees.

References: [1] G. Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree

adam2392 commented 2 years ago

@jovo and @jshinm here's a draft Issue for the binning problem for decision trees.

I am parsing @p-teng's proposed original changes. Currently there is too much copy-pasting from the original "binning" files to make it work for decision trees; the sklearn devs will prefer/want something with a short diff. Basically, we need to determine whether we can import or lightly modify the existing binning code so that decision trees can also be subbed in.

@p-teng if you have any comments from when you tried it, please let me know, e.g. whether there were any technical reasons you needed to copy/paste the existing binning files from ensemble/_hist_gradient_boosting/* into the ensemble/* folder. Thanks!

CC: @PSSF23 in case you're interested.

PSSF23 commented 2 years ago

As mentioned in my comment in #22, this branch should be more helpful if you want to compare the changes Philip made: https://github.com/PSSF23/scikit-learn-stream/tree/hist

Overall, the work does not look complete to me.

adam2392 commented 2 years ago

Unfortunately, as the work in #22 shows, Philip's changes are still very much incomplete. There was no need to copy certain code. Moreover, the binning capability needs to be consistent between the fit and predict phases. Finally, the code itself does not work or compile and is simply a "skeleton" of what needs to be done.

To eventually make things work... I think this can be as simple as adding a few LOC to each Python function to enable the trees to bin the data, possibly refactoring out a private function that does this (see the sketch below). I've begun doing this in #25. I think if @jshinm is interested in this kind of scientific software work and in getting experience developing OSS, then this is a good opportunity, since the changes needed to make everything work are considerably smaller than those for Oblique Forests.
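A minimal sketch of what that private helper could look like (the name `_bin_data` is hypothetical, and this is not the actual #25 code): it reuses the existing _BinMapper and casts back to the tree's DTYPE so the current Splitter works unchanged.

```python
# Hypothetical private helper on BaseDecisionTree (not the actual #25 code).
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from sklearn.tree._tree import DTYPE  # the dtype the tree Splitter expects


def _bin_data(self, X):
    """Fit a bin mapper on X and return a binned copy in the tree's DTYPE.

    Casting back from uint8 sacrifices the RAM savings but means the
    existing Splitter code does not need to change.
    """
    # n_bins counts one extra bin reserved for missing values.
    self._bin_mapper = _BinMapper(n_bins=self.max_bins + 1)
    X_binned = self._bin_mapper.fit_transform(X)  # uint8 bin indices
    return np.ascontiguousarray(X_binned, dtype=DTYPE)
```

Storing the fitted mapper on the estimator would also let predict bin incoming data with the same edges, addressing the fit/predict consistency point above.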

We'll see what jovo thinks. This will probably require a bit more work + testing, so idk if I have the time to dedicate to this.

jshinm commented 2 years ago

@adam2392 Thank you for your insights, Adam! This is definitely an interesting project, and some guidance on your end will certainly keep the ball rolling 😄

adam2392 commented 2 years ago

Another note from the sklearn devs:

In order to add the binning capabilities to the trees, most likely we would need to enable it for all trees, meaning RandomForests and ExtraTrees. Then I think we would want to show that there is a relevant performance/efficiency tradeoff.

^ This seems like quite a bit of work, so I think a simple sketch of what the code needs to look like is the ideal first step, and discussing with the sklearn devs is the next. Ideally someone from NDD can do this, because it's a straightforward coding project imo (assuming you know some Cython and how to structure Python files).

adam2392 commented 1 year ago

It seems like predict has different implementations depending on the DTYPE:

https://github.com/scikit-learn/scikit-learn/blob/75ae699281549bf455f0767fa65e78268c8fd704/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx#L16

The issue is that https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/ensemble/_hist_gradient_boosting/binning.py#L276-L279 also shows that the BinMapper transforms the data to uint8 to conserve RAM; using that output directly would require generalizing the tree Splitter to accept uint8 inputs, which is not trivial.

I think for now, to enable binning at this scale, we can just sacrifice RAM (i.e. create a binned copy of the dataset as np.float64, which obviously uses more RAM than necessary).
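Back-of-envelope arithmetic for that tradeoff:

```python
# Memory footprint of the binned copy: float64 is 8x the uint8 version.
n_samples, n_features = 1_000_000, 100
print(n_samples * n_features * 1 / 1e9)  # uint8 copy:   0.1 GB
print(n_samples * n_features * 8 / 1e9)  # float64 copy: 0.8 GB
```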

A path to getting around this is adding a fused type definition for places like this: https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/tree/_splitter.pyx#L708

This way, the class could work with both DTYPE_t (i.e. np.float64) and, e.g., uint8. A rough sketch of the idea follows.
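A hypothetical Cython sketch of that fused type (names are illustrative, not the fork's actual code):

```cython
# Hypothetical fused type so split finding compiles for both the tree's
# floating-point DTYPE_t and the uint8 bin indices from _BinMapper.
cimport numpy as cnp
from sklearn.tree._tree cimport DTYPE_t  # the tree module's float typedef

ctypedef fused X_t:
    DTYPE_t
    cnp.npy_uint8

cdef inline double feature_value(const X_t[:, ::1] X,
                                 Py_ssize_t i, Py_ssize_t j) nogil:
    # Identical code path whether X holds raw floats or bin indices.
    return <double> X[i, j]
```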