Implement split nodes that can consider categorical features

adam2392 commented 1 year ago

We would need to enable this in the sklearn fork's splitter. Unfortunately, the original PR in upstream sklearn was never merged: https://github.com/scikit-learn/scikit-learn/pull/12866.

- Generalize the "threshold of the split" to be either a numeric threshold or a categorical bit selector
- Implement Breiman's shortcut for binary classification with categorical splits
- Implement the general categorical split that evaluates up to 2^8 possible random categories for splitting
- Implement the Python API layer in BaseDecisionTree and follow the HistGradientBoosting* API patterns
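
To make the first two items concrete, here is a minimal pure-Python sketch, not the fork's Cython splitter. It assumes categories are small non-negative integer codes, that a split stores an integer bitset whose set bits mark the categories sent left, and that Gini impurity is the criterion. The names `goes_left`, `gini`, and `breiman_best_split` are hypothetical and only for illustration.

```python
from collections import defaultdict


def goes_left(category: int, cat_bitset: int) -> bool:
    """Return True when the bit for `category` is set in the split's bitset."""
    return (cat_bitset >> category) & 1 == 1


def gini(n_pos: int, n_total: int) -> float:
    """Gini impurity of a node with binary (0/1) labels."""
    if n_total == 0:
        return 0.0
    p = n_pos / n_total
    return 2.0 * p * (1.0 - p)


def breiman_best_split(categories, labels):
    """Find the best categorical split for 0/1 labels.

    Categories are sorted by their fraction of positive labels, so only the
    k - 1 prefixes of that ordering are scored instead of every one of the
    2^(k-1) - 1 possible partitions. Returns (bitset, weighted_gini).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [n_samples, n_positive]
    for c, y in zip(categories, labels):
        counts[c][0] += 1
        counts[c][1] += y

    ordered = sorted(counts, key=lambda c: counts[c][1] / counts[c][0])
    n_total = len(labels)
    pos_total = sum(labels)

    best_bitset, best_score = 0, float("inf")
    bitset, n_left, pos_left = 0, 0, 0
    for c in ordered[:-1]:  # the full prefix would send every sample left
        bitset |= 1 << c
        n_left += counts[c][0]
        pos_left += counts[c][1]
        n_right = n_total - n_left
        score = (
            n_left * gini(pos_left, n_left)
            + n_right * gini(pos_total - pos_left, n_right)
        ) / n_total
        if score < best_score:
            best_bitset, best_score = bitset, score
    return best_bitset, best_score


# Toy data: four categories, binary labels
cats = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [0, 0, 1, 1, 0, 1, 1, 1]
bitset, score = breiman_best_split(cats, ys)
```

On this toy data the chosen bitset sends categories 0 and 2 left. A single Python integer stands in for the bit selector here; a real splitter would need a fixed-width representation, which is an implementation detail this sketch does not address.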

adam2392 commented 1 year ago

A benchmark using the OpenML-CC18 datasets that have categorical features would be nice: https://github.com/scikit-learn/scikit-learn/pull/12866#issuecomment-455350207

Basically, run sklearn without categorical support (using one-hot encoding) versus with categorical support, and compare both.

jovo commented 1 year ago

neurodata / treeple