neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai

Implement split nodes that can consider categorical features #90

Open adam2392 opened 1 year ago

adam2392 commented 1 year ago

We would need to enable this in the splitter of our sklearn fork. Unfortunately, the original PR in upstream sklearn was never merged: https://github.com/scikit-learn/scikit-learn/pull/12866.

  1. Generalize the "threshold of the split" so that it is either a numeric threshold or a categorical bit selector
  2. Implement Breiman's shortcut for binary classification with categorical splits
  3. Implement the general categorical split, which evaluates up to 2^8 randomly sampled category subsets as candidate splits
  4. Implement the Python API layer in BaseDecisionTree, following the HistGradientBoosting* API patterns
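
As a rough illustration of items 1 and 2, here is a minimal pure-Python sketch. It is not the fork's Cython splitter. It assumes categories are already encoded as small non-negative integers and labels are in {0, 1}. The names `goes_left`, `gini`, and `best_categorical_split` are hypothetical. The "bit selector" is an integer bitmask whose set bits mark the categories sent to the left child. Breiman's shortcut orders the categories by their fraction of positive labels, so that only the contiguous cuts of that ordering need to be scored.

```python
from collections import defaultdict


def goes_left(category: int, bitset: int) -> bool:
    """Return True if the category's bit is set, i.e. the sample goes left."""
    return (bitset >> category) & 1 == 1


def gini(n_pos: int, n_total: int) -> float:
    """Gini impurity of a node with binary labels."""
    if n_total == 0:
        return 0.0
    p = n_pos / n_total
    return 2.0 * p * (1.0 - p)


def best_categorical_split(categories, labels):
    """Breiman's shortcut for binary labels in {0, 1}.

    Returns (bitset, weighted_impurity); bitset is None when there are
    fewer than two distinct categories and no split is possible.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [n_samples, n_positive]
    for c, y in zip(categories, labels):
        counts[c][0] += 1
        counts[c][1] += y

    # Order categories by their fraction of positive labels.
    ordered = sorted(counts, key=lambda c: counts[c][1] / counts[c][0])
    n_total = len(labels)
    pos_total = sum(labels)

    best_bitset, best_impurity = None, float("inf")
    bitset = n_left = pos_left = 0
    for c in ordered[:-1]:  # keep at least one category on the right
        bitset |= 1 << c
        n_left += counts[c][0]
        pos_left += counts[c][1]
        n_right = n_total - n_left
        impurity = (
            n_left * gini(pos_left, n_left)
            + n_right * gini(pos_total - pos_left, n_right)
        ) / n_total
        if impurity < best_impurity:
            best_bitset, best_impurity = bitset, impurity
    return best_bitset, best_impurity


cats = [0, 0, 1, 1, 2, 2, 3, 3]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
bitset, impurity = best_categorical_split(cats, labels)
```

With k distinct categories this scores k - 1 candidate partitions instead of the 2^(k-1) - 1 possible ones. The shortcut does not carry over to multiclass problems, which is why item 3 is a separate step.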

adam2392 commented 1 year ago

Will be closed by: https://github.com/neurodata/scikit-learn/pull/46

adam2392 commented 1 year ago

A benchmark on the OpenML-CC18 datasets that have categorical features would be nice: https://github.com/scikit-learn/scikit-learn/pull/12866#issuecomment-455350207

Basically, run sklearn without categorical support (using one-hot encoding) against sklearn with categorical support, and compare the two.
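
A possible shape for such a harness, sketched with stock scikit-learn only. The native categorical arm does not exist yet, so an `OrdinalEncoder` pipeline stands in as a clearly labeled placeholder. The synthetic data below is likewise an assumption standing in for the CC18 datasets. `compare_encodings` is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier


def compare_encodings(X, y, cv=5, random_state=0):
    """Return the mean cross-validated accuracy per arm for an all-categorical X."""
    arms = {
        # Baseline arm: no categorical support, features are one-hot encoded.
        "one_hot": make_pipeline(
            OneHotEncoder(handle_unknown="ignore"),
            DecisionTreeClassifier(random_state=random_state),
        ),
        # Placeholder arm: swap in the tree with native categorical splits
        # once it exists. Ordinal codes are NOT categorical support.
        "ordinal_placeholder": make_pipeline(
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            DecisionTreeClassifier(random_state=random_state),
        ),
    }
    return {
        name: float(cross_val_score(est, X, y, cv=cv).mean())
        for name, est in arms.items()
    }


# Synthetic stand-in data: three categorical columns with six levels each.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(300, 3))
y = (X[:, 0] % 2 == 0).astype(int)
scores = compare_encodings(X, y)
```

For the real benchmark, the same function could be looped over the CC18 tasks, recording fit time and tree size alongside accuracy.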

jovo commented 1 year ago

Consider https://arxiv.org/pdf/1908.09874v3.pdf