Support `max_bins > 255` in Hist-GBDT estimators and categorical features with high cardinality

scikit-learn / scikit-learn

scikit-learn: machine learning in Python

https://scikit-learn.org

BSD 3-Clause "New" or "Revised" License

Support `max_bins > 255` in Hist-GBDT estimators and categorical features with high cardinality #26277

Open NicolasHug opened 1 year ago

NicolasHug commented 1 year ago

As originally sketched in https://github.com/scikit-learn/scikit-learn/pull/26268#issuecomment-1520504489, there might be a way to support arbitrarily high values of `max_bins` for both categorical and numerical features. This may not be critical for numerical features, but it would enable categorical features of arbitrary cardinality, which is desirable.

The rough idea is to internally map an input categorical feature into multiple binned features (probably num_categories // 255 + 1 features) and to update the Splitter and the predictors to treat that group of features as a single feature.
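To make the idea concrete, here is a minimal sketch of one possible mapping, not an actual implementation. It assumes category codes are integers in `[0, num_categories)`, that each sub-feature offers 255 usable bins, and that the value 255 is reserved as an "absent" marker. The names `split_categorical`, `merge_categorical`, and `ABSENT` are hypothetical and do not exist in scikit-learn.

```python
import numpy as np

BINS_PER_SUBFEATURE = 255  # assumption: 255 usable bins per uint8 sub-feature
ABSENT = 255               # hypothetical marker: "category is not in this sub-feature"


def split_categorical(codes, num_categories):
    """Spread integer category codes over several uint8 sub-features."""
    codes = np.asarray(codes, dtype=np.int64)
    # Number of sub-features, following the estimate given in the issue.
    n_sub = num_categories // BINS_PER_SUBFEATURE + 1
    out = np.full((codes.shape[0], n_sub), ABSENT, dtype=np.uint8)
    rows = np.arange(codes.shape[0])
    # Each sample gets a real bin in exactly one sub-feature.
    out[rows, codes // BINS_PER_SUBFEATURE] = codes % BINS_PER_SUBFEATURE
    return out


def merge_categorical(sub):
    """Recover the original codes from the group of sub-features."""
    group = np.argmax(sub != ABSENT, axis=1)
    rows = np.arange(sub.shape[0])
    return group * BINS_PER_SUBFEATURE + sub[rows, group].astype(np.int64)
```

With `num_categories=600` this yields three `uint8` columns, and `merge_categorical` inverts the mapping. A splitter and predictor that treat the group as a single feature would need the equivalent of `merge_categorical` wherever they look up a category.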

lorentzenchr commented 7 months ago

The number of bins is hardcoded in:

- Histograms as 2d-array of shape (n_features, n_bins)
- X binned as 2d-array of dtype=uint8
- Bitsets for categorical features as C array[8] of type uint32 (8*32=256)
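The 256-category limit of the bitset follows directly from its size. The snippet below is an illustrative Python sketch of such a fixed-size bitset, assuming one bit per bin index; the helper names `set_bit` and `in_bitset` are hypothetical and only mirror the idea, not the actual Cython code.

```python
import numpy as np

N_WORDS = 8  # 8 words of 32 bits each = 256 bits


def make_bitset():
    return np.zeros(N_WORDS, dtype=np.uint32)


def set_bit(bitset, value):
    # Word index is value // 32, bit position inside the word is value % 32.
    bitset[value // 32] |= np.uint32(1 << (value % 32))


def in_bitset(bitset, value):
    return (int(bitset[value // 32]) >> (value % 32)) & 1 == 1
```

Any bin index of 256 or more maps to word 8, which does not exist, so a larger `max_bins` needs either more words or a second structure.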

An alternative that allows for more than 256 bins is therefore:

- Histogram as a 1d array, with the positions where each feature starts (and ends) stored alongside it. This would save a lot of memory (and maybe improve cache hits).
- X binned as uint8, plus a second, wider X binned (e.g. uint16) for the features that need it, and a structure that bundles both behind a unified API.
- A second, extended bitset, similar to the existing one but twice the size, and a structure that bundles both behind a unified API.
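The first point can be sketched as follows. This is only an illustration under assumptions: `HIST_DTYPE` is a stand-in for the histogram entry struct, and `build_offsets` and `feature_hist` are hypothetical helper names.

```python
import numpy as np

# Stand-in for the histogram entry struct; field names are illustrative.
HIST_DTYPE = np.dtype(
    [("sum_gradients", np.float64), ("sum_hessians", np.float64), ("count", np.uint32)]
)


def build_offsets(n_bins_per_feature):
    """offsets[f] is where feature f starts; offsets[f + 1] is where it ends."""
    return np.concatenate(([0], np.cumsum(n_bins_per_feature))).astype(np.int64)


def feature_hist(hist, offsets, f):
    """View (not a copy) of the histogram entries that belong to feature f."""
    return hist[offsets[f]:offsets[f + 1]]


n_bins_per_feature = [3, 256, 1000]
offsets = build_offsets(n_bins_per_feature)
flat_hist = np.zeros(offsets[-1], dtype=HIST_DTYPE)

# Padded 2d layout for comparison: every feature gets max(n_bins) entries.
padded_hist = np.zeros(
    (len(n_bins_per_feature), max(n_bins_per_feature)), dtype=HIST_DTYPE
)
```

In this toy case the flat layout holds 1259 entries while the padded 2d layout holds 3000, which is the memory saving the first point refers to. Whether it also helps speed is something only benchmarks can show.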

NicolasHug commented 7 months ago

Updating the underlying data structure will lead to a different memory footprint, and likely different perf as well. Sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

lorentzenchr commented 7 months ago

> Updating the underlying data structure will lead to a different memory footprint, and likely different perf as well. Sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

In memory, the histograms look the same: a contiguous array of hist_struct. The only difference is that we currently might have quite some unused bins.

NicolasHug commented 7 months ago

I was thinking more of X_binned than of the histograms.

> The only difference is that we currently might have quite some unused bins

I assume that using a larger dtype is only going to worsen that problem? (Is that an actual problem in practice?)