scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Support `max_bins > 255` in Hist-GBDT estimators and categorical features with high cardinality #26277

Open NicolasHug opened 1 year ago

NicolasHug commented 1 year ago

As originally sketched in https://github.com/scikit-learn/scikit-learn/pull/26268#issuecomment-1520504489, there might be a way to support arbitrarily high values of `max_bins` for both categorical and numerical features. This may not be super critical for numerical features, but it would enable categorical features of arbitrary cardinality, which is desirable.

The rough idea is to internally map an input categorical feature into multiple binned features (probably num_categories // 255 + 1 features) and to update the Splitter and the predictors to treat that group of features as a single feature.
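To make the idea concrete, here is a minimal NumPy sketch of one possible mapping. It is only an illustration, not the proposed implementation: the helper names `split_categorical` and `merge_categorical`, and the use of the value 255 as a "category is in another column" sentinel, are assumptions made for this example.

```python
import numpy as np

BINS_PER_FEATURE = 255  # usable bins per uint8 column in this sketch
NOT_IN_GROUP = 255      # hypothetical sentinel: the category lives in another column


def split_categorical(codes, n_categories):
    """Map integer category codes to several uint8 binned columns."""
    codes = np.asarray(codes)
    n_features = n_categories // BINS_PER_FEATURE + 1
    out = np.full((codes.shape[0], n_features), NOT_IN_GROUP, dtype=np.uint8)
    group = codes // BINS_PER_FEATURE  # which internal column holds the category
    local_bin = (codes % BINS_PER_FEATURE).astype(np.uint8)  # bin inside that column
    out[np.arange(codes.shape[0]), group] = local_bin
    return out


def merge_categorical(binned):
    """Invert split_categorical: recover the original category codes."""
    # Exactly one column per row holds a non-sentinel value.
    group = np.argmax(binned != NOT_IN_GROUP, axis=1)
    local_bin = binned[np.arange(binned.shape[0]), group].astype(np.int64)
    return group * BINS_PER_FEATURE + local_bin


codes = np.array([0, 254, 255, 600, 999])
binned = split_categorical(codes, n_categories=1000)
recovered = merge_categorical(binned)
```

With 1000 categories this yields `1000 // 255 + 1 = 4` internal columns. The Splitter and the predictors would then have to treat those columns as one logical feature, which this sketch does not attempt.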

lorentzenchr commented 7 months ago

The number of bins is hardcoded in:

An alternative to allow for more than 256 bins is therefore

NicolasHug commented 7 months ago

Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. That sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

lorentzenchr commented 7 months ago

> Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. That sounds more risky to me, but if you implement it and benchmarks indicate no regression, then why not.

In memory, the histograms look the same: a contiguous array of hist_struct. The only difference is that we currently might have quite some unused bins.
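As a rough illustration of that layout, the sketch below builds a contiguous `(n_features, n_bins)` array of a structured dtype and counts how many bins are actually used. The field names of `hist_struct` and the helper `build_histograms` are assumptions for this example and may not match the real implementation; hessians are left at zero for brevity.

```python
import numpy as np

# Assumed layout of one histogram entry (field names are illustrative).
hist_struct = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])


def build_histograms(X_binned, gradients, n_bins):
    """Accumulate per-feature, per-bin gradient sums and sample counts."""
    n_samples, n_features = X_binned.shape
    hist = np.zeros((n_features, n_bins), dtype=hist_struct)
    for f in range(n_features):
        bins = X_binned[:, f]
        hist["count"][f] = np.bincount(bins, minlength=n_bins)
        hist["sum_gradients"][f] = np.bincount(
            bins, weights=gradients, minlength=n_bins
        )
    return hist


X_binned = np.array([[0, 1], [2, 1], [2, 0], [1, 1]], dtype=np.uint8)
gradients = np.array([0.5, -1.0, 0.25, 2.0])
hist = build_histograms(X_binned, gradients, n_bins=256)

# Number of bins per feature that received at least one sample.
used_bins = (hist["count"] > 0).sum(axis=1)
```

Here each feature allocates 256 entries but only two or three are populated, which is the "unused bins" situation described above.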

NicolasHug commented 7 months ago

I was thinking more of `X_binned` than of the histograms.

> The only difference is that we currently might have quite some unused bins

I assume that using a larger dtype is only going to worsen that problem? (Is that an actual problem in practice?)
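For a sense of scale on the `X_binned` side, here is a small sketch comparing the footprint of a binned matrix stored as `uint8` versus `uint16`. The array sizes are arbitrary, and the assumption that a wider dtype would be `uint16` is mine, not something decided in this thread.

```python
import numpy as np

n_samples, n_features = 100_000, 20  # arbitrary sizes for illustration

# One byte per entry: enough for up to 256 distinct bin indices.
X_binned_u8 = np.zeros((n_samples, n_features), dtype=np.uint8)
# Two bytes per entry: enough for up to 65536 distinct bin indices.
X_binned_u16 = np.zeros((n_samples, n_features), dtype=np.uint16)

bytes_u8 = X_binned_u8.nbytes
bytes_u16 = X_binned_u16.nbytes
print(f"uint8:  {bytes_u8 / 1e6:.1f} MB")
print(f"uint16: {bytes_u16 / 1e6:.1f} MB")
```

The binned matrix doubles in size regardless of how many features actually need more than 256 bins, so any performance impact would have to be checked with benchmarks.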