Open NicolasHug opened 1 year ago
The number of bins is hardcoded in:
An alternative to allow for more than 256 bins is therefore
Updating the underlying data structure will lead to a different memory footprint, and likely different performance as well. That sounds riskier to me, but if you implement it and benchmarks indicate no regression, then why not.
> Updating the underlying data-structure will lead to a different memory footprint, and likely different perf as well. Sounds more risky to me, but if you implement it and benchmark indicate no regression, then why not.
In memory, the histograms look the same: a contiguous array of `hist_struct`. The only difference is that we currently might have quite a few unused bins.
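To make the layout concrete, here is a hedged sketch of such a contiguous histogram array with NumPy. The field names are an assumption modeled on scikit-learn's `hist_struct` (accumulated gradients, hessians, and sample counts per bin); the exact internal dtype may differ.

```python
import numpy as np

# Illustrative struct for one histogram bin (field names are an
# assumption, modeled on scikit-learn's hist_struct).
HIST_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

n_features, n_bins = 10, 256
# One contiguous (n_features, n_bins) block of hist_struct entries.
# Features with fewer actual bins simply leave trailing bins unused,
# which is the "unused bins" overhead mentioned above.
histograms = np.zeros((n_features, n_bins), dtype=HIST_DTYPE)
print(histograms.nbytes)  # total footprint = n_features * n_bins * itemsize
```

Note that the footprint depends only on the allocated bin count, not on how many bins a given feature actually fills.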
I was thinking more of `X_binned` rather than of the histograms.
> The only difference is that we currently might have quite some unused bins
I assume that using a larger dtype is only going to worsen that problem? (Is that an actual problem in practice?)
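The dtype concern is easy to quantify: widening `X_binned` from uint8 to uint16 doubles the binned matrix's memory regardless of how many bins are actually used. A minimal sketch (the array shapes are illustrative, not taken from the library):

```python
import numpy as np

n_samples, n_features = 100_000, 50

# X_binned with uint8: at most 256 bins per feature.
X_u8 = np.zeros((n_samples, n_features), dtype=np.uint8)

# Switching to uint16 would allow up to 65536 bins per feature...
X_u16 = np.zeros((n_samples, n_features), dtype=np.uint16)

# ...but doubles the footprint even for features that need few bins:
# 5,000,000 bytes vs 10,000,000 bytes here.
print(X_u8.nbytes, X_u16.nbytes)
```

So the cost of a wider dtype is paid on the full binned matrix, whereas unused histogram bins only cost memory per feature per tree node.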
As originally sketched in https://github.com/scikit-learn/scikit-learn/pull/26268#issuecomment-1520504489, there might be a way to enable support for arbitrarily high values of `max_bins` for both categorical and numerical features. This may not be super critical for numerical features, but it would enable categorical features of arbitrary cardinality, which is desirable. The rough idea is to internally map an input categorical feature into multiple binned features (probably `num_categories // 255 + 1` features) and to update the `Splitter` and the predictors to treat that group of features as a single feature.
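The mapping step of that idea can be sketched as follows. The encoding here is hypothetical (category `c` lands in sub-feature `c // max_bins` at bin `c % max_bins`, with a reserved "absent" bin elsewhere); the discussion above leaves the actual encoding, and how the `Splitter` would recombine the group, open.

```python
import numpy as np

def split_categorical(codes, max_bins=255):
    """Map integer category codes into multiple uint8 sub-features.

    Hypothetical encoding: category c goes to sub-feature c // max_bins
    with bin value c % max_bins; every other sub-feature gets a
    reserved "absent" bin (here, the value max_bins itself).
    """
    codes = np.asarray(codes)
    # Roughly num_categories // 255 + 1 sub-features, as suggested above.
    n_sub = codes.max() // max_bins + 1
    out = np.full((codes.shape[0], n_sub), max_bins, dtype=np.uint8)
    rows = np.arange(codes.shape[0])
    out[rows, codes // max_bins] = (codes % max_bins).astype(np.uint8)
    return out

# Cardinality 601 needs 3 uint8 sub-features under this scheme.
codes = np.array([0, 254, 255, 600])
binned = split_categorical(codes)
print(binned)
```

Each sample sets exactly one "active" bin across the group, which is what the splitting and prediction code would have to exploit to treat the group as one logical feature.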