
HistGradientBoostingClassifier crashes on datasets with a large number of input features #18703

Closed xuyxu closed 3 years ago

xuyxu commented 3 years ago

Describe the bug

On the Sector dataset, which has 55197 input features, HistGradientBoostingClassifier crashes when initializing the array of histograms on the Cython side.

To reproduce the error, the Sector dataset is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/sector/sector.scale.bz2

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier

def load_sector_train():
    train = load_svmlight_file('sector.scale.bz2')

    X_train = train[0].toarray()
    y_train = train[1] - 1

    # np.int is a deprecated alias of the builtin int; use an explicit dtype
    return X_train, y_train.astype(np.int64)

if __name__ == '__main__':

    X_train, y_train = load_sector_train()

    model = HistGradientBoostingClassifier(max_iter=100,
                                           loss='categorical_crossentropy',
                                           validation_fraction=None,
                                           random_state=0)

    model.fit(X_train, y_train)

Results

Traceback (most recent call last):

  File "FILENAME", line 64, in <module>
    model.fit(X_train, y_train)

  File "C:\Software\Anaconda\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py", line 356, in fit
    grower = TreeGrower(

  File "C:\Software\Anaconda\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\grower.py", line 251, in __init__
    self._intilialize_root(gradients, hessians, hessians_are_constant)

  File "C:\Software\Anaconda\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\grower.py", line 334, in _intilialize_root
    self.root.histograms = self.histogram_builder.compute_histograms_brute(

  File "sklearn\ensemble\_hist_gradient_boosting\histogram.pyx", line 135, in sklearn.ensemble._hist_gradient_boosting.histogram.HistogramBuilder.compute_histograms_brute

MemoryError: Unable to allocate 270. MiB for an array with shape (55197, 256) and data type [('sum_gradients', '<f8'), ('sum_hessians', '<f8'), ('count', '<u4')]
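
For reference, the requested allocation matches the histogram layout implied by the error message: one 20-byte record (two float64 fields plus one uint32) per bin, 256 bins per feature, 55197 features. A quick back-of-the-envelope check (not from the thread, just verifying the reported number):

import numpy as np

# Histogram record dtype copied from the error message above
hist_dtype = np.dtype([('sum_gradients', '<f8'),
                       ('sum_hessians', '<f8'),
                       ('count', '<u4')])
n_features, n_bins = 55197, 256

print(hist_dtype.itemsize)                                # 20 bytes per bin
print(n_features * n_bins * hist_dtype.itemsize / 2**20)  # ~269.5 MiB, i.e. the "270. MiB" reported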

Versions

cython = 0.29.21
scikit-learn = 0.23.2

EDIT: It takes a while for the program to return the error message.

ogrisel commented 3 years ago

We recently improved the memory efficiency of this estimator. Can you please try again with our nightly build?

https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds

ogrisel commented 3 years ago

I just tried your code snippet and the memory usage seems to be roughly constant. However, if you enable the verbose=1 option you will see that it's going to take a long time to fit...

The training set has many features:

>>> X_train.shape
(6412, 55197)

Without LightGBM's Exclusive Feature Bundling optimization, this is going to be very slow in scikit-learn. I would rather advise doing some feature selection first (maybe by fitting a random forest, which should be more efficient thanks to its random feature subsampling), then identifying the top 1000 features with permutation_importance on a validation set, and finally fitting a HGBRT model on those most predictive features.
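
A minimal sketch of that workflow (not verified code from the thread; it reuses X_train/y_train from the snippet above, and the constants such as n_repeats=3 and the top-1000 cutoff are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hold out a validation set for computing permutation importances
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)

# Random forests cope with wide data thanks to random feature
# subsampling at each split
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)

# Rank features by permutation importance on the validation set
result = permutation_importance(rf, X_val, y_val, n_repeats=3,
                                random_state=0, n_jobs=-1)
top_k = np.argsort(result.importances_mean)[::-1][:1000]

# Fit HGBRT on the most predictive features only
model = HistGradientBoostingClassifier(max_iter=100, random_state=0)
model.fit(X_tr[:, top_k], y_tr)
print(model.score(X_val[:, top_k], y_val))

Note that permuting each of the 55197 features is itself costly, so in practice one might pre-filter with the forest's built-in feature_importances_ before running permutation_importance.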

xuyxu commented 3 years ago

Thanks for your advice. In addition, I think it would be nice if the HGBRT in scikit-learn could support feature sub-sampling per boosting iteration, similar to the feature_fraction parameter in LightGBM :-)
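
For context, a sketch of what that looks like in LightGBM's scikit-learn API (assuming the lightgbm package and the X_train/y_train from above; colsample_bytree is the scikit-learn-style alias of feature_fraction):

import lightgbm as lgb

# Each tree is grown on a random 10% subset of the columns,
# drawn anew at every boosting iteration
model = lgb.LGBMClassifier(n_estimators=100,
                           colsample_bytree=0.1,  # alias of feature_fraction
                           random_state=0)
model.fit(X_train, y_train)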

NicolasHug commented 3 years ago

support feature sub-sampling per boosting iteration

This is tracked in https://github.com/scikit-learn/scikit-learn/issues/16062 (as you know ;) ).

I'll close the issue since the original comment has been addressed.