Closed: xuyxu closed this issue 3 years ago
We recently improved the memory efficiency of this estimator. Can you please try again with our nightly build?
https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds
I just tried your code snippet and it seems that the memory usage is roughly constant. However, if you enable the verbose=1
option you will see that it's going to take a long time to fit...
The training set has many features:
>>> X_train.shape
(6412, 55197)
Without the Exclusive Feature Bundling feature of LightGBM, this is going to be very slow in scikit-learn. I would rather advise doing some feature selection first (maybe by fitting a random forest, which should be more efficient thanks to random feature subsampling), identifying the top 1000 features using permutation_importance
on a validation set, and finally fitting a HGBRT model on those most predictive features.
Thanks for your advice. In addition, I think it would be nice if HGBRT in scikit-learn could support feature sub-sampling per boosting iteration, which is similar to the feature_fraction
parameter in LightGBM :-)
> support feature sub-sampling per boosting iteration
This is tracked in https://github.com/scikit-learn/scikit-learn/issues/16062 (as you know ;) ).
I'll close the issue since the original comment has been addressed.
Describe the bug
On the Sector dataset, with 55197 input features, the HistGradientBoostingClassifier crashes when initializing the array of histograms on the Cython side. To reproduce the error, the Sector dataset is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/sector/sector.scale.bz2
Steps/Code to Reproduce
Results
Versions
cython = 0.29.21
scikit-learn = 0.23.2
EDIT: It takes a while for the program to return the error message.