Add support for HistGradientBoostingRegressor

samuelefiorini commented 1 year ago

I use Greykite to forecast hourly time-series with years of historical data and fit_algorithm=gradient_boosting is very slow.

According to the sklearn.ensemble.HistGradientBoostingRegressor documentation:

"This estimator is much faster than GradientBoostingRegressor for big datasets (n_samples >= 10 000)."

Have you considered adding support for this estimator? It looks straightforward from here, but I may be wrong.

amyfei2015 commented 1 year ago

Thanks for the suggestion! We haven't planned for this yet, but we've made a note of it. We will update you if we implement this feature. In the meantime, please feel free to submit a pull request for this change if you need it. Thanks!

samuelefiorini commented 1 year ago

Thanks. I ran some experiments (here) and was able to make it run, although it is far from being a PR. In my case (hourly forecasts with 2+ years of historical data), HistGradientBoostingRegressor is much faster than GradientBoostingRegressor (around 4x), with roughly the same performance in backtests.

However, there are some points to discuss. For instance, due to its implementation, HistGradientBoostingRegressor does not offer a native feature importance measure, whereas both GradientBoostingRegressor and RandomForestRegressor do.

A possible approach would be to rely on something like sklearn.inspection.permutation_importance, but this comes at a higher computational cost and is probably not ideal. Alternatively, an empty dummy array could be returned, perhaps with a warning to inform the user.

samuelefiorini commented 1 month ago

It’s been a while, but the issue about adding feature importances to the HistGradientBoosting* estimators is still open on scikit-learn: #15132. I’m adding this here for future reference.

linkedin / greykite

Add support for HistGradientBoostingRegressor #105