scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

HistGradientBoostingClassifier slow in prediction mode #16429

Open SebastianBr opened 4 years ago

SebastianBr commented 4 years ago

While HistGradientBoostingClassifier is about 100× faster than GradientBoostingClassifier when fitting the model, I found it to be very slow when predicting class probabilities, in my case about 100 times slower :-(

For example, with 1 million training examples: GradientBoostingClassifier takes 3.2 min to train and 32 ms for 1000 predictions; HistGradientBoostingClassifier takes 7 s to train and 1 s for 1000 predictions.
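A minimal benchmark sketch along these lines (not the reporter's exact script; the dataset shape and default hyperparameters are assumptions):

```python
from time import perf_counter

from sklearn.datasets import make_classification
# On scikit-learn versions where HistGradientBoosting* is still experimental,
# `from sklearn.experimental import enable_hist_gradient_boosting` is needed first.
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

X, y = make_classification(n_samples=1_000_000, n_features=20, random_state=0)
X_batch = X[:1000]  # small batch used for the prediction latency measurement

for Est in (GradientBoostingClassifier, HistGradientBoostingClassifier):
    clf = Est()
    t0 = perf_counter()
    clf.fit(X, y)
    t1 = perf_counter()
    clf.predict_proba(X_batch)
    t2 = perf_counter()
    print(f"{Est.__name__}: fit {t1 - t0:.1f} s, "
          f"predict_proba(1000) {(t2 - t1) * 1e3:.1f} ms")
```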

ogrisel commented 4 years ago

Indeed the prediction latency was optimized for GradientBoostingClassifier and we haven't really done similar work for the newer HistGradientBoostingClassifier.

Just to clarify, do you predict on a batch of 1000 samples in a single numpy array or do you call the predict method 1000 times with 1 sample at a time?
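For illustration (not from the thread), the two calling patterns being asked about, reusing `clf` and `X_batch` from the benchmark sketch above:

```python
import numpy as np

# One batched call on a (1000, n_features) array ...
proba_batch = clf.predict_proba(X_batch)

# ... versus 1000 calls with a single sample each.
proba_loop = np.vstack([clf.predict_proba(row.reshape(1, -1)) for row in X_batch])
```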

ogrisel commented 4 years ago

We need to do a profiling session with py-spy with the --native flag to spot the performance bottleneck:

https://github.com/benfred/py-spy
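A possible invocation, as a sketch only (`bench.py` is a placeholder name for the benchmark script above):

```bash
# Sample the Python process, including native (Cython/OpenMP) frames, and write a flame graph.
py-spy record --native -o profile.svg -- python bench.py
```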

ogrisel commented 4 years ago

ping @jeremiedbb

NicolasHug commented 4 years ago

Thanks for the report @SebastianBr. Do you observe a similar difference when predicting for 1 million samples instead?

The prediction of the HistGB is multi-threaded, but the regular GB isn't. With 1000 samples, the thread spawning might be too much overhead. If you have time, I'd be interested to see what it does if you set OMP_NUM_THREADS to 1.

@ogrisel, just a side note: we parallelize _raw_predict (i.e. decision_function), but not the decision_function_to_proba method, which is sequential. There's room for improvement here, though I doubt this is the issue.

"Indeed the prediction latency was optimized for GradientBoostingClassifier"

Curious, what are you referring to?

SebastianBr commented 4 years ago

"Just to clarify, do you predict on a batch of 1000 samples in a single numpy array or do you call the predict method 1000 times with 1 sample at a time?"

I predict on a batch.

I tested it again on larger datasets. Starting from 100k samples, GB and HGB are indeed about equally fast. It seems there is some overhead that is probably not relevant in most cases. But since I often have smaller batches of 10 to 1000 examples and need predictions in real time, HGB isn't a good choice for me currently.

Do you see any chance for an improvement?

ogrisel commented 4 years ago

"@ogrisel, just a side note: we parallelize _raw_predict (i.e. decision_function), but not the decision_function_to_proba method, which is sequential. There's room for improvement here, though I doubt this is the issue."

I agree.

"Indeed the prediction latency was optimized for GradientBoostingClassifier"

"Curious, what are you referring to?"

I remember that @pprett spent a lot of time profiling the predict method of GradientBoostingClassifier to make sure that prediction latency on small batches would be as low as possible.

SebastianBr commented 4 years ago

Oh, I see. I haven't tried OMP_NUM_THREADS. It isn't a parameter, so where can I set it?

ogrisel commented 4 years ago

OMP_NUM_THREADS is an environment variable. Try to set it to 1 to run in sequential mode on small batches.
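As a hedged aside, the same effect can also be obtained from inside Python via threadpoolctl (which recent scikit-learn versions depend on), for example around the predict call from the benchmark sketch above:

```python
from threadpoolctl import threadpool_limits

# Restrict the OpenMP thread pool used by HistGradientBoosting's predict loop
# to a single thread for this call only.
with threadpool_limits(limits=1, user_api="openmp"):
    proba = clf.predict_proba(X_batch)
```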

NicolasHug commented 4 years ago

"Oh, I see. I haven't tried OMP_NUM_THREADS. It isn't a parameter, so where can I set it?"

It's an environment variable; you can do OMP_NUM_THREADS=1 python the_script.py

"Do you see any chance for an improvement?"

Yes. The first obvious thing is to check whether this is indeed a thread-overhead issue and, if so, only parallelize the code when the number of samples is high enough. The second is to pack the tree structure so that it optimizes cache hits.
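A hypothetical sketch of the first idea, choosing the number of OpenMP threads from the batch size; this is not scikit-learn's actual code, and the helper name and threshold are made up:

```python
import os

def effective_n_threads(n_samples, min_samples_per_thread=10_000):
    """Use one thread per sufficiently large chunk of work, capped at the
    number of CPUs, and fall back to sequential execution for small batches."""
    max_threads = os.cpu_count() or 1
    return max(1, min(max_threads, n_samples // min_samples_per_thread))
```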

SebastianBr commented 4 years ago

Thanks for the info. I benchmarked the prediction with OMP_NUM_THREADS=1 and it was actually slower. For example, the smallest batch I tested, of size 5, took 25 ms with one thread and 16 ms with multithreading.

So it seems the overhead lies in the method itself.

NicolasHug commented 4 years ago

Interesting, thanks.

I'll run a few benchmarks on my side. From what I remember, we're consistently faster than LightGBM and XGBoost on prediction (but I haven't tested against the regular GB estimators).

stonebig commented 10 months ago

There is a recent article (April 2023) comparing CatBoost / LightGBM / XGBoost / scikit-learn GradientBoosting / scikit-learn HistGradientBoosting:

https://kr-uttam.medium.com/boosting-techniques-battle-catboost-vs-xgboost-vs-lightgbm-vs-scikit-learn-gradientboosting-vs-c106afc85dda

https://gist.githubusercontent.com/uttamkumar15/9450916993077a6f6a31bf94cbe30927/raw/062c0b624977e7c1d7e4bb746f4195e02eefdd37/boosting%20comparison

[image: benchmark comparison chart from the linked gist]

stonebig commented 10 months ago

And there is a study published on arXiv in May 2023: https://arxiv.org/pdf/2305.17094.pdf

Apparently XGBoost can use the GPU, but that doesn't mean it's better than scikit-learn:

https://github.com/rapidsai/cuml/issues/5374#issuecomment-1520108700