jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

Not happy with XGBoost evaluation performance #256

Closed: paranjapeved15 closed this issue 1 year ago

paranjapeved15 commented 1 year ago

Our Setup
We are running inference on a few thousand samples at a time in a Kotlin microservice with JPMML. There are about 7-8 input features of numeric type. The target variable is a probability value.
Multi-threading: We have multiple threads, each calling Evaluator#evaluate on the model.
JPMML version: 1.5.15
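
For reference, a single scoring call with the JPMML-Evaluator 1.5.x API looks roughly like the minimal sketch below (the file path and feature names are placeholders, not the actual service code):

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

public class SingleRecordScoringExample {

    public static void main(String[] args) throws Exception {
        // Build a reusable evaluator from the PMML file ("model.pmml" is a placeholder path)
        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
            .load(new File("model.pmml"))
            .build();

        // Self-check against embedded verification data, if the PMML document contains any
        evaluator.verify();

        // One input record; the feature names are hypothetical
        Map<String, ?> inputRecord = Map.of("feature_1", 1.0, "feature_2", 2.5);

        // Prepare (validate, convert) the arguments field by field
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            FieldName name = inputField.getName();
            FieldValue value = inputField.prepare(inputRecord.get(name.getValue()));
            arguments.put(name, value);
        }

        // Evaluate and decode the results to plain Java values
        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        Map<String, ?> resultRecord = EvaluatorUtil.decodeAll(results);

        System.out.println(resultRecord);
    }
}
```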

Model details
Our model is an XGBClassifier with learning rate 0.02, ~315 estimators, max depth = 30, and monotone constraints on 2 of the features. We have created a DataFrameMapper through which we do some simple manipulations (e.g. value imputation/clipping of features). We converted the XGBoost model to PMML using the sklearn2pmml package. Previously we were using a logistic regression classifier.

Size Comparison
Logistic regression PMML size: 25 kB
Tree-based model PMML size: 16 MB

Inference Performance
The logistic regression model performed much better on average: each sample took about 100 nanoseconds to score. The new tree-based model takes about 2 milliseconds per sample on average, i.e. about 20 times more.

Questions
Is it just the sheer size of the PMML which is deteriorating performance?
Is it possible to improve the inference time of the tree-based models in some way?
I read in one of your issues that vector processing is not possible in a Java environment. Is there any other way we can improve parallel processing of the models?

vruusmann commented 1 year ago

Is it just the sheer size of the pmml which is deteriorating performance?

Based on your report, the logistic regression PMML is 25 kB, while the tree-based model PMML is 16 MB, i.e. roughly 640 times bigger (16,000 kB / 25 kB = 640).

In other words, the complexity of the model grew 640X, whereas the evaluation time grew 20X. Not a bad trade-off, I would say.

Is it possible to improve the inference time of the tree based models in some way?

First, while exporting your Python model using SkLearn2PMML, it is possible to choose between different model representations. Some representations optimize for readability (human-friendliness), others for evaluation performance.

For example, XGBoost models can represent splits in different ways, plus XGBoost decision trees themselves can be rearranged to make them flatter and more compact.

Second, for performance-critical workloads, use the JPMML-Transpiler library (on top of the JPMML-Evaluator library).
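
A minimal sketch of how the two libraries could be combined, based on my reading of the JPMML-Transpiler README (class names such as FileTranspiler and TranspilerTransformer, and the placeholder file paths, should be verified against the version in use):

```java
import java.io.File;

import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;
import org.jpmml.transpiler.FileTranspiler;
import org.jpmml.transpiler.Transpiler;
import org.jpmml.transpiler.TranspilerTransformer;

public class TranspiledEvaluatorExample {

    public static void main(String[] args) throws Exception {
        // Transpile the PMML markup into generated Java classes,
        // caching them in a JAR file ("model.jar" is a placeholder path)
        Transpiler transpiler = new FileTranspiler(null, new File("model.jar"));

        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
            .load(new File("model.pmml"))
            .transform(new TranspilerTransformer(transpiler))
            .build();

        // The evaluator is used exactly like a regular JPMML-Evaluator instance;
        // only the backing representation (generated Java code vs. interpreted
        // PMML class model) changes.
        evaluator.verify();
    }
}
```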

I read on one of your issues that vector processing is not possible in a java environment

You can vectorize linear algebra operations (e.g. logistic regression), but you cannot vectorize conditional operations (e.g. decision trees such as XGBoost).

It doesn't matter what the front-end API is (native XGBoost on GPU vs. JPMML on CPU); the evaluation of XGBoost models always happens one data record at a time.

Is there any other way we can improve parallel processing of the models?

  1. Review your Python data science workflow.
  2. When exporting Python objects to PMML using the SkLearn2PMML package, choose a performance-oriented model representation.
  3. Use JPMML-Transpiler when the model is transpile-able (XGBoost falls nicely into this category).
  4. Buy my professional consultation services.

paranjapeved15 commented 1 year ago

Thanks @vruusmann for your recommendations. Question: Does the JPMML library have a provision for executing each tree in the ensemble in parallel?

vruusmann commented 1 year ago

Question: Does the JPMML library have a provision for executing each tree in the ensemble in parallel?

Currently, NO.

The thinking is that it would consume more time to coordinate the work between threads than it takes to do the work in a single thread.

In your example, it takes 2 millis to do 315 elementary trees. That is ~0.00635 millis (6.35 micros) per tree. What is your estimate of how much time it would take to split/join this work between 315 threads? Multi-threading won't make a single tree evaluate any faster.

Sure, perhaps there's a reasonable trade-off by splitting the work between 3 threads (not 315), so that each thread does ~105 trees.

But since your application scenario is about batch scoring ~1000 data records, you would be better off figuring out a mechanism for dividing the data records between the right number of threads (while treating the evaluation of each data record as an atomic operation).
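
For illustration, a per-record parallel batch scorer on the caller's side might look roughly like the sketch below. It assumes the argument maps have been prepared beforehand, uses plain JDK executors together with the JPMML-Evaluator 1.5.x API, and shares a single evaluator instance between threads (evaluator instances are designed to be thread-safe):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;

public class BatchScorer {

    /**
     * Scores a batch of prepared argument maps by dividing the data records
     * (not the member decision trees) between a fixed number of worker threads.
     * The evaluation of each data record stays an atomic, single-threaded operation.
     */
    public static List<Map<FieldName, ?>> scoreBatch(Evaluator evaluator,
            List<Map<FieldName, FieldValue>> batch, int nThreads) throws Exception {

        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            // One task per data record; the shared Evaluator instance is safe to reuse here
            List<Callable<Map<FieldName, ?>>> tasks = new ArrayList<>();
            for (Map<FieldName, FieldValue> arguments : batch) {
                tasks.add(() -> evaluator.evaluate(arguments));
            }

            // Collect the results in the same order as the input batch
            List<Map<FieldName, ?>> results = new ArrayList<>();
            for (Future<Map<FieldName, ?>> future : executor.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            executor.shutdown();
        }
    }
}
```

In practice one would probably submit chunks of, say, a few hundred data records per task rather than individual records, so that task-scheduling overhead stays negligible relative to the per-record evaluation cost.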