jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

CategoricalDomain performance in transform step #431

Open ghost opened 1 month ago

ghost commented 1 month ago

Hi,

I'm currently working with a large dataset that has numerous categorical features, some of them with many categories. I have set up the pipeline with the corresponding decorators for each feature, and I'm using XGBoost.

I noticed that the CategoricalDomain decorator spends a lot of time in the transform step. After digging into the code, I found that most of the time is spent in _compute_masks, specifically in computing the valid mask (_valid_value_mask). I'm using the CategoricalDomain decorator with invalid_value_treatment='as_is', in which case the valid/invalid masks are not really needed, as no transformation happens.

Would it be possible to skip the step of calculating the valid/invalid mask in case invalid_value_treatment is set to 'as_is'?
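For context, the mask computation being discussed can be sketched roughly like this. This is a hypothetical stand-in, not sklearn2pmml's actual code; the point is that the valid mask requires a numpy.isin call over the declared value space, which gets expensive for high-cardinality features:

```python
import numpy

# Rough sketch of the three boolean masks involved (hypothetical helper,
# not sklearn2pmml's actual implementation)
def compute_masks(X, data_values):
    # Treat None as missing (a simplification of the real missing-value check)
    missing_mask = numpy.asarray([v is None for v in X])
    nonmissing_mask = ~missing_mask
    # numpy.isin over the declared value space: this is the expensive part
    # for features with many categories
    valid_mask = numpy.logical_and(numpy.isin(X, data_values), nonmissing_mask)
    invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
    return missing_mask, valid_mask, invalid_mask

X = numpy.asarray(["a", "b", "z", "a"], dtype = object)
missing_mask, valid_mask, invalid_mask = compute_masks(X, ["a", "b", "c"])
```

With invalid_value_treatment='as_is', nothing downstream consumes valid_mask or invalid_mask, which is why skipping them looks attractive.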

vruusmann commented 1 month ago

Just to be sure, you're experiencing this bad performance issue when using the latest SkLearn2PMML version (currently 0.110.0)?

and-ruid commented 1 month ago

Sorry, I forgot to mention the version. Yes, I'm using the latest 0.110.0.

and-ruid commented 1 month ago

Here are some timings for the performance. 99% of the mapper step's time is spent in CategoricalDomain's transform.

[Pipeline] ............ (step 1 of 2) Processing mapper, total=20.6min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  36.2s

For categorical features with the CategoricalDomain decorator, timings look like this (I know there is a really 'bad' feature with tons of categories):

2024-08-18 17:47:23,358:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1.369734 secs
2024-08-18 17:47:33,321:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 9.88299 secs
2024-08-18 18:06:07,930:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1106.247428 secs

Timings for numerical features with the ContinuousDomain decorator are in a normal range:

2024-08-18 18:06:11,677:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.019175 secs
2024-08-18 18:06:11,706:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.017644 secs
2024-08-18 18:06:11,748:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.020861 secs
vruusmann commented 1 month ago

The Domain._compute_masks(X) method returns a 3-tuple of boolean arrays (the missing mask, the valid mask and the invalid mask).

Indeed, in case of invalid_value_treatment = "as_is" there is no need to distinguish between the valid and invalid subspaces (only missing vs. non-missing is needed). In such a situation, the second and third elements of the tuple could be set to None (instead of boolean arrays), and the Domain.transform(X) method could simply skip a value space if the corresponding mask is None.

@a-rudnik Can you implement something along those lines locally, and run your benchmarks again? This way you can be sure that the fix is relevant/sufficient.

However, the valid subspace mask is calculated using the numpy.isin(x, values) function. Is the time really spent in there, or somewhere around it?
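One way to answer that question is a micro-benchmark of numpy.isin in isolation, on a synthetic high-cardinality object-dtype column (the sizes below are made up, not the reporter's data):

```python
import time

import numpy

rng = numpy.random.default_rng(42)

# Synthetic high-cardinality categorical column (object dtype, as in pandas)
categories = numpy.asarray(["cat_" + str(i) for i in range(20000)], dtype = object)
x = rng.choice(categories, size = 100000)

# Time the numpy.isin call alone, separately from any surrounding mask logic
start = time.perf_counter()
mask = numpy.isin(x, categories)
print("numpy.isin: {:.3f} s".format(time.perf_counter() - start))
```

Comparing this number against the total _compute_masks time shows how much of the cost is numpy.isin itself versus the code around it.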

and-ruid commented 1 month ago

I added some time measurements to the code. Some time is certainly also spent computing the other masks, but that becomes insignificant as the number of categories grows. Most of the time goes into calculating the valid mask:

Code block '_isin_mask' took: 491.21950 ms
Code block '_valid_value_mask' took: 584.56533 ms
Code block '_compute_masks' took: 791.44471 ms
2024-08-21 09:24:57,090:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1.070773 secs
Code block '_isin_mask' took: 9017.28933 ms
Code block '_valid_value_mask' took: 9093.18437 ms
Code block '_compute_masks' took: 9303.03583 ms
2024-08-21 09:25:06,774:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 9.605799 secs
Code block '_isin_mask' took: 1103150.89917 ms
Code block '_valid_value_mask' took: 1103226.58283 ms
Code block '_compute_masks' took: 1103420.87033 ms
2024-08-21 09:43:36,908:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1103.709195 secs

Following are the changes I made to the code (diff output):

180c180
<       elif self.invalid_value_treatment == "as_is":
---
>       elif (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
196,197c196,201
<       valid_mask = self._valid_value_mask(X, nonmissing_mask)
<       invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
---
>       if (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
>           valid_mask = None
>           invalid_mask = None
>       else:
>           valid_mask = self._valid_value_mask(X, nonmissing_mask)
>           invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
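As a stand-alone sketch, the patched behaviour amounts to short-circuiting before the expensive numpy.isin call. This uses hypothetical helper and parameter names mirroring the diff, not the exact sklearn2pmml code (in particular, "return_invalid" below is only used to force the full computation in this example):

```python
import numpy

# Sketch of the patched mask computation: skip the valid/invalid masks
# entirely when they cannot affect the transform output
def compute_masks(X, data_values, invalid_value_treatment, missing_value_treatment = "as_is"):
    missing_mask = numpy.asarray([v is None for v in X])
    if (invalid_value_treatment == "as_is") or \
            (invalid_value_treatment == "as_missing" and missing_value_treatment == "as_is"):
        # No valid/invalid distinction is needed; avoid numpy.isin altogether
        return missing_mask, None, None
    nonmissing_mask = ~missing_mask
    valid_mask = numpy.logical_and(numpy.isin(X, data_values), nonmissing_mask)
    invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
    return missing_mask, valid_mask, invalid_mask

X = numpy.asarray(["a", "b", "z"], dtype = object)
masks_fast = compute_masks(X, ["a", "b"], invalid_value_treatment = "as_is")
masks_full = compute_masks(X, ["a", "b"], invalid_value_treatment = "return_invalid")
```

The fast path returns None for the second and third tuple elements, so the transform step must be prepared to skip a value space whose mask is None.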

Now the performance is much better:

Code block '_compute_masks' took: 204.30683 ms
2024-08-21 10:37:12,503:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.746888 secs
Code block '_compute_masks' took: 207.54517 ms
2024-08-21 10:37:13,380:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.799804 secs
Code block '_compute_masks' took: 184.08946 ms
2024-08-21 10:37:20,148:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.798046 secs