Open ghost opened 1 month ago
Just to be sure, you're experiencing this bad performance issue when using the latest SkLearn2PMML version (currently 0.110.0)?
Sorry, I forgot to mention the version. Yes, I'm using the latest, 0.110.0.
Here are some timings. 99% of the mapper step's time is spent in CategoricalDomain's transform:
[Pipeline] ............ (step 1 of 2) Processing mapper, total=20.6min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 36.2s
For categorical features with the CategoricalDomain decorator, the timings look like this (I know there is one really 'bad' feature with tons of categories):
2024-08-18 17:47:23,358:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1.369734 secs
2024-08-18 17:47:33,321:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 9.88299 secs
2024-08-18 18:06:07,930:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1106.247428 secs
Timings for numerical features with the ContinuousDomain decorator are in a normal range:
2024-08-18 18:06:11,677:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.019175 secs
2024-08-18 18:06:11,706:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.017644 secs
2024-08-18 18:06:11,748:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.020861 secs
The Domain._compute_masks(X) method returns a 3-tuple of boolean arrays, its elements representing the missing, valid and invalid value masks.
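In rough terms (a hypothetical sketch of that 3-tuple split, not the actual sklearn2pmml implementation; the function name and signature are made up):

```python
import numpy


def compute_masks(X, valid_values, missing_values=(None,)):
    """Sketch: split X into missing / valid / invalid value subspaces."""
    X = numpy.asarray(X, dtype=object)
    # Missing mask: element is one of the declared missing markers
    missing_mask = numpy.isin(X, numpy.asarray(missing_values, dtype=object))
    # Valid mask: non-missing and listed among the known categories
    valid_mask = ~missing_mask & numpy.isin(X, numpy.asarray(valid_values, dtype=object))
    # Invalid mask: everything that is neither missing nor valid
    invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
    return missing_mask, valid_mask, invalid_mask
```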
Indeed, in the case of invalid_value_treatment = "as_is" there is no need to distinguish between the valid and invalid value subspaces (only missing vs. non-missing matters). In such a situation, the second and third elements of the tuple could be set to None (instead of boolean arrays), and the Domain.transform(X) method could then simply skip a value subspace whose mask is None.
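A minimal sketch of that skip logic (hypothetical names, not the actual sklearn2pmml code):

```python
import numpy


def transform(X, missing_mask, valid_mask, invalid_mask,
              missing_replacement="(missing)"):
    """A value subspace is only rewritten if its mask was actually
    computed; a None mask means 'leave that subspace as-is'."""
    X = numpy.asarray(X, dtype=object).copy()
    if missing_mask is not None:
        X[missing_mask] = missing_replacement
    if invalid_mask is not None:
        # e.g. invalid_value_treatment == "as_missing"
        X[invalid_mask] = missing_replacement
    return X
```

With valid_mask and invalid_mask set to None, only the missing-value handling runs.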
@a-rudnik Can you implement something along those lines locally, and run your benchmarks again? This way you can be sure that the fix is relevant/sufficient.
However, the valid subspace mask is calculated using the numpy.isin(x, values) function. Is the time really spent in there, or somewhere around it?
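For reference, numpy.isin performs an element-wise membership test:

```python
import numpy

x = numpy.array(["a", "b", "z", "a"], dtype=object)
values = numpy.array(["a", "b"], dtype=object)

# True wherever the element of `x` occurs in `values`
mask = numpy.isin(x, values)
print(mask.tolist())  # [True, True, False, True]
```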
I added some time measurements to the code. Some time is certainly also spent computing the other masks, but that becomes insignificant as the number of categories grows. Most of the time is spent calculating the valid mask:
Code block '_isin_mask' took: 491.21950 ms
Code block '_valid_value_mask' took: 584.56533 ms
Code block '_compute_masks' took: 791.44471 ms
2024-08-21 09:24:57,090:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1.070773 secs
Code block '_isin_mask' took: 9017.28933 ms
Code block '_valid_value_mask' took: 9093.18437 ms
Code block '_compute_masks' took: 9303.03583 ms
2024-08-21 09:25:06,774:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 9.605799 secs
Code block '_isin_mask' took: 1103150.89917 ms
Code block '_valid_value_mask' took: 1103226.58283 ms
Code block '_compute_masks' took: 1103420.87033 ms
2024-08-21 09:43:36,908:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1103.709195 secs
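For anyone who wants to reproduce this kind of measurement, here is a minimal timing sketch. The helper and the data sizes are made up for illustration; they are not taken from my actual pipeline:

```python
import time

import numpy


def timed(label, fn, *args):
    """Tiny helper mirroring the "Code block '...' took" log lines above."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print("Code block '%s' took: %.5f ms" % (label, elapsed_ms))
    return result


rng = numpy.random.default_rng(0)
n_rows, n_categories = 100_000, 5_000  # hypothetical sizes
values = numpy.array(["cat_%d" % i for i in range(n_categories)], dtype=object)
x = rng.choice(values, size=n_rows)

valid_mask = timed("_isin_mask", numpy.isin, x, values)
```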
Following are the changes I made to the code (diff output):
180c180
< elif self.invalid_value_treatment == "as_is":
---
> elif (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
196,197c196,201
< valid_mask = self._valid_value_mask(X, nonmissing_mask)
< invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
---
> if (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
> valid_mask = None
> invalid_mask = None
> else:
> valid_mask = self._valid_value_mask(X, nonmissing_mask)
> invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
Now the performance is much better:
Code block '_compute_masks' took: 204.30683 ms
2024-08-21 10:37:12,503:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.746888 secs
Code block '_compute_masks' took: 207.54517 ms
2024-08-21 10:37:13,380:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.799804 secs
Code block '_compute_masks' took: 184.08946 ms
2024-08-21 10:37:20,148:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.798046 secs
Hi,
I'm currently working with a larger dataset that has numerous categorical features, some of them with many categories. I have set up the pipeline with the corresponding decorator for each feature, and I'm using XGBoost as the classifier.
I noticed that the CategoricalDomain decorator spends a lot of time in the transform step. Digging into the code, I found that most of the time is spent in _compute_masks, specifically in computing the valid mask (_valid_value_mask). I'm using the CategoricalDomain decorator with invalid_value_treatment='as_is', in which case the valid/invalid masks are not really needed, as no transformation happens. Would it be possible to skip calculating the valid/invalid masks when invalid_value_treatment is set to 'as_is'?