Closed johncliu closed 6 years ago
My setup for reference:
python: 3.5.4
sklearn: 0.18.1
sklearn.externals.joblib: 0.10.3
pandas: 0.21.1
sklearn_pandas: 1.6.0
sklearn2pmml: 0.28.0
java -cp /root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-3.0.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.3.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.6.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.1.3.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.4.2.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.2.4.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.8.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-buwtw3w9.pkl.z --pmml-output pipeline.pmml
When I run the pipeline without feature selection, the results match perfectly.
Very interesting observation.
What happens if you replace the "direct use" of SelectKBest with an "indirect use" of SelectorProxy(SelectKBest())? The meta-selector class SelectorProxy shields you from the internals of the actual feature selection logic.
Please try rearranging your code like this, and report back!
from sklearn2pmml import PMMLPipeline, SelectorProxy

pipeline = PMMLPipeline([
    ("vectorizer", vectorizer),
    ("feature_selector", SelectorProxy(feature_selector)),  # THIS!
    ("classifier", classifier)
])
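For context, here is a minimal self-contained sketch of this kind of pipeline using plain scikit-learn parts; CountVectorizer, SelectKBest(chi2), LogisticRegression, and the toy data are illustrative stand-ins for the actual vectorizer, feature_selector, and classifier from the report (sklearn2pmml is not required to run it):

```python
# Sketch of a text-classification pipeline with chi2-based feature selection.
# All component choices and data below are assumptions, not the reporter's code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "i loved the movie", "great film and great acting",
    "terrible plot and bad acting", "i hated this boring movie",
    "wonderful and moving story", "awful, a complete waste of time",
]
labels = [1, 1, 0, 0, 1, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("feature_selector", SelectKBest(score_func=chi2, k=5)),  # keep top-5 terms
    ("classifier", LogisticRegression()),
])
pipeline.fit(texts, labels)

# Class-1 probability for a new sentence.
probs = pipeline.predict_proba(["great movie"])[:, 1]
```

When exporting with sklearn2pmml, the workaround above simply wraps the `SelectKBest` step in `SelectorProxy(...)` inside a `PMMLPipeline` instead of a plain `Pipeline`.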
Using SelectorProxy(feature_selector), the results align perfectly between sklearn and jpmml-sklearn:
k | sklearn | pmml |
---|---|---|
90 | 0.8055624 | 0.8055624 |
95 | 0.8011768 | 0.8011768 |
100 | 0.8011904 | 0.8011904 |
105 | 0.7944084 | 0.7944084 |
110 | 0.7970723 | 0.7970723 |
150 | 0.7964169 | 0.7964169 |
If that's the suggested workaround, we'll go with it. Thanks!
> Using SelectorProxy(feature_selector), the results align perfectly between sklearn and jpmml-sklearn:
Thanks for reporting back such great news!
The results between Scikit-Learn and (J)PMML should actually align up to the 14th or 15th decimal place (you're only checking the first seven decimal places). In the future, if you continue your research and happen to find a discrepancy around the 12th or 13th decimal place, please let me know about it again.
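One way to check agreement beyond a handful of printed digits is an absolute-tolerance comparison; a sketch using numpy (the arrays here are illustrative, not the reporter's actual scores):

```python
import numpy as np

# Illustrative class-1 probability vectors from two scoring engines; in
# practice these would be sklearn's predict_proba output and the PMML
# engine's output for the same inputs.
sklearn_scores = np.array([0.8055624, 0.8011768, 0.8011904])
pmml_scores = np.array([0.8055624, 0.8011768, 0.8011904])

# Require agreement to ~14 decimal places, not just the printed seven.
agree = np.allclose(sklearn_scores, pmml_scores, rtol=0.0, atol=1e-14)
```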
> If that's the suggested workaround, we'll go with it.
Apparently, the JPMML-SkLearn library handles the SelectKBest(score_func = chi2) case incorrectly.
There are several other bug reports about Scikit-Learn and (J)PMML prediction mismatches, and all of these pipelines appear to contain a SelectKBest(score_func = chi2) step:
https://github.com/jpmml/sklearn2pmml/issues/69#issue-276313176
https://github.com/jpmml/sklearn2pmml/issues/68#issuecomment-346227053
Yup, SelectKBest could also be causing those discrepancies in #68 and #69. I'll take a look at SelectKBest.java to see if I can track it down, but in the meantime I'll close this ticket given the SelectorProxy() workaround. Thanks!
Similar to #82, I noticed a sizable inconsistency when I incorporated SelectKBest feature selection with a LogisticRegression classifier and varied the number of selected features k.
I'm using the following sklearn snippet:
and jpmml snippet:
For the sentence above, the class 1 predictions for different values of k are:
When I run the pipeline without feature selection, the results match perfectly. I ran this across multiple datasets and got the same strange behavior. Below is a 100-line training file (extracted from the UofMich Sentiment Analysis Challenge corpus on Kaggle) that I used to generate the above results:
and read with this python snippet: