Support for `TfidfVectorizer.norm` attribute

mathlf2015 commented 6 years ago

I was recently looking for a way to transfer a machine learning model across platforms, from Python to Java, and I want to use TfidfVectorizer. However, the model fits successfully but cannot be saved. The code is as follows (Anaconda Python 3.6, Linux):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn_pandas import DataFrameMapper

testdata = pd.DataFrame({
    'pet': ['cat aaa', 'dog  ddd', 'dog  ccc', 'fish eee fff', 'cat ccc aaa ddd', 'dog ddd fff', 'cat ccc', 'fish fff'],
    'age': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90, 24, 44, 27, 32, 59, 36, 27],
})

mapper = DataFrameMapper([
    ('pet', TfidfVectorizer()),
])
vod_pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", LogisticRegression()),
])

testdata['label'] = [1, 1, 1, 1, 1, 0, 0, 0]
vod_pipeline.fit(testdata, testdata['label'])
print(vod_pipeline.score(testdata, testdata['label']))

sklearn2pmml(vod_pipeline, '11.pmml', with_repr=True,debug=True)

The debug output is as follows:

0.75
python: 3.6.1
sklearn: 0.19.1
sklearn.externals.joblib: 0.11
pandas: 0.23.0
sklearn_pandas: 1.6.0
sklearn2pmml: 0.36.1
java: 1.8.0_131
Executing command:
java -cp /root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/guava-25.1-jre.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-api-2.3.0.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/javax.activation-api-1.2.0.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-runtime-2.3.0.1.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jcommander-1.72.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.3.1.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/istack-commons-runtime-3.0.5.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-core-2.3.0.1.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.5.4.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.2.1.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-1.4.2.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/pyrolite-4.20.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/serpent-1.23.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-metro-1.4.2.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-agent-1.4.2.jar:/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-converter-1.3.2.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-vbufsjpg.pkl.z --pmml-output 11.pmml
Standard output is empty
Standard error:
Jul 06, 2018 12:52:57 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jul 06, 2018 12:52:57 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 30 ms.
Jul 06, 2018 12:52:57 AM org.jpmml.sklearn.Main run
INFO: Converting..
Jul 06, 2018 12:52:57 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: l2
    at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:73)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
    at sklearn.Initializer.encodeFeatures(Initializer.java:41)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:192)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IllegalArgumentException: l2
    at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:73)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
    at sklearn.Initializer.encodeFeatures(Initializer.java:41)
    at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:81)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:192)
    at org.jpmml.sklearn.Main.run(Main.java:145)
    at org.jpmml.sklearn.Main.main(Main.java:94)

Preserved joblib dump file(s): /tmp/pipeline-vbufsjpg.pkl.z
Traceback (most recent call last):
  File "test4.py", line 31, in <module>
    sklearn2pmml(vod_pipeline, '11.pmml', with_repr=True,debug=True)
  File "/root/anaconda3/lib/python3.6/site-packages/sklearn2pmml/__init__.py", line 237, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

mathlf2015 commented 6 years ago

I found the same problem described here, but I can't work out how to solve it: https://stackoverflow.com/questions/44560823/generate-pmml-for-text-classification-pipeline-in-python

vruusmann commented 6 years ago

Exception in thread "main" java.lang.IllegalArgumentException: l2

The TfidfVectorizer.norm attribute is not supported.

You have it set to "l2", but you need to set it to None.
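For context, here is a minimal sketch of what the `norm` setting changes. It is not the scikit-learn or JPMML-SkLearn implementation: the helper names `tfidf_rows` and `row_length` are made up for illustration, and tokenization is assumed to be a plain whitespace split. The IDF uses the smoothed formula ln((1 + n) / (1 + df)) + 1, which is scikit-learn's documented default. With `norm="l2"` every row is divided by its Euclidean length; with `norm=None` the raw TF-IDF weights are kept.

```python
import math


def tfidf_rows(docs, norm=None):
    """Compute TF-IDF rows for whitespace-tokenized documents (illustrative helper).

    IDF is the smoothed form ln((1 + n) / (1 + df)) + 1.
    If norm == "l2", each row is scaled to unit Euclidean length.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(term in toks for toks in tokenized) for term in vocab}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for toks in tokenized:
        row = [toks.count(term) * idf[term] for term in vocab]
        if norm == "l2":
            length = math.sqrt(sum(v * v for v in row))
            if length > 0:
                row = [v / length for v in row]
        rows.append(row)
    return vocab, rows


def row_length(row):
    """Euclidean length of one feature row (illustrative helper)."""
    return math.sqrt(sum(v * v for v in row))


docs = ["cat aaa", "dog ddd", "cat ccc aaa ddd"]
vocab, raw = tfidf_rows(docs, norm=None)
_, unit = tfidf_rows(docs, norm="l2")

print(vocab)
print([round(row_length(r), 3) for r in raw])   # unnormalized row lengths
print([round(row_length(r), 3) for r in unit])  # row lengths after L2 scaling
```

Because the two settings produce different feature values, a classifier fitted on `norm=None` features is a different model from one fitted on the default `norm="l2"` features, so it is worth re-checking the score after making the change.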

mathlf2015 commented 6 years ago

Thank you very much, and best regards. I couldn't have solved this problem without your help. The model now saves successfully. The code changes are as follows:

from sklearn2pmml.feature_extraction.text import Splitter

# Before the change
mapper = DataFrameMapper([
    ('pet', TfidfVectorizer()),
])

# After the change
mapper = DataFrameMapper([
    ('pet', TfidfVectorizer(norm=None, analyzer="word", tokenizer=Splitter())),
])

jpmml / sklearn2pmml

Support for `TfidfVectorizer.norm` attribute #98