jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Failed to parse PKL via sklearn2pmml for KNeighborsClassifier #146

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hi,

I got an exception while trying to export PMML from a KNeighborsClassifier.

Versions:

Java 1.8.0_191-b12
Python 3.7.2

PIP packages:

lxml 4.3.3
numpy 1.16.2
pandas 0.24.2
patsy 0.5.1
pip 19.0.3
python-dateutil 2.8.0
pytz 2018.9
scikit-learn 0.20.3
scipy 1.2.1
setuptools 40.9.0
six 1.12.0
sklearn 0.0
sklearn-pandas 1.8.0
sklearn2pmml 0.44.0
statsmodels 0.9.0

scripts & error:


from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
import numpy
import os
iris = datasets.load_iris()
X = iris.data
Y = iris.target
model = KNeighborsClassifier(n_neighbors=10)
pipeline = PMMLPipeline([
    ('KNeighborsClassifier', model)
])
pipeline.active_fields = numpy.array(iris.feature_names)
pipeline.target_fields = numpy.array('Species')
pipeline.fit(X, Y)
sklearn2pmml(pipeline, 'KNeighborsClassifier.pmml')

SEVERE: Failed to parse PKL
net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 219
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:355)
        at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:122)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
        at org.jpmml.sklearn.Main.run(Main.java:104)
        at org.jpmml.sklearn.Main.main(Main.java:94)
Exception in thread "main" net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 219
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:355)
        at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:122)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
        at org.jpmml.sklearn.Main.run(Main.java:104)
        at org.jpmml.sklearn.Main.main(Main.java:94)

vruusmann commented 5 years ago

FYI - you can "quote" blocks of code by using triple backticks.

It's impossible to analyze or fix this issue without access to the problematic pickle file. It's probably some OS- or pickle-library-specific problem.

The JPMML-SkLearn library depends on the latest Pyrolite library version, and if Pyrolite is unable to parse a pickle file, then there's nothing that I can do about it.

ghost commented 5 years ago

Is this the same situation?


from sklearn import datasets
from sklearn import tree
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
import numpy
iris = datasets.load_iris()
X = iris.data
Y = iris.target
model = tree.DecisionTreeClassifier()
pipeline = PMMLPipeline([
    ('DecisionTreeClassifier', model)
])
pipeline.active_fields = numpy.array(iris.feature_names)
pipeline.target_fields = numpy.array('Species')
pipeline.fit(X, Y)
sklearn2pmml(pipeline, 'DecisionTreeClassifier.pmml')

SEVERE: Failed to parse PKL

net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 254
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:355)
    at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:122)
    at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
    at org.jpmml.sklearn.Main.run(Main.java:104)
    at org.jpmml.sklearn.Main.main(Main.java:94)

vruusmann commented 5 years ago

The first exception complains about pickle opcode 219, whereas the second one complains about 254. Even though the opcode is different, I suspect they both refer to the same problem - your Pickle and/or Python setup is broken in some way, and the sklearn.externals.joblib.dump() function is generating broken Pickle files.

Can you unpickle this same file in Python?
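
For context on the two values in the error messages: neither 219 nor 254 is an opcode defined by any pickle protocol, which is consistent with the unpickler reading bytes that are not valid pickle data at that position. A minimal sketch that checks this with the standard library's `pickletools` module (the helper name `is_pickle_opcode` is made up for illustration, and it says nothing about why the bytes ended up there):

```python
import pickletools

# Byte values of every opcode known to this Python version's pickle module.
KNOWN_OPCODES = {ord(op.code) for op in pickletools.opcodes}


def is_pickle_opcode(byte_value):
    """Return True if byte_value is a defined pickle opcode."""
    return byte_value in KNOWN_OPCODES


# The two values reported by net.razorvine.pickle.InvalidOpcodeException.
for value in (219, 254):
    print(value, is_pickle_opcode(value))
```

Both values come back as not defined, so the two reports differ only in which stray byte the Java-side unpickler happened to hit first.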

ghost commented 5 years ago

Hi @vruusmann ,

I can use joblib dump/load in Python.


from sklearn import datasets
from sklearn import tree
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn.externals import joblib
import numpy

iris = datasets.load_iris()
X = iris.data
Y = iris.target

pipeline = PMMLPipeline([
    ("DecisionTreeClassifier",  tree.DecisionTreeClassifier())
])
pipeline.active_fields = numpy.array(iris.feature_names)
pipeline.target_fields = numpy.array('Species')
pipeline.fit(X, Y)

dumpFile = "DecisionTreeClassifier-estimator.joblib"
joblib.dump(pipeline, dumpFile)

model2 = joblib.load(dumpFile)
model2.predict(X)

ghost commented 5 years ago

Just FYI: the same error occurs with VotingClassifier as with KNeighborsClassifier. The process complains about invalid pickle opcode: 219


from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
import numpy

iris = datasets.load_iris()
X = iris.data
Y = iris.target

clf1 = LogisticRegression(solver='lbfgs',multi_class='ovr',random_state=0)
clf2 = GaussianNB()
clf3 = KNeighborsClassifier(n_neighbors=7)
model = VotingClassifier(estimators=[('lr', clf1), ('gnb', clf2), ('knn', clf3)], voting='hard')
target = 'Species'

pipeline = PMMLPipeline([
    ("VotingClassifier", model)
])
pipeline.active_fields = numpy.array(iris.feature_names)
pipeline.target_fields = numpy.array(target)

clf1.fit(X, Y)
clf2.fit(X, Y)
clf3.fit(X, Y)
pipeline.fit(X, Y)
sklearn2pmml(pipeline, 'VotingClassifier.pmml')

Standard output is empty
Traceback (most recent call last):
  File "VotingClassifier.py", line 43, in <module>
Standard error:
Apr 08, 2019 10:24:15 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Apr 08, 2019 10:24:15 AM org.jpmml.sklearn.Main run
SEVERE: Failed to parse PKL
net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 219
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:355)
    at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:122)
    at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
    at org.jpmml.sklearn.Main.run(Main.java:104)
    at org.jpmml.sklearn.Main.main(Main.java:94)
Exception in thread "main" net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 219
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:355)
    at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:77)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:122)
    at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
    at org.jpmml.sklearn.Main.run(Main.java:104)
    at org.jpmml.sklearn.Main.main(Main.java:94)
    sklearn2pmml(pipeline, relativeOutputPath + model_type + '.pmml')
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn2pmml\__init__.py", line 252, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

vruusmann commented 5 years ago

If you can't successfully complete the simplest exercise - training a decision tree classifier for the iris dataset - then there's no point in trying anything more complicated.

Anyway, I maintain my original position that there's something wrong with the way your Pickle/Scikit-Learn/Python/OS/Architecture is saving pickle files (they are corrupt, as indicated by the net.razorvine.pickle.InvalidOpcodeException type).

If it was a global problem, then there would be one hundred high priority issues raised in this issue tracker right now. But there's only this one.
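
One way to compare the pieces named above between a working and a failing machine is to print the interpreter build details on each. A sketch using only the standard library; the function name `describe_python_build` is hypothetical and the choice of fields is only a suggestion, not something the converter itself checks:

```python
import pickle
import platform
import struct
import sys


def describe_python_build():
    """Collect interpreter details that could differ between two setups."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "pointer_bits": struct.calcsize("P") * 8,  # 32 or 64
        "byte_order": sys.byteorder,
        "os": platform.system(),
        "machine": platform.machine(),
        "highest_pickle_protocol": pickle.HIGHEST_PROTOCOL,
    }


for key, value in describe_python_build().items():
    print(key, "=", value)
```

The traceback earlier in this thread shows an install path ending in `Python37-32`, which suggests (but does not prove) a 32-bit interpreter, so `pointer_bits` is a reasonable first field to compare.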

ghost commented 5 years ago

Hi @vruusmann ,

I would just like to clarify that the problem does not occur with a pure Python 3 joblib dump and load. Btw, thanks for your help.

The procedures below work fine.


joblib.dump(pipeline, dumpFile)
model2 = joblib.load(dumpFile)

ghost commented 5 years ago

Hi @vruusmann ,

After uninstalling Python 3.7.2 (downloaded from https://www.python.org/downloads/) and installing Anaconda 4.6.11 (which uses Python 3.7.3), the PMML could be generated correctly.

Sorry for my previous comment. #146 is not an issue on my end. I will also verify #148 and close if it is caused by the same situation.

Thank you!