jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Add support for `distance` weight function in KNN models #42

Closed vivekk0903 closed 2 years ago

vivekk0903 commented 7 years ago

I am getting the "returned non-zero exit status 1" error with the new version 0.17 of sklearn2pmml when using it with GridSearchCV.

Version info

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')

Code to reproduce

1) Working correctly:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

knn_pipe.fit(X,y)
sklearn2pmml(knn_pipe, ".../SimpleFit.pmml", with_repr = True, debug = True)

2) Throwing error:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

param_grid = {"regressor__n_neighbors": [3, 2,10],
          "regressor__weights": ["uniform","distance"],
          "regressor__algorithm": ["auto", "ball_tree", "kd_tree"]}
cv = GridSearchCV(knn_pipe, param_grid=param_grid)
cv.fit(X,y)

Using the following line gives "TypeError: The pipeline object is not an instance of PMMLPipeline" which is understandable.

sklearn2pmml(cv, ".../GridSearchFit.pmml", with_repr = True, debug = True)

So I tried passing cv.best_estimator_ instead, but it throws the "returned non-zero exit status 1" error.

sklearn2pmml(cv.best_estimator_, ".../GridSearchFit.pmml", with_repr = True, debug = True)

Stack trace of error:

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')
java -cp /usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-yd1bTD.pkl.z --repr-pipeline PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='distance'))]) --pmml-output /home/.../GridSearchFit.pmml
('Preserved joblib dump file(s): ', '/tmp/pipeline-yd1bTD.pkl.z')
Traceback (most recent call last):

  File "<ipython-input-12-b7a0923021e7>", line 1, in <module>
    sklearn2pmml(cv.best_estimator_, "/home/.../GridSearchFit.pmml", with_repr = True, debug = True)

  File "/usr/local/lib/python2.7/dist-packages/sklearn2pmml/__init__.py", line 132, in sklearn2pmml
    subprocess.check_call(cmd)

  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['java', '-cp', '/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', '/tmp/pipeline-yd1bTD.pkl.z', '--repr-pipeline', "PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n          metric_params=None, n_jobs=1, n_neighbors=10, p=2,\n          weights='distance'))])", '--pmml-output', '/home/.../GridSearchFit.pmml']' returned non-zero exit status 1

Here is the saved pickle file for this error. I have renamed it from Grid_pipeline-yd1bTD.pkl.z to Grid_pipeline-yd1bTD.pkl.zip to be able to upload it here. Grid_pipeline-yd1bTD.pkl.zip

vivekk0903 commented 7 years ago

Sorry, I meant to post it on sklearn2pmml page, but by mistake posted it here. Sorry again.

vruusmann commented 7 years ago

('sklearn2pmml: ', '0.17.0')

That's a fairly outdated version.

Please upgrade to latest version of SkLearn2PMML, which is 0.20.3 at the moment.
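Before retrying, it may help to confirm which installation the active interpreter actually picks up. The following is a minimal sketch, assuming Python 3.8 or newer (the report above uses Python 2.7, where `importlib.metadata` is not available); the helper name `describe_install` is hypothetical and not part of sklearn2pmml.

```python
import importlib.util
from importlib import metadata


def describe_install(name):
    """Return the version and file location of an importable package.

    Returns None when the current interpreter cannot find the package at all,
    which is what happens when it was installed for a different user.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but without installed distribution metadata
        version = None
    return {"version": version, "origin": spec.origin}


if __name__ == "__main__":
    # Prints None if sklearn2pmml is not visible to this interpreter
    print(describe_install("sklearn2pmml"))
```

If the reported location points at a different site-packages directory than expected, the upgrade probably went to another user's environment.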

cv = GridSearchCV(knn_pipe, param_grid=param_grid)
sklearn2pmml(cv.best_estimator_, "GridSearchFit.pmml", with_repr = True)

The sklearn2pmml() function requires the first argument to be an instance of sklearn2pmml.PMMLPipeline. After fitting a GridSearchCV meta-model, you should construct a dummy PMMLPipeline around the best estimator, like this:

cv = GridSearchCV(...)
cv.fit(X, y)

pipeline = PMMLPipeline([
  ("best_estimator", cv.best_estimator_)
])
# Additionally, set feature and label names
pipeline.active_fields = X.columns.values
pipeline.target_field = y.name

sklearn2pmml(pipeline, ...)

I tried converting the attached Pickle file with JPMML-SkLearn command-line application, and got the following result:

$ java -jar target/converter-executable-1.3-SNAPSHOT.jar --pkl-input Grid_pipeline-yd1bTD.pkl.z --pmml-output Grid_pipeline.pmml

juuni 20, 2017 10:18:39 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
juuni 20, 2017 10:18:40 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 77 ms.
juuni 20, 2017 10:18:40 AM org.jpmml.sklearn.Main run
INFO: Converting..
juuni 20, 2017 10:18:40 AM sklearn2pmml.PMMLPipeline encodePMML
WARNING: The 'target_field' attribute is not set. Assuming y as the name of the target field
juuni 20, 2017 10:18:40 AM sklearn2pmml.PMMLPipeline initFeatures
WARNING: The 'active_fields' attribute is not set. Assuming [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13] as the names of active fields
juuni 20, 2017 10:18:40 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: distance
        at sklearn.neighbors.KNeighborsUtil.encodeNeighbors(KNeighborsUtil.java:127)
        at sklearn.neighbors.KNeighborsRegressor.encodeModel(KNeighborsRegressor.java:56)
        at sklearn.neighbors.KNeighborsRegressor.encodeModel(KNeighborsRegressor.java:31)
        at sklearn.Estimator.encodeModel(Estimator.java:46)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:136)
        at org.jpmml.sklearn.Main.run(Main.java:144)
        at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException: distance
        at sklearn.neighbors.KNeighborsUtil.encodeNeighbors(KNeighborsUtil.java:127)
        at sklearn.neighbors.KNeighborsRegressor.encodeModel(KNeighborsRegressor.java:56)
        at sklearn.neighbors.KNeighborsRegressor.encodeModel(KNeighborsRegressor.java:31)
        at sklearn.Estimator.encodeModel(Estimator.java:46)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:136)
        at org.jpmml.sklearn.Main.run(Main.java:144)
        at org.jpmml.sklearn.Main.main(Main.java:93)

In brief, the problem is that your KNeighborsRegressor model object uses the distance weight function, which is currently not supported. You should fall back to the uniform weight function (this is SkLearn's default):

regressor = KNeighborsRegressor(..., weights = "uniform")
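For the grid search in the original report, one way to apply this fallback is to drop the unsupported value from the parameter grid before fitting. This is only a sketch: the helper name `restrict_to_supported_weights` and its defaults are hypothetical, and the set of supported values is taken from the comment above rather than from the converter's source.

```python
def restrict_to_supported_weights(param_grid,
                                  key="regressor__weights",
                                  supported=("uniform",)):
    """Return a copy of param_grid that keeps only convertible weight functions."""
    restricted = dict(param_grid)
    if key in restricted:
        kept = [value for value in restricted[key] if value in supported]
        if not kept:
            raise ValueError("No supported value left for %r" % key)
        restricted[key] = kept
    return restricted


# The grid from the original report
param_grid = {"regressor__n_neighbors": [3, 2, 10],
              "regressor__weights": ["uniform", "distance"],
              "regressor__algorithm": ["auto", "ball_tree", "kd_tree"]}

safe_grid = restrict_to_supported_weights(param_grid)
```

The restricted grid can then be passed as `GridSearchCV(knn_pipe, param_grid=safe_grid)`, so the search never selects a model that the converter rejects.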

The distance weight function can be represented in PMML for the most part. IIRC, there only needs to be a special handler for the zero distance.
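To make the zero-distance special case concrete, here is a plain-Python sketch of inverse-distance weighting for a single KNN regression prediction. It reflects my understanding of how scikit-learn treats weights="distance" (neighbors at zero distance take all of the weight); it is not the JPMML-SkLearn implementation, and the function names are hypothetical.

```python
def inverse_distance_weights(distances):
    """Return one weight per neighbor for inverse-distance weighting."""
    zero_hits = [d == 0.0 for d in distances]
    if any(zero_hits):
        # Special handler: 1/d is undefined at d == 0, so exact matches
        # get weight 1 and every other neighbor gets weight 0.
        return [1.0 if hit else 0.0 for hit in zero_hits]
    return [1.0 / d for d in distances]


def weighted_prediction(distances, targets):
    """Weighted average of the neighbors' target values."""
    weights = inverse_distance_weights(distances)
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, targets)) / total
```

Without the zero-distance branch, a query point that coincides with a training point would trigger a division by zero, which is presumably why that case needs separate treatment in the PMML representation.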

vivekk0903 commented 7 years ago

OK, I have done as you suggested.

1) Updated to the newest available version. I had actually upgraded before posting the issue, using the command pip install --user git+https://github.com/jpmml/sklearn2pmml.git. But I overlooked the '--user' option and was testing as a different user, so the old version was still being used.

Now the error message has become a bit more clear.

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.20.3')
java -cp /usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-20.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.3.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.3.3.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.7.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.18.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.7.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.6.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-QbjihM.pkl.z --pmml-output /home/local/EZDI/vivek.k/GridSearchFit.pmml
('Preserved joblib dump file(s): ', '/tmp/pipeline-QbjihM.pkl.z')
Traceback (most recent call last):

  File "<ipython-input-4-ab9a7c1ff136>", line 7, in <module>
    sklearn2pmml(cv.best_estimator_, "/home/local/EZDI/vivek.k/GridSearchFit.pmml", with_repr = True, debug = True)

  File "/usr/local/lib/python2.7/dist-packages/sklearn2pmml/__init__.py", line 142, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams
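In an IPython or notebook session the Java process's output may not be visible, so the actual cause (here, the unsupported "distance" value) never reaches the user. One way to surface it is to rerun the preserved command by hand and capture its error stream. The sketch below assumes Python 3.7 or newer; `run_and_report` is a hypothetical helper, not part of sklearn2pmml, and `DEMO_CMD` is a stand-in child process so the example runs without Java.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return its stdout; on failure, raise with stderr attached."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "Command failed with exit status %d:\n%s"
            % (result.returncode, result.stderr)
        )
    return result.stdout


# Stand-in for the java command line printed in debug mode (an assumption for
# illustration): it writes a message to stderr and exits with status 1.
DEMO_CMD = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('IllegalArgumentException: distance'); sys.exit(1)",
]
```

Replacing `DEMO_CMD` with the `java -cp ... org.jpmml.sklearn.Main ...` argument list shown in the debug output should put the Java stack trace into the Python exception message.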

2) Why is there a need to wrap the cv.best_estimator_ again inside a PMMLPipeline?

The cv.best_estimator_ is already an instance of PMMLPipeline: type(cv.best_estimator_) returns sklearn2pmml.PMMLPipeline. So I don't think it should be necessary to wrap it again. When I removed "distance" from the grid-search parameters, I got no errors when passing cv.best_estimator_ to the sklearn2pmml command.

vruusmann commented 7 years ago

Why is there a need to wrap the cv.best_estimator_ again inside a PMMLPipeline?

Very interesting - I didn't know that GridSearchCV can take a pipeline as the first argument. I was assuming that you were using a "raw" estimator as the first argument, and wanted to know how to wrap the cv.best_estimator_ to make it acceptable for the sklearn2pmml() function.

Probably got misled by this StackOverflow thread: https://stackoverflow.com/questions/44643123

Indeed, if the cv.best_estimator_ is already an sklearn2pmml.PMMLPipeline, then there's no need to re-wrap it.
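The two cases discussed here (a raw estimator versus an estimator that is already a pipeline) can be handled by a small guard. This is a sketch only: `ensure_pipeline` is a hypothetical helper, and the pipeline class is passed in as an argument so the snippet does not depend on sklearn2pmml being installed.

```python
def ensure_pipeline(estimator, pipeline_cls, step_name="best_estimator"):
    """Return estimator unchanged if it is already a pipeline_cls instance,
    otherwise wrap it as the single step of a new pipeline_cls."""
    if isinstance(estimator, pipeline_cls):
        return estimator
    return pipeline_cls([(step_name, estimator)])
```

With sklearn2pmml available, the call would look like `ensure_pipeline(cv.best_estimator_, PMMLPipeline)`; when the grid search was run over a PMMLPipeline, as in this thread, the best estimator is returned as-is.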

vivekk0903 commented 7 years ago

I was the one who suggested there to use cv.best_estimator_ directly in the sklearn2pmml command, before I tested it myself and found this issue.

The single answer on that page was written after I opened this issue, and it uses your recommendation to wrap it again. So I don't know who is following whom here. :p

So the only issue that remains here is to add support for the "distance" weight function.