jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Failed after alias renaming DeriveField Name #80

Closed Quantuary closed 6 years ago

Quantuary commented 6 years ago

Hi,

First of all, thank you for your great work!

I ran into a problem when trying to rename the DerivedField name. For example, I would like to drop the "float()" wrapper and just use the name 'sepal-length'.

<TransformationDictionary>
        <DerivedField name="float(sepal-length)" optype="continuous" dataType="float">
            <FieldRef field="sepal-length"/>
        </DerivedField>
        <DerivedField name="float(sepal-width)" optype="continuous" dataType="float">
            <FieldRef field="sepal-width"/>
        </DerivedField>
        ...
</TransformationDictionary>

My code is as below:

import pandas as pd
from sklearn import ensemble

from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import StandardScaler

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = pd.read_csv(url, names=names)

x = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
y = ['class']

model = ensemble.RandomForestClassifier()
mapper = DataFrameMapper([
    (['sepal-length'], StandardScaler(), {'alias': 'sepal-length'}),
    (['sepal-width'], None, {'alias': 'sepal-width'}),
    (['petal-length'], None, {'alias': 'petal-length'}),
    (['petal-width'], None, {'alias': 'weird'})
])
pipeline = PMMLPipeline([
    ("columns", mapper),
    ("classifier", model)
])
pipeline.fit(iris[x], iris[y])
sklearn2pmml(pipeline, "py_rf.pmml", with_repr=True)

My error is:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

How do I debug this? I have little knowledge of Java.

Thank you for your help in advance.

vruusmann commented 6 years ago

The field sepal-length is already in use by a DataField element. It's not permitted to have another field element with the same name.
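The naming rule above can be checked before running the conversion. The following is a minimal sketch, not part of sklearn2pmml or sklearn_pandas: the helper `find_alias_collisions` is hypothetical, and it only assumes that every input column becomes a field and that an alias equal to one of those column names is a candidate for a clash. Which of the flagged aliases actually makes the converter fail depends on the converter itself.

```python
def find_alias_collisions(mapper_spec):
    """Return aliases that duplicate an input column name, in spec order.

    mapper_spec mirrors the DataFrameMapper feature list:
    (columns, transformer) or (columns, transformer, options).
    This helper is hypothetical and only inspects names.
    """
    input_columns = set()
    for columns, _transformer, *_options in mapper_spec:
        if isinstance(columns, str):
            columns = [columns]
        input_columns.update(columns)

    collisions = []
    for _columns, _transformer, *options in mapper_spec:
        alias = options[0].get("alias") if options else None
        if alias is not None and alias in input_columns:
            collisions.append(alias)
    return collisions


# Same shape as the mapper in the question; transformers are replaced by
# None because only the names matter for this check.
spec = [
    (["sepal-length"], None, {"alias": "sepal-length"}),
    (["sepal-width"], None, {"alias": "sepal-width"}),
    (["petal-length"], None, {"alias": "petal-length"}),
    (["petal-width"], None, {"alias": "weird"}),
]
print(find_alias_collisions(spec))
```

Run against the mapper from the question, this flags the three aliases that repeat a column name and leaves 'weird' alone.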

In the context of SkLearn decision tree models, the float(<name>) field is significant (performs conversion from 64-bit value space to 32-bit value space), and cannot be eliminated.
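To see why the conversion matters, here is a small standard-library illustration. The input value and the split threshold are made up for the example, and the `value <= threshold` split test is a simplified stand-in for how a decision tree picks a branch. The sketch only shows that the same comparison can give a different answer once the input is rounded to 32 bits.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def goes_left(value, threshold):
    """Simplified split test: take the left branch if value <= threshold."""
    return value <= threshold


x = 0.1          # hypothetical input, held as a 64-bit value
threshold = 0.1  # hypothetical split threshold

# Comparison on the raw 64-bit value.
print(goes_left(x, threshold))
# Comparison after the 64-bit to 32-bit conversion that float(<name>) stands for.
print(goes_left(to_float32(x), threshold))
```

The nearest 32-bit value to 0.1 is slightly larger than the 64-bit 0.1, so the first comparison takes the left branch and the second does not. A scoring engine that skips the conversion would follow a different path through the tree for such inputs.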

Quantuary commented 6 years ago

Thank you very much. The scoring engine my company uses cannot handle the transformation dictionary very well. I guess I have to either stick with R or manually edit the XML file. Thanks again!

vruusmann commented 6 years ago

> The scoring engine my company uses cannot handle the transformation dictionary very well.

It's not permitted to "simplify" PMML documents for arbitrary reasons. In the case of Scikit-Learn decision tree models, if you leave out this double-to-float conversion, the predictions can, and in some cases will, be incorrect.
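If a PMML file has been edited by hand or by another tool, one way to confirm that the conversions are still there is to list the float-typed DerivedField elements. This is a rough sketch using only the standard library. The sample document, its PMML 4.3 namespace, and the helper names `float_conversions` and `missing_float_conversions` are assumptions made for illustration, not something JPMML provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML document for illustration only.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <TransformationDictionary>
    <DerivedField name="float(sepal-length)" optype="continuous" dataType="float">
      <FieldRef field="sepal-length"/>
    </DerivedField>
    <DerivedField name="float(sepal-width)" optype="continuous" dataType="float">
      <FieldRef field="sepal-width"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def float_conversions(pmml_text):
    """Map each source field to the float-typed DerivedField that wraps it."""
    root = ET.fromstring(pmml_text)
    found = {}
    for element in root.iter():
        if local_name(element) != "DerivedField":
            continue
        if element.get("dataType") != "float":
            continue
        children = list(element)
        # Only count fields whose sole child is a plain FieldRef.
        if len(children) == 1 and local_name(children[0]) == "FieldRef":
            found[children[0].get("field")] = element.get("name")
    return found


def missing_float_conversions(pmml_text, expected_fields):
    """Return the expected input fields that have no float conversion."""
    present = float_conversions(pmml_text)
    return [name for name in expected_fields if name not in present]


print(float_conversions(SAMPLE))
print(missing_float_conversions(SAMPLE, ["sepal-length", "petal-length"]))
```

The check only reports what is present. It does not say whether a given model needs the conversion, so a non-empty result is a prompt to compare against the original converter output rather than proof of a broken document.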

Better upgrade your scoring engine.