jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Customizing the name and data type of `DataField` elements #185

Closed WXinzhe closed 1 year ago

WXinzhe commented 1 year ago

Hi,

I want to use the JPMML-SkLearn command-line application to convert a pickle file to a PMML file. The command is as follows:

java -Xmx4096m -jar pmml-sklearn-example-executable-1.7.24.jar --pkl-input scikit_model.pkl --pmml-output pmml_model.xml

There are no errors in the log. However, in the output PMML XML file, the name of the target DataField and the data types of the feature DataFields are not as expected. The name of target DataField should be "label_bool" not y. The datatype of feature DataFields should not be double for all features.

To reproduce this problem, I have attached the test case, the pickle file and the PMML file: jpmmlscikitlearntest.zip

The data and model definitions are in test.py. For some reason, I can't use sklearn2pmml in my env, so I can only use the JPMML-SkLearn command-line application to generate the PMML file. Is there any way to specify the target DataField name and the feature DataField types when using the JPMML-SkLearn command-line application? Please have a look. Thanks.

vruusmann commented 1 year ago

i can't use sklearn2pmml in my env

Why not?

When I try to run your test.py script on my computer, then it won't let me use the sklearn2pmml.sklearn2pmml(...) utility function because the type of the pipeline object is sklearn.pipeline.Pipeline.

However, when I replace it with sklearn2pmml.pipeline.PMMLPipeline, then the test.py script completes successfully:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# THIS!
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", model)
])
pipeline.fit(x_train, y_train)

sklearn2pmml(pipeline, "scikit_model.pmml")

vruusmann commented 1 year ago

The name of target DataField should be "label_bool" not y.

My newly generated scikit_model.pmml PMML document contains the correct target field name:

<DataDictionary>
    <DataField name="label_bool" optype="categorical" dataType="boolean">
        <Value value="false"/>
        <Value value="true"/>
    </DataField>
</DataDictionary>

The name of the target field should be stored in the PMMLPipeline.target_fields attribute. This attribute is set correctly in my example code above.

Please note that there are no feature or target name attributes in the "plain" Scikit-Learn pipeline class. You need to replace sklearn.pipeline.Pipeline class with sklearn2pmml.pipeline.PMMLPipeline class for them to become available.

vruusmann commented 1 year ago

The datatype of feature DataFields should not be double for all features.

The SkLearn pipeline object currently does not store any data type information. Therefore, the PMML converter cannot be more specific here.

Please note that the data type information is stored as x_train.dtypes. It stays there; it does not get "internalized" into the pipeline object as a result of the Pipeline.fit(X, y) method call.

You can customize column types in the leading "mapper" step, by using sklearn2pmml.decoration.CategoricalDomain(dtype = ...) and sklearn2pmml.decoration.ContinuousDomain(dtype = ...) decorators.

For example, persisting the information that the "int32" column is a (continuous-) integer, and the "bool1" column is a (categorical-) boolean:

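import numpy
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper([
  (["int32"], [ContinuousDomain(dtype = numpy.int32)]),
  (["bool1"], [CategoricalDomain(dtype = bool)])
])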

This kind of "type-casting mapper" could be generated programmatically, using x_train.dtypes as a template.
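As a rough illustration of that idea, the sketch below derives one decorator per column from a name-to-dtype mapping such as x_train.dtypes. The helper names plan_domains and build_mapper are hypothetical (they are not part of sklearn2pmml), and the classification rule is an assumption: NumPy integer and float columns become continuous, bool columns become categorical, and everything else falls back to a categorical domain without an explicit dtype. It also assumes that DataFrameMapper comes from the sklearn_pandas package.

```python
import re

# Matches NumPy-style numeric dtype names such as "int32", "uint8", "float64".
NUMERIC_DTYPE = re.compile(r"(u?int|float)\d+")

def plan_domains(dtypes):
    """Return a list of (column, optype, dtype) tuples, one per column.

    `dtypes` is any mapping with an items() method, e.g. x_train.dtypes.
    """
    plan = []
    for column, dtype in dtypes.items():
        dtype_name = str(dtype)
        if NUMERIC_DTYPE.fullmatch(dtype_name):
            plan.append((column, "continuous", dtype))
        elif dtype_name == "bool":
            plan.append((column, "categorical", dtype))
        else:
            # Assumption: strings and other dtypes are treated as categorical,
            # leaving the dtype unspecified.
            plan.append((column, "categorical", None))
    return plan

def build_mapper(plan):
    """Turn the plan into a DataFrameMapper of domain decorators."""
    # Imported lazily so that plan_domains() has no third-party dependencies.
    from sklearn_pandas import DataFrameMapper
    from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

    steps = []
    for column, optype, dtype in plan:
        domain_class = ContinuousDomain if optype == "continuous" else CategoricalDomain
        domain = domain_class(dtype = dtype) if dtype is not None else domain_class()
        steps.append(([column], [domain]))
    return DataFrameMapper(steps)

# Usage, assuming x_train is the pandas DataFrame from test.py:
# mapper = build_mapper(plan_domains(x_train.dtypes))
```

Keeping the planning step separate from the mapper construction makes it easy to print and review the inferred column types before fitting the pipeline. If the original mapper applies further transformers to a column, they would need to be appended after the domain decorator in the corresponding step.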

Also, please note that using numpy.inf as a placeholder for missing values is a really bad choice. It leads to over-complicated PMML markup, which in turn leads to inferior PMML evaluation speeds. Consider switching to numpy.nan instead.
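One possible way to make that switch before fitting is sketched below, assuming the training data is a pandas DataFrame. The helper name inf_to_nan and the sample columns are made up for illustration, and replacing negative infinity as well is an assumption (the data in test.py may only use positive infinity). Depending on the estimator, the resulting NaN values may still need to be handled, for example by an imputation step.

```python
import numpy
import pandas

def inf_to_nan(frame):
    """Return a copy of the frame with infinite placeholders replaced by NaN."""
    return frame.replace([numpy.inf, -numpy.inf], numpy.nan)

# Made-up sample data; the real columns are defined in test.py.
sample = pandas.DataFrame({
    "x1": [1.0, numpy.inf, 3.0],
    "x2": [numpy.inf, 0.5, -numpy.inf]
})
cleaned = inf_to_nan(sample)
```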

vruusmann commented 1 year ago

Closing this issue as "works as intended".

The sample script runs fine on my computer (after replacing the SkLearn pipeline class with the SkLearn2PMML pipeline class). The SkLearn2PMML pipeline class will then be able to pick up column metadata (e.g. the target field name), and with some assistance from the human operator, it will be possible to incorporate all the necessary data type information as well.

WXinzhe commented 1 year ago

Oh, thank you!