jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Disambiguating the output fields of multi-output models (example: `The value for field "probability(0)" has already been defined`) #184

Closed lanze0000 closed 1 year ago

lanze0000 commented 1 year ago

My requirement is to load a multi-output classification model. I tried to understand the issue. It seems to occur because there are three outputs, and each of them declares an output field named probability(0). How can this be handled? I attempted to assign names to "y" (y1, y2, y3), but this seems to have no effect. My code is as follows:

# -*- coding: utf-8 -*-
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import PMMLPipeline, sklearn2pmml
import pandas as pd

X, y = make_multilabel_classification(n_samples=100, n_features=10, n_classes=3, random_state=1)
X = pd.DataFrame(X,columns=['a','b','c','d','e','f','g','h','i','j'])
y = pd.DataFrame(y,columns=['y1','y2','y3'])

classifier = MultiOutputClassifier(DecisionTreeClassifier())

pipeline = PMMLPipeline([
    ("classifier", classifier)
])

pipeline.fit(X, y)

sklearn2pmml(pipeline, "new1.pmml", with_repr=True, debug=True)

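# NOTE: the post does not show how `evaluator` was created; it is assumed to be a
# JPMML-Evaluator-Python evaluator loaded from "new1.pmml".
inputFields = evaluator.getInputFields()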
print("Input fields: " + str([inputField.getName() for inputField in inputFields]))

targetFields = evaluator.getTargetFields()
print("Target field(s): " + str([targetField.getName() for targetField in targetFields]))

outputFields = evaluator.getOutputFields()
print("Output fields: " + str([outputField.getName() for outputField in outputFields]))

evaluator.evaluateAll(X.head(10))

I get the following error message:

Input fields: ['b', 'f', 'g', 'd', 'e', 'a', 'j', 'c', 'h', 'i']
Target field(s): ['y1', 'y2', 'y3']
Output fields: ['probability(0)', 'probability(1)', 'probability(0)', 'probability(1)', 'probability(0)', 'probability(1)']
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
File PythonUtil.java:48, in org.jpmml.evaluator.python.PythonUtil.evaluateAll()

Exception: Java Exception

The above exception was the direct cause of the following exception:

org.jpmml.evaluator.DuplicateFieldValueExceptionTraceback (most recent call last)
File ~/.conda/envs/wenbin/lib/python3.8/site-packages/jpmml_evaluator/__init__.py:168, in Evaluator.evaluateAll(self, arguments_df, nan_as_missing)
    167 try:
--> 168     result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    169 except Exception as e:

File ~/.conda/envs/wenbin/lib/python3.8/site-packages/jpmml_evaluator/jpype.py:38, in JPypeBackend.staticInvoke(self, className, methodName, *args)
     37 javaMember = getattr(javaClass, methodName)
---> 38 return javaMember(*args)

org.jpmml.evaluator.DuplicateFieldValueException: org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability(0)" has already been defined

During handling of the above exception, another exception occurred:

JavaError                                 Traceback (most recent call last)
Input In [37], in <cell line: 10>()
      7 outputFields = evaluator.getOutputFields()
      8 print("Output fields: " + str([outputField.getName() for outputField in outputFields]))
---> 10 evaluator.evaluateAll(X.head(10))

File ~/.conda/envs/wenbin/lib/python3.8/site-packages/jpmml_evaluator/__init__.py:170, in Evaluator.evaluateAll(self, arguments_df, nan_as_missing)
    168     result_records = self.backend.staticInvoke("org.jpmml.evaluator.python.PythonUtil", "evaluateAll", self.javaEvaluator, argument_records)
    169 except Exception as e:
--> 170     raise self.backend.toJavaError(e)
    171 result_records = self.backend.loads(result_records)
    172 results_df = DataFrame.from_records(result_records)

JavaError: org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability(0)" has already been defined

How can I solve this issue? Do you have any good suggestions? Thank you very much!

vruusmann commented 1 year ago

This is a conversion-side issue, therefore moving it to the JPMML-SkLearn project (from the JPMML-Evaluator-Python project).

My requirement is to load a multi-output classification model. I tried to understand the issue.

In brief, you're trying to train a multi-decision tree classifier. The default "schema" for a decision tree classifier has two components: the target field (e.g. Species) plus a collection of output fields (e.g. probability(setosa), probability(versicolor), probability(virginica)).

The problem with multi-decision tree classifiers is that the same output field name gets declared many times, once per elementary classifier. The correct approach would be to generate a new, unique field name for each classifier.

For example, we could add a "model index" component to the field names. If your multi-classifier contains two elementary classifiers, identifiers such as "first" and "second" could be used to distinguish between them.

vruusmann commented 1 year ago

How can this be handled?

As a quick workaround, open the PMML file in a text editor, and add the missing "identifiers" manually.

vruusmann commented 1 year ago

How can this be handled?

Alternatively, if you're not interested in the predicted probability distribution, then you may manually delete Output elements (the parent element of OutputField elements).
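If editing the file by hand is inconvenient, the deletion can be scripted. The following is only a sketch, assuming the PMML file can be processed with Python's standard `xml.etree.ElementTree` module. The sample document is a hand-written, heavily simplified fragment (not real converter output), and the helper names `local_name` and `remove_output_elements` are made up for this illustration. Note that it removes every Output element it finds, which may be more than you want if some output fields are still needed.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment for illustration only (not real converter output).
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel>
    <Segmentation>
      <Segment id="1">
        <TreeModel>
          <MiningSchema><MiningField name="y1" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <TreeModel>
          <MiningSchema><MiningField name="y2" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def remove_output_elements(root):
    """Remove every Output element below root and return how many were removed."""
    removed = 0
    # Materialize the element list first, because the tree is modified in the loop.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "Output":
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE_PMML)
print("Removed Output elements:", remove_output_elements(root))
```

For a real file, the same function could be applied to `ET.parse("new1.pmml").getroot()`, with the tree written back to a new file afterwards. Keeping a copy of the original file is advisable.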

Your target field names are already unique (i.e. y1, y2, y3).

It would therefore make sense to disambiguate output field names using the pattern probability(<target name>, <target category name>), which gives probability(y1, 0), probability(y2, 0) and probability(y3, 0).
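The manual renaming could also be scripted. Below is a minimal sketch, assuming that each elementary model in the PMML file carries its own MiningSchema with exactly one target MiningField, plus an Output element whose OutputField children have `feature="probability"` and a `value` attribute. The sample document is a hand-written, simplified fragment rather than real converter output, and the function name `disambiguate_probability_fields` is invented for this example. If anything else in the document refers to the old field names, those references would need updating too, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment for illustration only (not real converter output).
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel>
    <MiningSchema>
      <MiningField name="y1" usageType="target"/>
      <MiningField name="y2" usageType="target"/>
    </MiningSchema>
    <Segmentation>
      <Segment id="1">
        <TreeModel>
          <MiningSchema><MiningField name="y1" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <TreeModel>
          <MiningSchema><MiningField name="y2" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def disambiguate_probability_fields(root):
    """Rename probability(<category>) to probability(<target>, <category>).

    Only elements that have both a MiningSchema with exactly one target field
    and an Output child are touched. Returns the new names in document order.
    """
    renamed = []
    for model in root.iter():
        children = {local_name(child.tag): child for child in model}
        schema = children.get("MiningSchema")
        output = children.get("Output")
        if schema is None or output is None:
            continue
        targets = [
            field.get("name")
            for field in schema
            if local_name(field.tag) == "MiningField"
            and field.get("usageType") in ("target", "predicted")
        ]
        if len(targets) != 1:
            continue  # ambiguous or missing target: leave this element alone
        for field in output:
            if local_name(field.tag) == "OutputField" and field.get("feature") == "probability":
                field.set("name", "probability({}, {})".format(targets[0], field.get("value")))
                renamed.append(field.get("name"))
    return renamed


root = ET.fromstring(SAMPLE_PMML)
for name in disambiguate_probability_fields(root):
    print(name)
```

To apply this to a real file, one would parse it with `ET.parse`, call the function on `tree.getroot()`, and save the result with `tree.write`. Calling `ET.register_namespace("", ...)` with the namespace URI found in the file beforehand keeps the default PMML namespace unprefixed in the output.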

lanze0000 commented 1 year ago

How can this be handled?

Alternatively, if you're not interested in the predicted probability distribution, then you may manually delete Output elements (the parent element of OutputField elements).

Your target field names are already unique (ie. y1, y2, y3).

It would therefore make sense to disambiguate output field names using the following pattern probability(<target name>, <target category name>): probability(y1, 0), probability(y2, 0) and probability(y3, 0).

Thank you very much. Your solution was effective and it solved my problem. May I ask if there are any other methods to achieve this goal by modifying parameters? I got the following result:

Input fields: ['g', 'd', 'e', 'a', 'j', 'c', 'h', 'b', 'f', 'i']
Target field(s): ['y1', 'y2', 'y3']
Output fields: ['probability(y1,0)', 'probability(y1,1)', 'probability(y2,0)', 'probability(y2,1)', 'probability(y3,0)', 'probability(y3,1)']


vruusmann commented 1 year ago

May I ask if there are any other methods to achieve this goal by modifying parameters?

This is the relevant portion of JPMML-SkLearn source code: https://github.com/jpmml/jpmml-sklearn/blob/1.7.27/pmml-sklearn/src/main/java/sklearn/multioutput/MultiOutputUtil.java#L45-L61

The problem is that the Estimator#encodeModel(Schema) Java method does not know whether it is being invoked in "single output mode" or "multi-output mode". For historical reasons, it always assumes "single output mode", and therefore generates duplicate field names when invoked once per elementary classifier.

The solution would be to send some kind of "context hint", which would then activate a different output field naming pattern.
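To make the idea concrete, here is a purely hypothetical Python sketch of how such a hint could switch the naming pattern. It is not the JPMML-SkLearn implementation (which is written in Java); the function names and the `target_name` parameter are invented for illustration. Omitting `target_name` mimics the existing single-output pattern, while passing it mimics the hinted multi-output pattern.

```python
from collections import Counter


def probability_field_names(categories, target_name=None):
    """Build probability output field names for one elementary classifier.

    target_name=None mimics the single-output naming pattern; passing a target
    name plays the role of a hypothetical multi-output "context hint".
    """
    if target_name is None:
        return ["probability({})".format(category) for category in categories]
    return ["probability({}, {})".format(target_name, category) for category in categories]


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


targets = ["y1", "y2", "y3"]
categories = [0, 1]

# Without a hint, every elementary classifier produces the same names.
without_hint = [name for _ in targets for name in probability_field_names(categories)]
# With a hint, the target name makes every field name unique.
with_hint = [name for target in targets for name in probability_field_names(categories, target)]

print(find_duplicates(without_hint))  # ['probability(0)', 'probability(1)']
print(find_duplicates(with_hint))     # []
```

The first list reproduces the duplicate names from the error report above; the second follows the probability(<target name>, <target category name>) pattern.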

Additionally, there could be another hint for disabling the generation of output fields in "multi output mode".

There are many potential solutions, so I will need to think a little about which one is the easiest and most long-lasting before the actual implementation happens.

Anyway, looks like an interesting/high priority issue, which should be addressed in the next version.

vruusmann commented 1 year ago

The solution would be to send some kind of "context hint", which would then activate a different output field naming pattern.

See also https://github.com/jpmml/sklearn2pmml/issues/361

Additionally, there could be another hint for disabling the generation of output fields in "multi output mode".

See also https://github.com/jpmml/jpmml-sklearn/issues/180

Anyway, looks like an interesting/high priority issue

Yep, we definitely have a cluster of related issues in this area.

lanze0000 commented 1 year ago

Ah, it's really an interesting question. I'm looking forward to your updates. Although I'm not very proficient yet, I will try your suggestions. Also, thank you very much for your patient explanations. You are really a kind person.