Closed lanze0000 closed 1 year ago
This is a conversion-side issue, therefore moving it to the JPMML-SkLearn project (from the JPMML-Evaluator-Python project).
My requirement is to load a multi-output classification model. I tried to understand the issue. It
In brief, you're trying to train a multi-decision tree classifier. The default "schema" for decision tree classifiers has two components: the target field (eg. Species) plus a collection of output fields (eg. probability(setosa), probability(versicolor), probability(virginica)).
The problem with multi-decision tree classifiers is that the same field name gets declared many times. The correct approach would be to generate a new and unique field name for each classifier.
For example, we could add some "model index" component to field names. If your multi-classifier contains two elementary classifiers, then we could use identifiers such as "first" and "second" to make a distinction between them:
Species(first) and Species(second)
probability(first, setosa), probability(first, versicolor), probability(first, virginica) and probability(second, setosa), probability(second, versicolor), probability(second, virginica)
How can this be handled?
As a quick workaround, open the PMML file in a text editor, and add the missing "identifiers" manually.
Alternatively, if you're not interested in the predicted probability distribution, then you may manually delete Output elements (the parent element of OutputField elements).
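The manual deletion described above could also be scripted. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree module; it is not part of JPMML-SkLearn or SkLearn2PMML. The function name drop_output_elements and the toy PMML fragment are hypothetical, and the fragment is far simpler than what the converter actually emits.

```python
import xml.etree.ElementTree as ET

# Toy PMML fragment for illustration only; NOT real converter output.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel>
    <Segmentation>
      <Segment id="1">
        <TreeModel>
          <MiningSchema><MiningField name="y1" usageType="target"/></MiningSchema>
          <Output><OutputField name="probability(0)" feature="probability" value="0"/></Output>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <TreeModel>
          <MiningSchema><MiningField name="y2" usageType="target"/></MiningSchema>
          <Output><OutputField name="probability(0)" feature="probability" value="0"/></Output>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def drop_output_elements(root):
    """Remove every Output element (with its OutputField children) in place.

    Hypothetical helper: returns the number of Output elements removed.
    """
    removed = 0
    # Snapshot the elements first, because the tree is modified while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "Output":
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = drop_output_elements(root)
print(f"Removed {removed} Output element(s)")

# For a real file (file names are hypothetical):
# ET.register_namespace("", "http://www.dmg.org/PMML-4_4")  # keep the default namespace
# tree = ET.parse("model.pmml")
# drop_output_elements(tree.getroot())
# tree.write("model-no-output.pmml", xml_declaration=True, encoding="UTF-8")
```

After this, the model would still predict the target fields, but the probability output fields would no longer be available.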
Your target field names are already unique (ie. y1, y2, y3).
It would therefore make sense to disambiguate output field names using the following pattern probability(<target name>, <target category name>): probability(y1, 0), probability(y2, 0) and probability(y3, 0).
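The renaming pattern above could be applied with a short script instead of a text editor. The following is a rough sketch, not JPMML functionality: it assumes that every model element with an Output child also has a MiningSchema child declaring exactly one target field, and that the probability fields carry feature="probability" and a value attribute. The function name and the toy PMML fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy PMML fragment for illustration only; NOT real converter output.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel>
    <Segmentation>
      <Segment id="1">
        <TreeModel>
          <MiningSchema><MiningField name="y1" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <TreeModel>
          <MiningSchema><MiningField name="y2" usageType="target"/></MiningSchema>
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def disambiguate_probability_fields(root):
    """Rename probability fields to 'probability(<target>, <category>)' in place.

    Hypothetical helper: returns a list of (old_name, new_name) pairs.
    """
    renamed = []
    for model in root.iter():
        children = {local_name(child.tag): child for child in model}
        schema, output = children.get("MiningSchema"), children.get("Output")
        if schema is None or output is None:
            continue
        targets = [
            field.get("name")
            for field in schema
            if field.get("usageType") in ("target", "predicted")
        ]
        if len(targets) != 1:
            continue  # missing or ambiguous target: leave this model untouched
        for field in output:
            if local_name(field.tag) == "OutputField" and field.get("feature") == "probability":
                new_name = f"probability({targets[0]}, {field.get('value')})"
                renamed.append((field.get("name"), new_name))
                field.set("name", new_name)
    return renamed


root = ET.fromstring(SAMPLE)
for old, new in disambiguate_probability_fields(root):
    print(f"{old} -> {new}")
```

The sketch does not update anything else that refers to the old field names, so a real PMML file should be checked for such references before the result is used.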
Thank you very much. Your solution was effective and it solved my problem. May I ask if there are any other methods to achieve this goal by modifying parameters? I got the following result:
Input fields: ['g', 'd', 'e', 'a', 'j', 'c', 'h', 'b', 'f', 'i']
Target field(s): ['y1', 'y2', 'y3']
Output fields: ['probability(y1,0)', 'probability(y1,1)', 'probability(y2,0)', 'probability(y2,1)', 'probability(y3,0)', 'probability(y3,1)']
May I ask if there are any other methods to achieve this goal by modifying parameters?
This is the relevant portion of JPMML-SkLearn source code: https://github.com/jpmml/jpmml-sklearn/blob/1.7.27/pmml-sklearn/src/main/java/sklearn/multioutput/MultiOutputUtil.java#L45-L61
The problem is that the Estimator#encodeModel(Schema) Java method does not know whether it is being invoked in "single output mode" or "multi-output mode". For historical reasons, it always assumes "single output mode", and therefore generates duplicate field names.
The solution would be to send some kind of "context hint", which would then activate a different output field naming pattern.
Additionally, there could be another hint for disabling the generation of output fields in "multi output mode".
There are many potential solutions, so I will need to think a little about which one is the easiest and most long-lasting (before the actual implementation happens).
Anyway, looks like an interesting/high priority issue, which should be addressed already in the next version.
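To make the two "hints" discussed above concrete, here is a purely conceptual sketch. It is written in Python rather than Java and does not reflect the actual JPMML-SkLearn code or any decided design; the function name and the multi_output and emit_outputs parameters are hypothetical.

```python
from collections import Counter


def output_field_names(target, categories, multi_output=False, emit_outputs=True):
    """Return probability field names for one elementary classifier.

    multi_output -- hypothetical hint: include the target name in each field name.
    emit_outputs -- hypothetical hint: when False, generate no output fields at all.
    """
    if not emit_outputs:
        return []
    if multi_output:
        return [f"probability({target}, {category})" for category in categories]
    return [f"probability({category})" for category in categories]


targets = ["y1", "y2", "y3"]
categories = [0, 1]

# Without the hint, every classifier generates the same names, so they collide.
single = [name for t in targets for name in output_field_names(t, categories)]
duplicates = [name for name, count in Counter(single).items() if count > 1]

# With the hint, all six names are distinct.
multi = [
    name
    for t in targets
    for name in output_field_names(t, categories, multi_output=True)
]

print("single output mode duplicates:", duplicates)
print("multi-output mode names:", multi)
```

The formatting of the generated names (for instance, whether a space follows the comma) is also an assumption here.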
The solution would be to send some kind of "context hint", which would then activate a different output field naming pattern.
See also https://github.com/jpmml/sklearn2pmml/issues/361
Additionally, there could be another hint for disabling the generation of output fields in "multi output mode".
See also https://github.com/jpmml/jpmml-sklearn/issues/180
Anyway, looks like an interesting/high priority issue
Yep, we definitely have a cluster of related issues in this area.
Ah, it's really an interesting question. I'm looking forward to your updates. Although I'm not very proficient yet, I will try your suggestions. Also, thank you very much for your patient explanations. You are really a kind person.
My requirement is to load a multi-output classification model. I tried to understand the issue. It occurs because there are three categories of output, and each category has a probability value of 0. How can this be handled? I attempted to assign names to "y" (y1, y2, y3), but this seems to have no effect. My code is as follows:
The error message I get is as follows:
How can I solve this issue? Do you have any good suggestions? Thank you very much!