jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

Field pmml(pred) is not defined. #263

Closed PowerToThePeople111 closed 1 year ago

PowerToThePeople111 commented 1 year ago

Hi Villu,

I have a Spark pipeline that performs several data preprocessing steps followed by a final logistic regression, and I exported it successfully to a PMML file. When I load it into the Openscoring server and try to produce a prediction, I get the following message:

"Field \"pmml(pred)\" is not defined"

I read through (I guess) all of the issues in your repositories and found that this kind of message is expected when the undefined column is used in a following segment of the pipeline. But actually pred is the output of the final logistic regression.

If nothing comes to mind about why that could happen, please ignore my request. I am in a hurry work-wise, which is why I have trouble putting together a runnable example for you. I am just hoping for a pointer in the right direction.

I have ...

This is the beginning of the PMML file:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="spark" version="0.1-SNAPSHOT"/>
        <Timestamp>2023-03-28T06:50:25Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="appids_string_fixed" optype="categorical" dataType="string"/>
        <DataField name="genres_string" optype="categorical" dataType="string"/>
        <DataField name="client_string_fixed" optype="categorical" dataType="string"/>
        <DataField name="gender_string" optype="categorical" dataType="string"/>
        <DataField name="age_double" optype="continuous" dataType="double"/>
        <DataField name="apilevelage" optype="continuous" dataType="double"/>
        <DataField name="label" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
    </DataDictionary>
    <TransformationDictionary>
        <DefineFunction name="tf@1" optype="continuous" dataType="integer">
            <ParameterField name="document"/>
            <ParameterField name="term"/>
            <TextIndex textField="document">
                <FieldRef field="term"/>
            </TextIndex>
        </DefineFunction>
        <DefineFunction name="tf@2" optype="continuous" dataType="integer">
            <ParameterField name="document"/>
            <ParameterField name="term"/>
            <TextIndex textField="document">
                <FieldRef field="term"/>
            </TextIndex>
        </DefineFunction>
        <DefineFunction name="tf@3" optype="continuous" dataType="integer">
            <ParameterField name="document"/>
            <ParameterField name="term"/>
            <TextIndex textField="document">
                <FieldRef field="term"/>
            </TextIndex>
        </DefineFunction>
        <DefineFunction name="tf@5" optype="continuous" dataType="integer">
            <ParameterField name="document"/>
            <ParameterField name="term"/>
            <TextIndex textField="document">
                <FieldRef field="term"/>
            </TextIndex>
        </DefineFunction>
    </TransformationDictionary>
    <RegressionModel functionName="classification" normalizationMethod="logit">
        <MiningSchema>
            <MiningField name="label" usageType="target"/>
            <MiningField name="age_double"/>
            <MiningField name="genres_string"/>
            <MiningField name="appids_string_fixed"/>
            <MiningField name="client_string_fixed"/>
            <MiningField name="gender_string"/>
            <MiningField name="apilevelage"/>
        </MiningSchema>
        <Output>
            <OutputField name="pmml(pred)" optype="categorical" dataType="double" isFinalResult="false"/>
            <OutputField name="pred" optype="continuous" dataType="double" feature="transformedValue">
                <MapValues outputColumn="data:output" dataType="double">
                    <FieldColumnPair field="pmml(pred)" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>0</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>1</data:input>
                            <data:output>1</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </OutputField>
            <OutputField name="prob(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
            <OutputField name="prob(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <LocalTransformations>

vruusmann commented 1 year ago

"Field \"pmml(pred)\" is not defined"

The PMML conversion process is OK, the PMML document is OK, and the PMML evaluation process is OK too.

The problem is that your input data record is NOT OK - it contains one or more missing values, and therefore the logistic regression (LR) model element evaluates to a missing value.

Your LR is trying to perform extra processing on this missing prediction, and therefore fails with an error. Think of it as Java's NullPointerException: you cannot perform an operation on a null object reference.

openscoring master (also tested on 2.0.4)

IIRC, the handling of missing predictions was improved in recent JPMML-Evaluator library versions. If the LR model evaluates to a missing value, then its Output element is not evaluated at all.

If you upgrade to the latest Openscoring 2.1(.1) version, do you get a more meaningful error?
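
The difference between the two behaviors can be pictured with a toy sketch. This is not JPMML-Evaluator code: the class `MissingPredictionSketch` and its methods are hypothetical stand-ins, and the "model" simply returns a missing prediction whenever any input value is missing. The unguarded variant always runs the output step and fails on a missing prediction; the guarded variant skips the output step in that case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only; none of this is JPMML-Evaluator API.
class MissingPredictionSketch {

    // Stand-in for the model: the prediction is missing (null)
    // as soon as any input value is missing.
    static Double predict(Map<String, Object> record) {
        for (Object value : record.values()) {
            if (value == null) {
                return null;
            }
        }
        return 1.0; // arbitrary placeholder prediction
    }

    // Output step that is always evaluated: fails on a missing prediction.
    static double mapOutputUnguarded(Double prediction) {
        if (prediction == null) {
            throw new IllegalStateException("Field \"pmml(pred)\" is not defined");
        }
        return prediction;
    }

    // Output step that is skipped when the prediction is missing.
    static Double mapOutputGuarded(Double prediction) {
        return (prediction == null) ? null : mapOutputUnguarded(prediction);
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("age_double", 31.0);   // hypothetical value
        record.put("apilevelage", null);  // the missing input

        Double prediction = predict(record);
        System.out.println("guarded: " + mapOutputGuarded(prediction));
        try {
            mapOutputUnguarded(prediction);
        } catch (IllegalStateException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
    }
}
```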

vruusmann commented 1 year ago

How to debug:

  1. Extract a problematic input data record. If you compare it manually against your LR model schema, do you see any obvious problems with it? For example, are all input fields available, are their names correct, etc.
  2. Evaluate this data record in a controlled environment, with the latest JPMML-Evaluator library version. Do you get any extra information out of it then?
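
Step 1 can be roughed out in plain Java before involving any PMML library. The sketch below assumes the record is available as a `Map`; the field names are copied from the `MiningSchema` in the posted file, while the class `InputRecordCheck`, its methods, and the sample values are hypothetical. It reports active fields that are absent or null, and record keys that the schema does not know (for example, misspelled names).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; it does not use the JPMML-Evaluator API.
class InputRecordCheck {

    // Active field names copied from the MiningSchema of the posted PMML file.
    static final List<String> ACTIVE_FIELDS = Arrays.asList(
        "age_double", "genres_string", "appids_string_fixed",
        "client_string_fixed", "gender_string", "apilevelage");

    // Active fields that are absent from the record or mapped to null.
    static List<String> findMissing(Map<String, ?> record) {
        List<String> missing = new ArrayList<>();
        for (String name : ACTIVE_FIELDS) {
            if (record.get(name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Record keys that the schema does not know, e.g. misspelled names.
    static List<String> findUnknown(Map<String, ?> record) {
        List<String> unknown = new ArrayList<>();
        for (String name : record.keySet()) {
            if (!ACTIVE_FIELDS.contains(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Hypothetical record with one misspelled key.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("age_double", 31.0);
        record.put("gender_string", "m");
        record.put("apiLevelAge", 3.0);

        System.out.println("Missing: " + findMissing(record));
        System.out.println("Unknown: " + findUnknown(record));
    }
}
```

Any name reported as missing is a candidate for the input that makes the LR evaluate to a missing value.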

PowerToThePeople111 commented 1 year ago

Thank you so much!

The problem was not with the model file itself but with a feature that I put into it.

vruusmann commented 1 year ago

The problem was not with the model file itself but with a feature that I put into it.

Just remembered that the Openscoring service is supposed to issue a warning when it encounters a missing input value: https://github.com/openscoring/openscoring/blob/2.1.1/openscoring-service/src/main/java/org/openscoring/service/ModelResource.java#L467-L470

Check your log file!

PowerToThePeople111 commented 1 year ago

Hey,

I did not see that message, but that is my own fault: I had rewritten and added functions in the ModelResource class, and the part that performs the check was commented out. That was fine for the first iteration of my models, where I tailored the modifications to their needs. Now that the models have changed, I ran into this problem.

If I had kept the logic that you added there, I would not have run into this issue.