jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

The value for field "gbtValue" is not defined error #59

Closed: yairdata closed this issue 5 years ago

yairdata commented 5 years ago

I am running a GBTClassifier pipeline in PySpark. I was able to produce the PMML file, but testing it failed with the above error. There is no input field with this name, and in the PMML the gbtValue field is defined as:

I supplied JSON input with valid values (taken from the PMML itself). What could be the reason for this error?

vruusmann commented 5 years ago

Something like this has been reported and analyzed/explained before. You should search closed issues of this project, or the Pyspark2PMML project.

Off the top of my head - you have a model chain (one model executed after another), and the first model is returning a missing prediction, so the value of the gbtValue output field remains undefined, and the second model (which requires it as input) then complains/fails.
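To make the failure mode concrete, here is a toy sketch in plain Python. It is not how JPMML evaluates a model chain; the function names and the list of required inputs are hypothetical, and only the `gbtValue` field name and the error text come from this issue. It only illustrates how a missing prediction in the first step surfaces as an "undefined field" error in the second step.

```python
from typing import Optional


def first_model(record: dict) -> Optional[float]:
    """Toy stand-in for the first model in the chain.

    Returns None (a missing prediction) when a required input is absent.
    """
    required = ("FIELD_A", "FIELD_B")  # hypothetical required inputs
    if any(record.get(name) is None for name in required):
        return None
    return 1.0 if record["FIELD_A"] == "Y" else -1.0


def second_model(fields: dict) -> float:
    """Toy stand-in for the second model, which needs gbtValue as input."""
    value = fields.get("gbtValue")
    if value is None:
        raise ValueError('The value for field "gbtValue" is not defined')
    return 2.0 * value


def evaluate_chain(record: dict) -> float:
    """Run the two toy models one after another, like a model chain."""
    fields = dict(record)
    fields["gbtValue"] = first_model(record)  # may be None
    return second_model(fields)
```

Calling `evaluate_chain` with a complete record returns a number; dropping `FIELD_B` from the record makes the second step raise, even though `gbtValue` was never something the caller was supposed to supply.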

yairdata commented 5 years ago

Sounds logical, but I don't have 2 models... I don't see any evidence of such an error in pyspark2pmml. My pipeline is:

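from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import OneHotEncoderEstimator, SQLTransformer, StringIndexer, VectorAssembler

# stages, categoricalCols, continuousCols, date_columns and sec_alert are defined earlier (not shown)
for categoricalCol in categoricalCols:
    # Index each categorical column, then one-hot encode the index
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "hashed", handleInvalid="keep")
    stages += [stringIndexer]
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [encoder]

# Encoded categorical columns plus all continuous columns that are not dates
HashedInputs = [c + "classVec" for c in categoricalCols] + [d for d in continuousCols if d not in date_columns]
assembler = VectorAssembler(inputCols=HashedInputs, outputCol="features")
stages += [assembler]

fieldsSelector = SQLTransformer(
    statement="SELECT *, {} AS seclabel FROM __THIS__".format(sec_alert))
stages += [fieldsSelector]

gbt = GBTClassifier(featuresCol="features", maxBins=10, maxDepth=3, maxIter=1)
stages += [gbt]

pipeline = Pipeline(stages=stages)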

The generated PMML is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.3">
<Header>
    <Application name="JPMML-SparkML" version="1.4.7"/>
    <Timestamp>2019-03-26T09:43:41Z</Timestamp>
</Header>
<DataDictionary>
    <DataField name="XB16" optype="categorical" dataType="string">
        <Value value="EMPTY"/>
        <Value value="0"/>
        <Value value="1"/>
        <Value value="__unknown" property="invalid"/>
    </DataField>
    <DataField name="T_TYPE" optype="categorical" dataType="string">
        <Value value="A"/>
        <Value value="B"/>
        <Value value="C"/>
        <Value value="A_A"/>
        <Value value="C_COVER"/>
        <Value value="D"/>
        <Value value="E"/>
        <Value value="F"/>
        <Value value="__unknown" property="invalid"/>
    </DataField>
    <DataField name="FIELD_A" optype="categorical" dataType="string">
        <Value value="N"/>
        <Value value="Y"/>
        <Value value="EMPTY"/>
        <Value value="__unknown" property="invalid"/>
    </DataField>
    <DataField name="FIELD_B" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="double">
        <Value value="0"/>
        <Value value="1"/>
    </DataField>
</DataDictionary>
<MiningModel functionName="classification">
    <MiningSchema>
        <MiningField name="label" usageType="target"/>
        <MiningField name="FIELD_A" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
        <MiningField name="FIELD_B"/>
        <MiningField name="T_TYPE" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
        <MiningField name="XB16" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="modelChain">
        <Segment id="1">
            <True/>
            <MiningModel functionName="regression">
                <MiningSchema>
                    <MiningField name="FIELD_A"/>
                    <MiningField name="FIELD_B"/>
                    <MiningField name="T_TYPE"/>
                    <MiningField name="XB16"/>
                </MiningSchema>
                <Output>
                    <OutputField name="gbtValue" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
                </Output>
                <Segmentation multipleModelMethod="x-weightedSum">
                    <Segment id="1">
                        <True/>
                        <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit">
                            <MiningSchema>
                                <MiningField name="XB16"/>
                                <MiningField name="T_TYPE"/>
                                <MiningField name="FIELD_A"/>
                                <MiningField name="FIELD_B"/>
                            </MiningSchema>
                            <Node score="-0.9524188958451907">
                                <True/>
                                <Node score="1.0">
                                    <SimplePredicate field="FIELD_A" operator="equal" value="Y"/>
                                </Node>
                                <Node score="-1.0">
                                    <SimplePredicate field="FIELD_B" operator="lessOrEqual" value="5000.0"/>
                                    <Node score="0.933649289099526">
                                        <SimplePredicate field="XB16" operator="equal" value="EMPTY"/>
                                    </Node>
                                </Node>
                                <Node score="0.34146341463414637">
                                    <SimplePredicate field="T_TYPE" operator="equal" value="D"/>
                                </Node>
                            </Node>
                        </TreeModel>
                    </Segment>
                </Segmentation>
            </MiningModel>
        </Segment>
        <Segment id="2">
            <True/>
            <RegressionModel functionName="classification" normalizationMethod="logit">
                <MiningSchema>
                    <MiningField name="label" usageType="target"/>
                    <MiningField name="gbtValue"/>
                </MiningSchema>
                <Output>
                    <OutputField name="pmml(prediction)" optype="categorical" dataType="double" feature="predictedValue"/>
                    <OutputField name="prediction" optype="categorical" dataType="double" feature="transformedValue">
                        <MapValues outputColumn="data:output">
                            <FieldColumnPair field="pmml(prediction)" column="data:input"/>
                            <InlineTable>
                                <row>
                                    <data:input>0</data:input>
                                    <data:output>0</data:output>
                                </row>
                                <row>
                                    <data:input>1</data:input>
                                    <data:output>1</data:output>
                                </row>
                            </InlineTable>
                        </MapValues>
                    </OutputField>
                    <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
                    <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
                </Output>
                <RegressionTable intercept="0.0" targetCategory="1">
                    <NumericPredictor name="gbtValue" coefficient="2.0"/>
                </RegressionTable>
                <RegressionTable intercept="0.0" targetCategory="0"/>
            </RegressionModel>
        </Segment>
    </Segmentation>
</MiningModel>
</PMML>

vruusmann commented 5 years ago

> but i don't have 2 models...

But you do - the top-level MiningModel element contains a Segmentation with two Segment child elements. The model in the first segment produces a missing prediction, and the model in the second segment complains about it.
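One way to see the two chained models for yourself is to list the top-level segments with Python's standard `xml.etree.ElementTree`. This is only a sketch: `PMML_SNIPPET` is a heavily trimmed copy of the PMML posted above (segment structure only), and `list_chain_segments` is a hypothetical helper name, not part of any JPMML tool. It assumes each Segment holds exactly one child element whose name ends in "Model".

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

# Trimmed copy of the PMML from this issue, keeping only the segment structure.
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <MiningModel functionName="classification">
    <Segmentation multipleModelMethod="modelChain">
      <Segment id="1">
        <True/>
        <MiningModel functionName="regression"/>
      </Segment>
      <Segment id="2">
        <True/>
        <RegressionModel functionName="classification" normalizationMethod="logit"/>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def list_chain_segments(pmml_text):
    """Return (segment id, model element name, functionName) per top-level segment."""
    root = ET.fromstring(pmml_text)
    segmentation = root.find("pmml:MiningModel/pmml:Segmentation", NS)
    result = []
    for segment in segmentation.findall("pmml:Segment", NS):
        # Skip the predicate (<True/>) and pick the model element.
        model = next(child for child in segment if child.tag.endswith("Model"))
        local_name = model.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        result.append((segment.get("id"), local_name, model.get("functionName")))
    return result


print(list_chain_segments(PMML_SNIPPET))
# [('1', 'MiningModel', 'regression'), ('2', 'RegressionModel', 'classification')]
```

Segment 1 is the tree ensemble that outputs `gbtValue`; segment 2 is the logit RegressionModel that consumes it.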

TL;DR: Find out why the model in the first segment is making a missing prediction. 99% of the time you're simply omitting some required input field value.
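A quick way to check for an omitted input is to compare the keys of the input record against the active fields of the top-level MiningSchema. The sketch below uses only the Python standard library. `PMML_SNIPPET` is a trimmed copy of the schema posted above, the helper names are hypothetical, and the example `record` is made up for illustration since the actual JSON input was not posted. Note that the TreeModel above declares `missingValueStrategy="nullPrediction"`, so a missing value in any field the tree actually tests is enough to produce a missing prediction.

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

# Trimmed copy of the top-level MiningSchema from this issue.
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="target"/>
      <MiningField name="FIELD_A"/>
      <MiningField name="FIELD_B"/>
      <MiningField name="T_TYPE"/>
      <MiningField name="XB16"/>
    </MiningSchema>
  </MiningModel>
</PMML>"""


def required_input_fields(pmml_text):
    """Names of active (non-target) fields in the top-level MiningSchema."""
    root = ET.fromstring(pmml_text)
    schema = root.find("pmml:MiningModel/pmml:MiningSchema", NS)
    return [
        field.get("name")
        for field in schema.findall("pmml:MiningField", NS)
        # usageType defaults to "active" when the attribute is absent
        if field.get("usageType", "active") == "active"
    ]


def missing_inputs(pmml_text, record):
    """Active fields that are absent (or null) in the input record."""
    return [name for name in required_input_fields(pmml_text) if record.get(name) is None]


# Hypothetical input record that forgets FIELD_B
record = {"FIELD_A": "N", "T_TYPE": "D", "XB16": "EMPTY"}
print(missing_inputs(PMML_SNIPPET, record))  # ['FIELD_B']
```

Field names are compared exactly, so a key such as `field_b` or ` FIELD_B` in the JSON would also show up as missing.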