Java library and command-line application for converting Apache Spark ML pipelines to PMML
The value for field "gbtValue" is not defined error #59

yairdata commented 5 years ago

I am running a GBTClassifier pipeline in PySpark. I was able to produce the PMML file, but testing it failed with the above error. There is no input field with this name, and in the PMML the gbtValue is defined as:

I supplied JSON input with valid values (taken from the PMML itself). What could be the reason for this error?

vruusmann commented 5 years ago

Something like this has been reported and analyzed/explained before. You should search closed issues of this project, or the Pyspark2PMML project.

Off the top of my head: you have a model chain (one model executed after another), and the first model is returning a missing prediction, so the value of the gbtValue output field remains undefined, and the second model (which requires it as input) then complains/fails.
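To make the failure mode concrete, here is a minimal Python sketch of a two-step model chain. It is not JPMML code: `first_model`, `second_model`, `evaluate_chain` and the input field `x` are hypothetical names, and only the intermediate field name and the error text are taken from this issue.

```python
def first_model(record):
    """Stand-in for the first model in the chain.

    Returns None (a missing prediction) when its hypothetical input "x" is absent.
    """
    if record.get("x") is None:
        return None
    return 0.5 * record["x"]


def second_model(fields):
    """Stand-in for the second model, which needs the intermediate field."""
    if fields.get("gbtValue") is None:
        raise ValueError('The value for field "gbtValue" is not defined')
    return 1 if fields["gbtValue"] > 0 else 0


def evaluate_chain(record):
    # The output of the first model becomes an input field of the second one
    fields = dict(record)
    fields["gbtValue"] = first_model(record)
    return second_model(fields)


print(evaluate_chain({"x": 1.0}))  # complete input: the chain succeeds

try:
    evaluate_chain({})  # "x" omitted: first model yields None, second model fails
except ValueError as err:
    print(err)
```

The error names the intermediate field, not the input that was actually missing, which is why it looks unrelated to the supplied data.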

yairdata commented 5 years ago

Sounds logical, but I don't have 2 models... I don't see any evidence of such an error in pyspark2pmml. My pipeline is:

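from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import OneHotEncoderEstimator, SQLTransformer, StringIndexer, VectorAssembler

# stages, categoricalCols, continuousCols, date_columns and sec_alert are defined earlier (not shown)
for categoricalCol in categoricalCols:
    # Index each categorical column; handleInvalid="keep" puts unseen values in an extra bucket
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "hashed", handleInvalid="keep")
    stages += [stringIndexer]
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [encoder]

# Encoded categorical columns plus the continuous columns that are not dates
HashedInputs = [c + "classVec" for c in categoricalCols] + [d for d in continuousCols if d not in date_columns]
assembler = VectorAssembler(inputCols=HashedInputs, outputCol="features")
stages += [assembler]

fieldsSelector = SQLTransformer(
    statement="SELECT *, {} AS seclabel FROM __THIS__".format(sec_alert))
stages += [fieldsSelector]

gbt = GBTClassifier(featuresCol="features", maxBins=10, maxDepth=3, maxIter=1)
stages += [gbt]

pipeline = Pipeline(stages=stages)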

the generated pmml is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <PMML xmlns="" xmlns:data=" 
              model/InlineTable" version="4.3">
    <Application name="JPMML-SparkML" version="1.4.7"/>
    <DataField name="XB16" optype="categorical" dataType="string">
        <Value value="EMPTY"/>
        <Value value="0"/>
        <Value value="1"/>
        <Value value="__unknown" property="invalid"/>
    <DataField name="T_TYPE" optype="categorical" dataType="string">
        <Value value="A"/>
        <Value value="B"/>
        <Value value="C"/>
        <Value value="A_A"/>
        <Value value="C_COVER"/>
        <Value value="D"/>
        <Value value="E"/>
        <Value value="F"/>
        <Value value="__unknown" property="invalid"/>
    <DataField name="FIELD_A" optype="categorical" dataType="string">
        <Value value="N"/>
        <Value value="Y"/>
        <Value value="EMPTY"/>
        <Value value="__unknown" property="invalid"/>
    <DataField name="FIELD_B" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="double">
        <Value value="0"/>
        <Value value="1"/>
<MiningModel functionName="classification">
        <MiningField name="label" usageType="target"/>
        <MiningField name="FIELD_A" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
        <MiningField name="FIELD_B"/>
        <MiningField name="T_TYPE" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
        <MiningField name="XB16" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
    <Segmentation multipleModelMethod="modelChain">
        <Segment id="1">
            <MiningModel functionName="regression">
                    <MiningField name="FIELD_A"/>
                    <MiningField name="FIELD_B"/>
                    <MiningField name="T_TYPE"/>
                    <MiningField name="XB16"/>
                    <OutputField name="gbtValue" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
                <Segmentation multipleModelMethod="x-weightedSum">
                    <Segment id="1">
                        <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit">
                                <MiningField name="XB16"/>
                                <MiningField name="T_TYPE"/>
                                <MiningField name="FIELD_A"/>
                                <MiningField name="FIELD_B"/>
                            <Node score="-0.9524188958451907">
                                <Node score="1.0">
                                    <SimplePredicate field="FIELD_A" operator="equal" value="Y"/>
                                <Node score="-1.0">
                                    <SimplePredicate field="FIELD_B" operator="lessOrEqual" value="5000.0"/>
                                    <Node score="0.933649289099526">
                                        <SimplePredicate field="XB16" operator="equal" value="EMPTY"/>
                                <Node score="0.34146341463414637">
                                    <SimplePredicate field="T_TYPE" operator="equal" value="D"/>
        <Segment id="2">
            <RegressionModel functionName="classification" normalizationMethod="logit">
                    <MiningField name="label" usageType="target"/>
                    <MiningField name="gbtValue"/>
                    <OutputField name="pmml(prediction)" optype="categorical" dataType="double" feature="predictedValue"/>
                    <OutputField name="prediction" optype="categorical" dataType="double" feature="transformedValue">
                        <MapValues outputColumn="data:output">
                            <FieldColumnPair field="pmml(prediction)" column="data:input"/>
                    <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
                    <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
                <RegressionTable intercept="0.0" targetCategory="1">
                    <NumericPredictor name="gbtValue" coefficient="2.0"/>
                <RegressionTable intercept="0.0" targetCategory="0"/>

vruusmann commented 5 years ago

"but i don't have 2 models..."

But you do - the MiningModel element has two Segment child elements; the first segmentation model produces a missing prediction, and the second segmentation model complains about it.
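As a rough illustration of how the first segment can end up with a missing prediction, here is a Python sketch of a tree evaluated with `missingValueStrategy="nullPrediction"` and `noTrueChildStrategy="returnLastPrediction"`, followed by the logit step of the second segment. The tree is a simplified, hypothetical two-child version of the one posted above (same field names, rounded root score), and `simple_predicate`, `tree_score` and `probability_of_1` are illustrative names, not part of any library.

```python
import math


def simple_predicate(record, field, operator, value):
    """Return True/False, or None (UNKNOWN) when the field is missing."""
    actual = record.get(field)
    if actual is None:
        return None
    if operator == "equal":
        return actual == value
    if operator == "lessOrEqual":
        return actual <= value
    raise ValueError("unsupported operator: " + operator)


# Hypothetical, simplified tree: a root score and two child nodes tested in order
ROOT_SCORE = -0.95
CHILDREN = [
    ("FIELD_A", "equal", "Y", 1.0),
    ("FIELD_B", "lessOrEqual", 5000.0, -1.0),
]


def tree_score(record):
    for field, operator, value, score in CHILDREN:
        outcome = simple_predicate(record, field, operator, value)
        if outcome is None:
            return None  # nullPrediction: an UNKNOWN predicate voids the result
        if outcome:
            return score
    return ROOT_SCORE  # returnLastPrediction: no child matched


def probability_of_1(gbt_value):
    # Second segment: logit normalization with coefficient 2.0 and intercept 0.0
    return 1.0 / (1.0 + math.exp(-2.0 * gbt_value))


print(tree_score({"FIELD_A": "Y"}))                     # FIELD_B is never consulted
print(tree_score({"FIELD_A": "N"}))                     # FIELD_B missing: None
print(tree_score({"FIELD_A": "N", "FIELD_B": 9000.0}))  # falls back to the root score
```

Under these assumptions a missing field only matters if the traversal actually reaches a predicate that uses it, so the same incomplete record can score fine for one tree and yield a missing prediction for another.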

TL;DR: Find out why the first segment is making a missing prediction. 99% of the time you're simply omitting some required input field value.
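One way to check for an omitted input is to compare the keys of the JSON record against the active fields of the top-level MiningSchema. The sketch below uses only the Python standard library; the inline PMML fragment is a cut-down, well-formed stand-in for the file above (assuming the standard PMML 4.3 namespace), and `required_inputs`, `missing_inputs` and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the generated PMML; only the top-level MiningSchema matters here
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="target"/>
      <MiningField name="FIELD_A"/>
      <MiningField name="FIELD_B"/>
      <MiningField name="T_TYPE"/>
      <MiningField name="XB16"/>
    </MiningSchema>
  </MiningModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def required_inputs(pmml_text):
    """Names of the active (non-target) fields in the top-level MiningSchema."""
    root = ET.fromstring(pmml_text)
    schema = root.find("pmml:MiningModel/pmml:MiningSchema", NS)
    return [
        field.get("name")
        for field in schema.findall("pmml:MiningField", NS)
        if field.get("usageType", "active") == "active"
    ]


def missing_inputs(pmml_text, record):
    """Active fields that are absent from the record or set to None."""
    return [name for name in required_inputs(pmml_text) if record.get(name) is None]


# Hypothetical test record that forgets XB16
record = {"FIELD_A": "Y", "FIELD_B": 1200.0, "T_TYPE": "D"}
print(missing_inputs(PMML_SNIPPET, record))
```

Field names are compared exactly, so a key that differs only in case or spelling from the MiningField name is reported as missing too.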