Closed by yairdata 5 years ago
Something like this has been reported and analyzed/explained before. You should search closed issues of this project, or the Pyspark2PMML project.
Off the top of my head - you have a model chain (one model executed after another), and the first model is returning a missing prediction, so the value of the gbtValue output field remains undefined, and the second model (which requires it as input) then complains/fails.
Sounds logical, but I don't have 2 models... I don't see any evidence of such an error in pyspark2pmml. My pipeline is:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import OneHotEncoderEstimator, SQLTransformer, StringIndexer, VectorAssembler

# Index and one-hot encode every categorical column; unseen categories are kept.
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "hashed", handleInvalid="keep")
    stages += [stringIndexer]
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [encoder]

# Assemble the encoded categoricals plus the continuous columns that are not dates.
HashedInputs = [c + "classVec" for c in categoricalCols] + [d for d in continuousCols if d not in date_columns]
assembler = VectorAssembler(inputCols=HashedInputs, outputCol="features")
stages += [assembler]
fieldsSelector = SQLTransformer(
    statement="SELECT *, {} AS seclabel FROM __THIS__".format(sec_alert))
stages += [fieldsSelector]
gbt = GBTClassifier(featuresCol="features", maxBins=10, maxDepth=3, maxIter=1)
stages += [gbt]
pipeline = Pipeline(stages=stages)
The generated PMML is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.3">
<Header>
<Application name="JPMML-SparkML" version="1.4.7"/>
<Timestamp>2019-03-26T09:43:41Z</Timestamp>
</Header>
<DataDictionary>
<DataField name="XB16" optype="categorical" dataType="string">
<Value value="EMPTY"/>
<Value value="0"/>
<Value value="1"/>
<Value value="__unknown" property="invalid"/>
</DataField>
<DataField name="T_TYPE" optype="categorical" dataType="string">
<Value value="A"/>
<Value value="B"/>
<Value value="C"/>
<Value value="A_A"/>
<Value value="C_COVER"/>
<Value value="D"/>
<Value value="E"/>
<Value value="F"/>
<Value value="__unknown" property="invalid"/>
</DataField>
<DataField name="FIELD_A" optype="categorical" dataType="string">
<Value value="N"/>
<Value value="Y"/>
<Value value="EMPTY"/>
<Value value="__unknown" property="invalid"/>
</DataField>
<DataField name="FIELD_B" optype="continuous" dataType="double"/>
<DataField name="label" optype="categorical" dataType="double">
<Value value="0"/>
<Value value="1"/>
</DataField>
</DataDictionary>
<MiningModel functionName="classification">
<MiningSchema>
<MiningField name="label" usageType="target"/>
<MiningField name="FIELD_A" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
<MiningField name="FIELD_B"/>
<MiningField name="T_TYPE" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
<MiningField name="XB16" x-invalidValueReplacement="__unknown" invalidValueTreatment="asIs"/>
</MiningSchema>
<Segmentation multipleModelMethod="modelChain">
<Segment id="1">
<True/>
<MiningModel functionName="regression">
<MiningSchema>
<MiningField name="FIELD_A"/>
<MiningField name="FIELD_B"/>
<MiningField name="T_TYPE"/>
<MiningField name="XB16"/>
</MiningSchema>
<Output>
<OutputField name="gbtValue" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
</Output>
<Segmentation multipleModelMethod="x-weightedSum">
<Segment id="1">
<True/>
<TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit">
<MiningSchema>
<MiningField name="XB16"/>
<MiningField name="T_TYPE"/>
<MiningField name="FIELD_A"/>
<MiningField name="FIELD_B"/>
</MiningSchema>
<Node score="-0.9524188958451907">
<True/>
<Node score="1.0">
<SimplePredicate field="FIELD_A" operator="equal" value="Y"/>
</Node>
<Node score="-1.0">
<SimplePredicate field="FIELD_B" operator="lessOrEqual" value="5000.0"/>
<Node score="0.933649289099526">
<SimplePredicate field="XB16" operator="equal" value="EMPTY"/>
</Node>
</Node>
<Node score="0.34146341463414637">
<SimplePredicate field="T_TYPE" operator="equal" value="D"/>
</Node>
</Node>
</TreeModel>
</Segment>
</Segmentation>
</MiningModel>
</Segment>
<Segment id="2">
<True/>
<RegressionModel functionName="classification" normalizationMethod="logit">
<MiningSchema>
<MiningField name="label" usageType="target"/>
<MiningField name="gbtValue"/>
</MiningSchema>
<Output>
<OutputField name="pmml(prediction)" optype="categorical" dataType="double" feature="predictedValue"/>
<OutputField name="prediction" optype="categorical" dataType="double" feature="transformedValue">
<MapValues outputColumn="data:output">
<FieldColumnPair field="pmml(prediction)" column="data:input"/>
<InlineTable>
<row>
<data:input>0</data:input>
<data:output>0</data:output>
</row>
<row>
<data:input>1</data:input>
<data:output>1</data:output>
</row>
</InlineTable>
</MapValues>
</OutputField>
<OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
<OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
</Output>
<RegressionTable intercept="0.0" targetCategory="1">
<NumericPredictor name="gbtValue" coefficient="2.0"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="0"/>
</RegressionModel>
</Segment>
</Segmentation>
</MiningModel>
but i don't have 2 models...
But you do - the MiningModel element has two Segment child elements; the first segmentation model produces a missing prediction, and the second segmentation model complains about it.
TLDR: Find out why the first segment is making a missing prediction. 99% of the time you're simply omitting some required input field value.
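To make the failure mode concrete, here is a rough Python sketch of the two-segment chain from the PMML above. It is a hand-translated simplification, not JPMML code: the function names (`first_segment`, `second_segment`, `score_chain`) and the error message are made up for illustration. It assumes that a missing input makes a predicate UNKNOWN, which under missingValueStrategy="nullPrediction" aborts the tree with no prediction.

```python
import math


def _equal(record, field, value):
    """'equal' predicate; returns None (UNKNOWN) when the input is missing."""
    v = record.get(field)
    return None if v is None else v == value


def _less_or_equal(record, field, threshold):
    """'lessOrEqual' predicate; returns None (UNKNOWN) when the input is missing."""
    v = record.get(field)
    return None if v is None else v <= threshold


def first_segment(record):
    """Segment 1: the single regression tree that produces gbtValue.

    nullPrediction: an UNKNOWN predicate yields a missing result (None).
    returnLastPrediction: if no child matches, the parent's score is used.
    """
    root_score = -0.9524188958451907
    hit = _equal(record, "FIELD_A", "Y")
    if hit is None:
        return None
    if hit:
        return 1.0
    hit = _less_or_equal(record, "FIELD_B", 5000.0)
    if hit is None:
        return None
    if hit:
        child = _equal(record, "XB16", "EMPTY")
        if child is None:
            return None
        return 0.933649289099526 if child else -1.0
    hit = _equal(record, "T_TYPE", "D")
    if hit is None:
        return None
    return 0.34146341463414637 if hit else root_score


def second_segment(gbt_value):
    """Segment 2: logit-normalized regression on gbtValue (coefficient 2.0)."""
    if gbt_value is None:
        # Hypothetical message; a real evaluator words this differently.
        raise ValueError("required field gbtValue is missing")
    p1 = 1.0 / (1.0 + math.exp(-2.0 * gbt_value))
    return {"probability(0)": 1.0 - p1, "probability(1)": p1}


def score_chain(record):
    """modelChain: feed the output of segment 1 into segment 2."""
    return second_segment(first_segment(record))


if __name__ == "__main__":
    print(score_chain({"FIELD_A": "N", "FIELD_B": 100.0, "XB16": "0", "T_TYPE": "A"}))
    print(first_segment({"FIELD_B": 100.0, "XB16": "0", "T_TYPE": "A"}))  # FIELD_A omitted
```

With every field supplied the chain returns probabilities. Drop FIELD_A (or misspell its key) and the first segment returns nothing, so the second one has no gbtValue to work with.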
I am running a GBTClassifier pipeline in PySpark. I was able to produce the PMML file, but when testing it, it failed with the above error. There is no input field with this name, and in the PMML the gbtValue is defined as:
I have supplied JSON input with valid inputs (taken from the PMML itself). What could be the reason for this error?
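One quick way to check the JSON input against the model is to list the active fields in the top-level MiningSchema and see which of them the record leaves out. The sketch below uses only the Python standard library. The helper names (`active_fields`, `missing_inputs`) and the trimmed-down `PMML_SNIPPET` are assumptions for illustration, so point it at the real file instead.

```python
import json
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

# Trimmed-down stand-in for the generated PMML (top-level MiningSchema only).
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="label" usageType="target"/>
      <MiningField name="FIELD_A"/>
      <MiningField name="FIELD_B"/>
      <MiningField name="T_TYPE"/>
      <MiningField name="XB16"/>
    </MiningSchema>
  </MiningModel>
</PMML>"""


def active_fields(pmml_text):
    """Names of the top-level MiningFields that the model expects as input."""
    root = ET.fromstring(pmml_text)
    schema = root.find("pmml:MiningModel/pmml:MiningSchema", NS)
    return [
        field.get("name")
        for field in schema.findall("pmml:MiningField", NS)
        if field.get("usageType", "active") == "active"
    ]


def missing_inputs(pmml_text, record):
    """Active fields that are absent (or null) in the input record."""
    return [name for name in active_fields(pmml_text) if record.get(name) is None]


if __name__ == "__main__":
    record = json.loads('{"FIELD_A": "Y", "T_TYPE": "A", "XB16": "0"}')
    print(missing_inputs(PMML_SNIPPET, record))
```

Field names are case-sensitive, so a key such as field_b would also show up as missing here.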