jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
GNU Affero General Public License v3.0

Problem with parsing GBDT models #25

Closed HongHuangNeu closed 5 years ago

HongHuangNeu commented 5 years ago

I tried to construct a GBDT model in Spark from a PMML file describing a GBDT model, and I got the following error:

"caused by: org.shaded.jpmml.evaluator.MissingFieldException: Field 'decisionFunction(1.0)' is not defined". What are the possible things I need to check?

Thanks!

vruusmann commented 5 years ago

"Field 'decisionFunction(1.0)' is not defined"

You should cast the data type of the label column from double (0.0/1.0) to either integer (0/1) or string ("0"/"1"), and re-train the GBDT model.

It looks like a target category formatting problem in some JPMML conversion library. Which ML framework/library were you using - JPMML-SparkML, JPMML-SkLearn, or something else?

TLDR: The correct name of the field should be decisionFunction(1) here.

HongHuangNeu commented 5 years ago

I used sklearn2pmml to generate gbdt models

HongHuangNeu commented 5 years ago

The label definition in the DataDictionary part of my PMML is the following:
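(The original snippet was swallowed by the renderer; the following is a reconstruction inferred from the question below - the field name "typ2", the double data type, and the "0.0"/"1.0" values are assumptions based on the surrounding discussion:)

<DataField name="typ2" optype="categorical" dataType="double">
  <Value value="0.0"/>
  <Value value="1.0"/>
</DataField>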

So is the "typ2" column in double format, as specified in the snippet? (I ask because "1.0" and "0.0" seem to be represented in string format.)

vruusmann commented 5 years ago

I used sklearn2pmml to generate gbdt models

In that case, cast the data type of the y variable from double to integer:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline(...)
pipeline.fit(X, y.astype(int))

Does that fix your problem? If re-training is not an option, then you may try replacing all occurrences of decisionFunction(1.0) with decisionFunction(1).

However, I'm quite surprised that the SkLearn2PMML/JPMML-SkLearn stack has produced such an invalid PMML document. It should be performing full field name/scope resolution during conversion. Or perhaps you have changed something about this particular PMML document manually?

HongHuangNeu commented 5 years ago

No manual change before that.

Just now I tried changing the decision function input. After changing 1.0 to 1, I made two more changes to fix follow-up errors: (1) changed the "targetCategory" of the RegressionTable element from '1.0' to '1', and '0.0' to '0'; (2) changed the dataType of the label column in the DataDictionary from double to integer. Now I got this error: "Field 'decisionFunction(1)' is not defined".

Are there any other related parts I need to fix?

vruusmann commented 5 years ago

(1) changed the "targetCategory" of the RegressionTable element from '1.0' to '1', and '0.0' to '0'

The values of the RegressionTable@targetCategory attribute must exactly match the values of the DataField/Value@value attribute.
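For example, with an integer label the two declarations should line up like this (a hedged sketch; the field name and values are illustrative, not taken from your document):

<DataField name="typ2" optype="categorical" dataType="integer">
  <Value value="0"/>
  <Value value="1"/>
</DataField>

<!-- elsewhere, in the RegressionModel; targetCategory must match a Value@value above -->
<RegressionTable intercept="0.0" targetCategory="1"/>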

Now I got this error: "Field 'decisionFunction(1)' is not defined"

This field is originally declared as some OutputField element. Assuming a binary classification GBDT model, there should be exactly one such field in that PMML document.
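In JPMML-generated GBDT documents the segments are typically chained: the first segment declares the decision function as an output field, and a downstream regression segment consumes it by name. Roughly like this (a hedged, simplified sketch; element names and values are illustrative, not copied from your document):

<Segmentation multipleModelMethod="modelChain">
  <Segment id="1">
    <MiningModel functionName="regression">
      <Output>
        <OutputField name="decisionFunction(1)" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
      </Output>
      <!-- Segmentation summing the individual regression trees goes here -->
    </MiningModel>
  </Segment>
  <Segment id="2">
    <RegressionModel functionName="classification" normalizationMethod="logit">
      <RegressionTable intercept="0.0" targetCategory="1">
        <NumericPredictor name="decisionFunction(1)" coefficient="1.0"/>
      </RegressionTable>
      <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
  </Segment>
</Segmentation>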

vruusmann commented 5 years ago

As a general comment, I recall a similar "field not found" exception being reported against one of the SkLearn2PMML or JPMML-SkLearn projects in the past. Moreover, I even remember fixing it.

What's your SkLearn2PMML package version? Maybe you're running some outdated version?

Any chance you can provide a reproducible example (a Python script plus a CSV input file) for generating such a broken PMML document?

HongHuangNeu commented 5 years ago

My sklearn2pmml version is 0.39.0, Python version is 3.5.2, and sklearn version is 0.20.1.

HongHuangNeu commented 5 years ago

I am afraid I cannot offer a full example because of my client's regulations. I will try to work up a toy example to reproduce it.

A strange thing: when I switch to Python 2.7, I can generate a PMML file with a GBDT model which contains a double-type label, and this file can be correctly parsed into a GBDT model by the Spark JPMML evaluator. Attached is my PMML file.

result.txt

So what is the problem here?

vruusmann commented 5 years ago

When I switch to Python 2.7, I can generate a PMML file with a GBDT model which contains a double-type label.

It must be that Python 2.X uses different Pandas/NumPy/pickle package versions than Python 3.X, and these different package versions take care of the double-to-integer conversion automatically.

In your public result.txt file the decisionFunction(1.0) field is defined on line 84, and then referenced exactly once on line 94. In your failing file, does the field resolution exception ("decisionFunction(1.0) is not defined") also happen in the same place (i.e. inside the transformedDecisionFunction(1.0) output field declaration)?

HongHuangNeu commented 5 years ago

The error message does not point out which line of my file is wrong. The actual exception is "Failed to execute user defined function (blablabla...)". In the stack trace of the exception I found that it was caused by the "Field 'decisionFunction(1)' is not defined" exception.

HongHuangNeu commented 5 years ago

By the way, my jpmml-evaluator-spark version is 1.2.0.

HongHuangNeu commented 5 years ago

@vruusmann Sent you an email describing the error in detail.

HongHuangNeu commented 5 years ago

I will summarize the stack trace as follows:

org.apache.spark.SparkException: Failed to execute user defined function ... (with mismatched data types)

caused by: org.shaded.jpmml.evaluator.MissingFieldException: Field 'decisionFunction(1.0)' is not defined
                at org.shaded.jpmml.evaluator.EvaluationContext.lookup(EvaluationContext.java:64)

which is from https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/EvaluationContext.java#L64

Do you have any idea what could trigger this kind of exception?

vruusmann commented 5 years ago

Got your e-mail with the screenshot of the stack trace.

On that image, the field was called decisionFunction(1) (note the missing .0 suffix), which suggests that the exception happens with integer labels too? Or had you already modified this PMML document in some way?

In any case, I would really need to have access to a reproducible test case. There is full integration test coverage for sklearn.ensemble.GradientBoostingClassifier available here:

https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L188
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L394

Both integration test cases convert and evaluate correctly. What are you doing differently?

vruusmann commented 5 years ago

Attached is a demo archive, which trains a GradientBoostingClassifier for a binary classification problem where the label is encoded as double (0.0/1.0).

Training:

$ python main.py

Scoring:

$ java -jar ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/pmml-evaluator-example-executable-1.4-SNAPSHOT.jar --model Audit.pmml --input Audit.csv --output Audit-results.csv --copy-columns false

Everything works as advertised. Can you "break" this demo archive (by changing something about the GradientBoostingClassifier parameterization) so that it starts throwing this "field not found" exception?

Audit.zip

HongHuangNeu commented 5 years ago

I trained the model with an integer label and the problem is still there. The definition of "decisionFunction(1)" is right there in the output field, yet the evaluator cannot parse it, complaining about a missing field. Again, I cannot reproduce it in my own environment. Is it possible that it's related to an environment issue, like an outdated pmml-model dependency in the classpath?

vruusmann commented 5 years ago

In my demo archive, I can change the data type of the label column from double to integer, and everything still works correctly:

df["Adjusted"] = df["Adjusted"].astype(int)

I have now demonstrated twice that everything is OK. If you claim otherwise, then you need to back up your claims with hard evidence.

HongHuangNeu commented 5 years ago

@vruusmann Now I am able to reproduce the issue with a tiny program and a small GBDT PMML file. The program and the PMML file are attached below; remove the .txt suffix and you can run them.

PMMLUnitTest.scala.txt

pipe.pmml.txt

In the program file PMMLUnitTest, I constructed two small data frames, inputDF1 and inputDF2. The error happens when the evaluator is evaluating inputDF2. The error message is as follows:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$evaluationFunction$1$1: (struct<alcohol:double,typea:double,tobacco:double,age:double>) => struct<chd:int,probability(0):double,probability(1):double>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.shaded.jpmml.evaluator.MissingFieldException: Field "decisionFunction(1)" is not defined
    at org.shaded.jpmml.evaluator.EvaluationContext.lookup(EvaluationContext.java:64)
    at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:589)
    at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:315)
    at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:240)
    at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:209)
    at org.shaded.jpmml.evaluator.spark.PMMLTransformer$$anonfun$evaluationFunction$1$1.apply(PMMLTransformer.scala:78)
    at org.shaded.jpmml.evaluator.spark.PMMLTransformer$$anonfun$evaluationFunction$1$1.apply(PMMLTransformer.scala:66)
    ... 16 more

vruusmann commented 5 years ago

Thanks for the update - it's an important piece of information that the field lookup exception happens selectively (it does not happen with inputDF1, but happens with inputDF2).

The most likely explanation is that the evaluation of inputDF2 produces a so-called "missing value" in the first stage. The second stage expects to find a non-missing value; from its perspective, a "missing value" is the same as an "undefined value".

It will take some time to think about an appropriate solution. One thing is that the exception message should be more explicit about this distinction between "missing value" and "undefined value" - at the moment it seems to suggest that some JPMML converter library is producing incorrect PMML documents (whereas in reality all JPMML converters and evaluators are correct, and the problem is related to the input data record).

Another thing is that it's possible to customize the "missing prediction handling" at the PMML language level: http://mantis.dmg.org/view.php?id=178

In the current case, the model evaluation process should probably throw org.jpmml.evaluator.InvalidResultException instead - the input data record is incomplete, and it's impossible to perform the requested computation on it.

vruusmann commented 5 years ago

The most likely explanation is that the evaluation of inputDF2 produces a so-called "missing value" in the first stage.

To elaborate - the first data record has Some(1.0), but the second data record has None.

Another solution is that the MiningSchema element of this model could simply state that all input fields must have non-missing values (i.e. MiningField@missingValueTreatment="x-returnInvalid").
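A hedged sketch of such a MiningSchema declaration (the "age" field name is illustrative, taken from the discussion below):

<MiningSchema>
  <!-- one MiningField per input field that must be non-missing -->
  <MiningField name="age" missingValueTreatment="x-returnInvalid"/>
</MiningSchema>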

HongHuangNeu commented 5 years ago

Interestingly, if I change the content of inputDF1 as follows:

val inputRDD1 = spark.sparkContext.parallelize(Seq(
  TestEntry(
    a_date = "2018-11-01", adiposity = 38.03, alcohol = Some(24.26),
    b_date = "2018-10-02", chd = 1.0, dst_sp = 114.0, famhist = "Present",
    from_key = "node/bb", ldl = 6.41, new_diff = 30.0, obesity = 31.99,
    sbp = 170.0, src_sp = 170.0, to_key = "node/cc", tobacco = None,
    typea = 51.0, vfeature = 170.0, age = Some(58.0)
  )
))

val inputDF1 = spark.sqlContext.createDataFrame(inputRDD1)

The evaluation of inputDF1 will NOT crash, even though the "tobacco" feature has a missing value.

My assumption is that this particular combination of feature values happens to bypass the branches in the model which evaluate the "tobacco" feature, and which would otherwise have triggered the MissingFieldException.

Am I correct?

vruusmann commented 5 years ago

My assumption is that this particular combination of feature values happens to bypass the branches which evaluate the "tobacco" feature.

Exactly. If you want to trigger this exception on purpose with inputDF1, then you need to set the value of some top-level input field to a missing value.

For example, the "age" input field appears to be a popular first splitting criterion. If you set the value of the "age" input field to a missing value, then the prediction should always fail.
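A hedged sketch of such a record, reusing the TestEntry case class from the attached PMMLUnitTest.scala - all values are copied from the inputDF1 example above, except that "tobacco" is given a non-missing placeholder and "age" is left missing:

val failingRDD = spark.sparkContext.parallelize(Seq(
  TestEntry(
    a_date = "2018-11-01", adiposity = 38.03, alcohol = Some(24.26),
    b_date = "2018-10-02", chd = 1.0, dst_sp = 114.0, famhist = "Present",
    from_key = "node/bb", ldl = 6.41, new_diff = 30.0, obesity = 31.99,
    sbp = 170.0, src_sp = 170.0, to_key = "node/cc", tobacco = Some(0.0),
    typea = 51.0, vfeature = 170.0,
    age = None  // the first splitting criterion is missing, so every prediction should fail
  )
))
val failingDF = spark.sqlContext.createDataFrame(failingRDD)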

vruusmann commented 5 years ago

This exception was changed from MissingFieldException to MissingValueException in JPMML-Evaluator version 1.4.5: https://github.com/jpmml/jpmml-evaluator/commit/60a836e7a3be3fdca132ed4896c43e0af3bc1ee3

The base version of the JPMML-Evaluator-Spark project is currently 1.4.4: https://github.com/jpmml/jpmml-evaluator-spark/blob/master/pom.xml#L8

So, a simple base version update (scheduled to happen later this week) should solve most of the confusion.