jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

TreeModel prediction mismatch between KNIME and JPMML #36

Closed LuanGarrido closed 8 years ago

LuanGarrido commented 8 years ago

Hello,

I trained two models in KNIME: a Neural Network and a Decision Tree.

I'm comparing the results in KNIME and in Java.

With the Neural Network, I'm getting the same results in both.

With the Decision Tree model, all observations are predicted as false.

I tried reading the PMML model back into KNIME, but the results still do not match.

Can you help me?

image

vruusmann commented 8 years ago

You need to share with me:

  1. A PMML file
  2. A CSV file with input data records (this is what you're getting from "CSV Reader", and is connected to "Decision Tree Learner" input port).
  3. A CSV file with expected output data records (this is what you're getting from "Decision Tree Predictor").

These two CSV files need only contain five to ten data records each - something that qualifies as a reproducible test case.

At the moment, without having seen any data, I'm pretty confident that the problem lies with KNIME - most probably it's simply producing incorrect PMML markup.

LuanGarrido commented 8 years ago

Yeah, of course =)

The first one contains train and test data.

The second one contains the PMML model created by the KNIME workflow shown above.

Thanks for helping me =)

input.tar.gz decisionTree3.model.tar.gz

vruusmann commented 8 years ago

First, I split your tokensTagsTest3.csv file into input.csv (columns Col0, Col1 and Col2) and expected-output.csv (column Col3) files. Then, I tested them against each other using the org.jpmml.evaluator.TestingExample example application:

$ java -cp ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/example-1.3-SNAPSHOT.jar org.jpmml.evaluator.TestingExample --model decisionTree3.model --input input.csv --expected-output expected-output.csv --separator ";" > diff.txt 2>&1

This testing reveals 60 conflicts.
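For reference, the column split described above could be sketched in plain Java roughly as follows. This is only an illustration: the class and method names are made up, the header row `Col0;Col1;Col2;Col3` is an assumption about the file layout, and the naive `split` does not handle quoted fields. The sample data row is the record from the first conflict reported by the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CsvColumnSplit {

    /** Keeps only the cells with index in [from, to) of each delimited line. */
    public static List<String> selectColumns(List<String> lines, String separator, int from, int to) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Quote the separator so it is matched literally; -1 keeps trailing empty cells.
            String[] cells = line.split(Pattern.quote(separator), -1);
            result.add(String.join(separator, Arrays.copyOfRange(cells, from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed layout: three input columns followed by the expected target column.
        List<String> lines = Arrays.asList(
            "Col0;Col1;Col2;Col3",
            "conj-s;v-fin;art;false");

        // Would become input.csv (Col0, Col1, Col2).
        System.out.println(selectColumns(lines, ";", 0, 3));
        // Would become expected-output.csv (Col3).
        System.out.println(selectColumns(lines, ";", 3, 4));
    }
}
```

In practice the lines would be read from tokensTagsTest3.csv and the two results written to separate files; a CSV library is the safer choice if any field can contain the separator.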

The first conflict is on the third input line:

Conflict{id=2, arguments={Col0=conj-s, Col1=v-fin, Col2=art}, difference=not equal: value differences={Col3=(false, NodeScoreDistribution{result=true, probability_entries=[false=0.25, true=0.75], entityId=350, confidence_entries=[]})}}

Now, if you open your decision tree PMML file in a text editor, and execute its algorithm manually, then the winning Node elements are selected in this order: 261 -> 332 -> 350. And the Node@id=350 element predicts "true", with the associated probability distribution {"true" = 0.75, "false" = 0.25}.
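The manual walk can be pictured with a small Java sketch. It is not JPMML-Evaluator code and it does not read the actual PMML file: the `Node` class, the predicates, and the sibling nodes 333 and 351 are all invented for illustration, and only the node ids 261, 332, 350 and the score "true" come from the analysis above. The idea is the usual TreeModel traversal: at each node, test the child predicates in document order and descend into the first child that matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class TreeWalkSketch {

    /** A stripped-down stand-in for a PMML Node element. */
    public static final class Node {
        public final String id;
        public final String score;
        public final Predicate<Map<String, String>> predicate;
        public final List<Node> children = new ArrayList<>();

        Node(String id, String score, Predicate<Map<String, String>> predicate) {
            this.id = id;
            this.score = score;
            this.predicate = predicate;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Returns the nodes visited from the root down to the last matching node.
     * Simplification: if no child matches, the walk stops at the current node,
     * which resembles noTrueChildStrategy="returnLastPrediction", not the PMML default.
     */
    public static List<Node> walk(Node root, Map<String, String> record) {
        List<Node> path = new ArrayList<>();
        Node current = root;
        path.add(current);
        boolean descended = true;
        while (descended) {
            descended = false;
            for (Node child : current.children) {
                if (child.predicate.test(record)) { // first matching child wins
                    current = child;
                    path.add(child);
                    descended = true;
                    break;
                }
            }
        }
        return path;
    }

    /** Hypothetical tree: the predicates and nodes 333/351 are NOT taken from the real model. */
    public static Node exampleTree() {
        Node leaf350 = new Node("350", "true", r -> "art".equals(r.get("Col2")));
        Node leaf351 = new Node("351", "false", r -> true);
        Node node332 = new Node("332", "false", r -> "v-fin".equals(r.get("Col1")))
            .add(leaf350).add(leaf351);
        Node node333 = new Node("333", "false", r -> true);
        return new Node("261", "false", r -> true).add(node332).add(node333);
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("Col0", "conj-s");
        record.put("Col1", "v-fin");
        record.put("Col2", "art");

        List<Node> path = walk(exampleTree(), record);
        String ids = path.stream().map(n -> n.id).collect(Collectors.joining(" -> "));
        System.out.println(ids + " predicts " + path.get(path.size() - 1).score);
    }
}
```

With the invented predicates above, the record from the first conflict follows the path 261 -> 332 -> 350 and ends on a leaf whose score is "true". The real predicates have to be read from the Node elements in the PMML file.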

The conclusion is that JPMML-Evaluator is carrying out the evaluation exactly as specified in the PMML file. If you're not happy with these predictions, then you need to look into KNIME. Most likely, KNIME is making some sort of error during PMML generation.

vruusmann commented 8 years ago

Out of curiosity, I took your training dataset tokensTags3.csv and built a decision tree using R's "rpart" function:

library("rpart")
library("pmml")

tt = read.csv("tokensTags3.csv", sep = ";")

tt.rpart = rpart(Col3 ~ ., data = tt, method = "class", control = rpart.control(maxcompete = 0, maxsurrogate = 0))
saveXML(pmml(tt.rpart, dataset = tt.rpart), "tokensTags3.pmml")

classes = predict(tt.rpart, type = "class")
probabilities = predict(tt.rpart, type = "prob")

result = data.frame("Col3" = classes, "Predicted_Col3" = classes, "Probability_false" = probabilities[, 1], "Probability_true" = probabilities[, 2])
write.csv(result, "expected-output.csv", quote = FALSE, row.names = FALSE)

The testing now passes cleanly:

$ java -cp ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/example-1.3-SNAPSHOT.jar org.jpmml.evaluator.TestingExample --model tokensTags3.pmml --input tokensTags3.csv --expected-output expected-output.csv

LuanGarrido commented 8 years ago

I really appreciate your help =)

I'm going to try another framework here.

Thank you very much, my friend.