jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

DuplicateFieldValueException after loading PMML in Java generated by Nyoka #264

Closed mbicanic closed 1 year ago

mbicanic commented 1 year ago

Training and exporting in Python

I am training a LightGBM model via its scikit-learn interface lightgbm.LGBMClassifier and then trying to export the model into PMML using sklearn2pmml. This is the code used to train the PMMLPipeline:

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def train_model(model: LGBMClassifier, params: dict, X_train: pd.DataFrame, Y_train: np.ndarray):
    # Note: the `model` argument is unused; a fresh LGBMClassifier is built from `params`.
    pipe = PMMLPipeline([('classifier', LGBMClassifier(**params))])
    pipe.fit(X_train, Y_train)
    # Embed 100 sampled rows as verification data in the exported PMML.
    X_sample = X_train.sample(n=100, random_state=42)
    pipe.verify(X_sample)
    sklearn2pmml(pipe, "model.pmml", with_repr=True)

Versions:

The generated PMML file

Due to confidentiality, I cannot share the whole PMML file here, but I can describe its general structure:

<MiningModel functionName="classification" algorithmName="LightGBM">
  <MiningSchema>
    <MiningField name="y" usageType="target"/>
    <MiningField name="feature1" importance="517.0"/>
    ...
    <MiningField name="feature47" importance="258.0"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="modelChain" missingPredictionTreatment="returnMissing">
    <Segment id="1">
      <True/>
      <MiningModel functionName="regression">
      <MiningSchema> SAME AS ABOVE, BUT WITHOUT FEATURE IMPORTANCES </MiningSchema>
      <Output>
        <OutputField name="lgbmValue" optype="continuous" dataType="double" isFinalResult="false"/>
      </Output>
      <Segmentation multipleModelMethod="sum" missingPredictionTreatment="returnLastPrediction">
        <Segment id="1">
          <True/>
          <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
            DEFINITION OF TreeModel WITH A BUNCH OF <Node> TAGS
          </TreeModel>
        </Segment>
        ...
        <Segment id="364">
          <True/>
          <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
            DEFINITION OF TreeModel WITH A BUNCH OF <Node> TAGS
          </TreeModel>
        </Segment>
      </Segmentation>
      </MiningModel>
    </Segment>
    <Segment id="2">
      <True/>
      <RegressionModel functionName="classification" normalizationMethod="logit">
        <MiningSchema>
          <MiningField name="y" usageType="target"/>
          <MiningField name="lgbmValue"/>
        </MiningSchema>
        <Output>
          <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
          <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="0.0" targetCategory="1">
          <NumericPredictor name="lgbmValue" coefficient="1.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
      </RegressionModel>
    </Segment>
  </Segmentation>
  <ModelVerification recordCount="100">...</ModelVerification>
</MiningModel>

As you can see, there are three <OutputField> tags in total in the whole file:

  1. The internal LGBM output: <OutputField name="lgbmValue" optype="continuous" dataType="double" isFinalResult="false"/>
  2. The external classifier output regarding the probability the target is 0: <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
  3. The external classifier output regarding the probability the target is 1: <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>

Loading into Java

I use the org.jpmml:jpmml-evaluator-metro:1.6.4 library to load and use PMML models in Java. This is the code I'm using:

Evaluator evaluator = new LoadingModelEvaluatorBuilder().load(new File("/path/to/model.pmml")).build();
evaluator.verify();
System.out.println("Output fields: " + evaluator.getOutputFields());
System.out.println("Target field(s): " + evaluator.getTargetFields());

With this snippet, I get the following output:

Output fields: [
  OutputField{name=probability_0, fieldName=probability_0, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=1}, 
  OutputField{name=probability_1, fieldName=probability_1, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=1}, 
  OutputField{name=predicted_y, fieldName=predicted_y, displayName=null, opType=categorical, dataType=integer, finalResult=true, depth=1}, 
  OutputField{name=probability_0, fieldName=probability_0, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=0}, 
  OutputField{name=probability_1, fieldName=probability_1, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=0}, 
  OutputField{name=predicted_y, fieldName=predicted_y, displayName=null, opType=categorical, dataType=integer, finalResult=true, depth=0}
]
Target field(s): [TargetField{name=y, fieldName=y, displayName=null, opType=categorical, dataType=integer}]

As you can see, all the output fields are duplicated - once for depth=1 and once for depth=0. These OutputField definitions are obviously not present in the PMML file itself, so I am wondering where they are coming from and how to get rid of them?
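
One way to test whether duplicated output fields are really absent from the file is to list every OutputField name in the document and count repeats. The following is a minimal sketch using only the Python standard library; the sample document and the helper names (`output_field_names`, `duplicated_names`) are hypothetical and do not come from the actual model.pmml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed PMML document used only for illustration.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel functionName="classification">
    <Output>
      <OutputField name="probability_0" feature="probability" value="0"/>
    </Output>
    <Segmentation multipleModelMethod="modelChain">
      <Segment id="1">
        <True/>
        <RegressionModel functionName="classification">
          <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
          </Output>
        </RegressionModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return element.tag.rsplit("}", 1)[-1]


def output_field_names(pmml_text):
    """Return the name of every OutputField element, at any nesting depth."""
    root = ET.fromstring(pmml_text)
    return [el.get("name") for el in root.iter() if local_name(el) == "OutputField"]


def duplicated_names(names):
    """Return the names that occur more than once, sorted alphabetically."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


names = output_field_names(SAMPLE_PMML)
duplicates = duplicated_names(names)
print("Output fields:", names)
print("Declared more than once:", duplicates)
```

For a file on disk, the same helpers could be fed the file's text; a non-empty list of duplicates would suggest the names are declared in the document itself, at more than one nesting level.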

Problem: Cannot evaluate due to DuplicateFieldValueException

The problem with this is that I cannot call evaluator.evaluate(features), because I get the following error:

org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability_0" has already been defined
    at org.jpmml.evaluator.EvaluationContext.declare(EvaluationContext.java:130)
    at org.jpmml.evaluator.OutputUtil.evaluate(OutputUtil.java:438)
    at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:467)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:300)

I tried everything I could find in previous related issues (https://github.com/jpmml/jpmml-sparkml/issues/92, https://github.com/jpmml/jpmml-sparkml-xgboost/issues/13, https://github.com/jpmml/jpmml-sparkml-xgboost/issues/15) and the documentation, but it didn't help, so I am beginning to think this is an issue with the library, since phantom OutputFields are being created.

I apologize in advance if this is due to me using the library in a wrong way, but I would appreciate any and all help you could provide.

vruusmann commented 1 year ago

Just look at your own data!

SkLearn2PMML/JPMML-SkLearn produces a LightGBM model that has the following schema:

  1. Sole target field y
  2. Two probability-type output fields probability(0) and probability(1)

Does your misbehaving model.pmml look like the above?

vruusmann commented 1 year ago

Closing as invalid - the user is attempting to evaluate invalid PMML documents (generated by N****, not SkLearn2PMML).

mbicanic commented 1 year ago

@vruusmann I used to use Nyoka, but I had other issues with it. I guarantee that this particular model.pmml was generated with sklearn2pmml. I really don't understand why the hostility and the certainty I used nyoka? Here are the first few lines of the generated PMML, directly copy-pasted:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="SkLearn2PMML package" version="0.91.0"/>
        <Timestamp>2023-03-31T09:57:19Z</Timestamp>
    </Header>
    <MiningBuildTask>
        <Extension name="repr">PMMLPipeline(steps=[('classifier', LGBMClassifier(class_weight={0: 0.05, 1: 0.95},
               learning_rate=0.07168998753077896, max_depth=18,
               min_data_in_leaf=418, n_estimators=364, num_leaves=58,
               objective='binary', reg_alpha=0.07558439164814572,
               reg_lambda=0.05483594313753313))])</Extension>
    </MiningBuildTask>

It explicitly says SkLearn2PMML package, so I am very confused how you got to the conclusion I used nyoka? Please, undo the change of the issue title, because it is dishonest. I wouldn't come here with this question if I generated the PMML with nyoka, as I am well aware there could be incompatibilities between them.

And once we agree that the PMML file has indeed been generated with sklearn2pmml, I would greatly appreciate an explanation or at least a helping hint regarding the duplication of OutputFields in the loaded model.

vruusmann commented 1 year ago

I guarantee that this particular model.pmml was generated with sklearn2pmml.

Your DuplicateFieldValueException was raised when scoring a Nyoka-produced PMML document. The JPMML-Evaluator library is not renaming existing OutputField elements, and is not inventing new ones.

That's a hard fact. No point in arguing - open your model.pmml in text editor, and take a look into it.

I really don't understand why the hostility and the certainty I used nyoka?

Because Nyoka is generating invalid/irreproducible PMML documents, and then it is me who has to prove over and over again that JPMML software is correct.

I wouldn't come here with this question if I generated the PMML with nyoka

Please attach your model.pmml here (or send it to my e-mail), so that we can resolve this issue based on factual matters.

vruusmann commented 1 year ago

org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability_0" has already been defined

All JPMML conversion libraries name probability fields using a probability(<category>) pattern. Nyoka (and related stuff) uses a probability_<category> pattern.

Now, seeing that the duplicate output field is called probability_0, which statement is likely the correct one?

  1. The PMML document was generated by SkLearn2PMML (based on JPMML-SkLearn)
  2. The PMML document was generated by Nyoka.
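
The naming argument above can be turned into a quick check. This is a rough sketch, assuming only the two patterns described in this comment; the function name `naming_style` and the labels it returns are made up for illustration, and a match is a hint about the producing tool, not proof.

```python
import re

# probability(<category>): the pattern attributed above to JPMML conversion libraries.
PARENTHESIZED = re.compile(r"^probability\((?P<category>.+)\)$")
# probability_<category>: the pattern attributed above to Nyoka and related tools.
UNDERSCORED = re.compile(r"^probability_(?P<category>.+)$")


def naming_style(field_name):
    """Classify an output field name by the probability naming pattern it follows."""
    if PARENTHESIZED.match(field_name):
        return "parenthesized"
    if UNDERSCORED.match(field_name):
        return "underscored"
    return "other"


for name in ["probability(0)", "probability(1)", "probability_0", "predicted_y"]:
    print(name, "->", naming_style(name))
```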

mbicanic commented 1 year ago

@vruusmann I apologize, it was indeed my mistake. As I said, I used nyoka before, and had to migrate to sklearn2pmml and jpmml-evaluator due to other issues.

The problem was that I am using MLflow to register models. I have a script that trains a model, saves it to PMML, and then registers the model together with the model.pmml artifact. Ever since I modified the script to use sklearn2pmml instead of nyoka, the connection to MLflow wasn't working properly, so even though the local PMML file created by the training script was indeed generated by sklearn2pmml, the "latest" MLflow model I was fetching in Java was still the one relying on a nyoka PMML.
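
A stale artifact like this could be caught by inspecting the PMML header of the downloaded file before it is handed to the evaluator. The sketch below is one possible guard, not part of any library: the helper names are hypothetical, and the expected application name is simply the string from the header quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET

# Header taken from the SkLearn2PMML-generated file quoted earlier in the thread.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header>
        <Application name="SkLearn2PMML package" version="0.91.0"/>
        <Timestamp>2023-03-31T09:57:19Z</Timestamp>
    </Header>
</PMML>"""


def application_info(root):
    """Return (name, version) of the first Application element, or (None, None)."""
    for el in root.iter():
        # ElementTree tags look like '{namespace}Application'; compare the local part.
        if el.tag.rsplit("}", 1)[-1] == "Application":
            return el.get("name"), el.get("version")
    return None, None


def require_generator(root, expected_name):
    """Raise ValueError unless the header names the expected generating application."""
    name, version = application_info(root)
    if name != expected_name:
        raise ValueError(f"Unexpected PMML generator: {name!r} (version {version!r})")
    return name, version


root = ET.fromstring(SAMPLE_PMML)
print(require_generator(root, "SkLearn2PMML package"))
```

For a file on disk, `ET.parse("model.pmml").getroot()` yields the same kind of root element. Comparing the header's Timestamp against the training run would be another cheap way to detect an outdated artifact.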

Once again, I apologize for wasting your time and insisting I was correct; I was completely unaware of this problem. Nevertheless, I appreciate that you in the end explained why and how you know the file was Nyoka-generated - it was very helpful. Thank you for your time and effort!

vruusmann commented 1 year ago

Apology accepted!

The problem was that I am using MLflow to register models.

Do you have this MLflow integration project available somewhere? Pure Java, or Java-wrapped-into-Python?

I've meant to provide such integration myself, but haven't started yet.

mbicanic commented 1 year ago

Unfortunately, the project is not available publicly as it's a company project. However, it's not really an integration in the strict sense of the word, it's more of a bypass. I am normally registering the model as a Python sklearn model, and then additionally logging the PMML file as an artifact:

import mlflow
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def train_model(model: LGBMClassifier, params: dict, X_train: pd.DataFrame, Y_train: np.ndarray):
    pipe = PMMLPipeline([('classifier', LGBMClassifier(**params))])
    pipe.fit(X_train, Y_train)

    X_sample = X_train.sample(n=100, random_state=42)
    pipe.verify(X_sample)
    sklearn2pmml(pipe, "model.pmml", with_repr=True)

    return pipe['classifier']

def log_model(model: LGBMClassifier, X_data: pd.DataFrame):
    # Register the fitted classifier as a regular scikit-learn model.
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="",
        registered_model_name=MODEL_NAME,
        signature=mlflow.models.signature.infer_signature(X_data)
    )
    mlflow.log_artifact("model.pmml")  # referring to the local file generated in train_model

X_train, Y_train = load_dataset(...)
# `params` (the hyperparameter dict) is assumed to be defined elsewhere; it is not shown here.
model = train_model(LGBMClassifier(), params, X_train, Y_train)
log_model(model, X_train)

And then in Java, instead of loading the model, I just load the PMML artifact as a file and initialize the Evaluator class with it:

    private Evaluator loadModel(String modelName) throws Exception {
        try (MlflowClient client = new MlflowClient(MLFLOW_URI)) {
            ModelRegistry.ModelVersion version = client.getRegisteredModel(modelName).getLatestVersions(0);
            File artifactDir = client.downloadArtifacts(version.getRunId());
            File[] files = artifactDir.listFiles(f -> f.getName().equals("model.pmml"));
            Evaluator evaluator = new LoadingModelEvaluatorBuilder().load(files[0]).build();
            evaluator.verify();
            return evaluator;
        }
    }

It's a pretty simple process, all things considered, and surprisingly easy to use Python models in Java this way, while also leveraging MLflow.