Just look at your own data!
SkLearn2PMML/JPMML-SkLearn produces a LightGBM model that has the following schema: y, probability(0) and probability(1).
Does your misbehaving model.pmml look like the above?
Closing as invalid - the user is attempting to evaluate invalid PMML documents (generated by N****, not SkLearn2PMML).
@vruusmann I used to use Nyoka, but I had other issues with it. I guarantee that this particular model.pmml was generated with sklearn2pmml. I really don't understand why the hostility and the certainty I used nyoka? Here are the first few lines of the generated PMML, directly copy-pasted:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
<Header>
<Application name="SkLearn2PMML package" version="0.91.0"/>
<Timestamp>2023-03-31T09:57:19Z</Timestamp>
</Header>
<MiningBuildTask>
<Extension name="repr">PMMLPipeline(steps=[('classifier', LGBMClassifier(class_weight={0: 0.05, 1: 0.95},
learning_rate=0.07168998753077896, max_depth=18,
min_data_in_leaf=418, n_estimators=364, num_leaves=58,
objective='binary', reg_alpha=0.07558439164814572,
reg_lambda=0.05483594313753313))])</Extension>
</MiningBuildTask>
It explicitly says SkLearn2PMML package, so I am very confused how you got to the conclusion I used nyoka? Please, undo the change of the issue title, because it is dishonest. I wouldn't come here with this question if I generated the PMML with nyoka, as I am well aware there could be incompatibilities between them.
And once we agree that the PMML file has indeed been generated with sklearn2pmml, I would greatly appreciate an explanation or at least a helping hint regarding the duplication of OutputFields in the loaded model.
I guarantee that this particular model.pmml was generated with sklearn2pmml.
Your DuplicateFieldValueException was raised when scoring a Nyoka-produced PMML document. The JPMML-Evaluator library is not renaming existing OutputField elements, and is not inventing new ones.
That's a hard fact. No point in arguing - open your model.pmml in a text editor, and take a look into it.
I really don't understand why the hostility and the certainty I used nyoka?
Because Nyoka is generating invalid/irreproducible PMML documents, and then it is me who has to prove over and over again that JPMML software is correct.
I wouldn't come here with this question if I generated the PMML with nyoka
Please attach your model.pmml here (or send it to my e-mail), so that we can resolve this issue based on factual matters.
org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability_0" has already been defined
All JPMML conversion libraries name probability fields using a probability(<category>) pattern. Nyoka (and related stuff) uses a probability_<category> pattern.
Now, seeing that the duplicate output field is called probability_0, which statement is likely the correct one?
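The two naming patterns can also be told apart programmatically instead of by eye. The following is a minimal sketch using only the Python standard library. The helper names `output_field_names` and `guess_naming_style` are made up for this illustration, and the regular expressions encode only the two patterns mentioned in this thread, not a complete list of PMML generators:

```python
import re
import xml.etree.ElementTree as ET

# Patterns as described in this thread (assumption: not exhaustive).
PAREN_STYLE = re.compile(r"^probability\(.+\)$")    # e.g. probability(0)
UNDERSCORE_STYLE = re.compile(r"^probability_.+$")  # e.g. probability_0


def output_field_names(pmml_text: str) -> list:
    """Return the name of every OutputField element, ignoring XML namespaces."""
    root = ET.fromstring(pmml_text)
    return [
        elem.get("name", "")
        for elem in root.iter()
        if elem.tag.rsplit("}", 1)[-1] == "OutputField"
    ]


def guess_naming_style(names: list) -> str:
    """Hypothetical helper: classify how probability fields are named."""
    if any(UNDERSCORE_STYLE.match(n) for n in names):
        return "probability_<category>"
    if any(PAREN_STYLE.match(n) for n in names):
        return "probability(<category>)"
    return "unknown"


if __name__ == "__main__":
    sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
      <Output>
        <OutputField name="probability(0)" feature="probability" value="0"/>
        <OutputField name="probability(1)" feature="probability" value="1"/>
      </Output>
    </PMML>"""
    names = output_field_names(sample)
    print(names, guess_naming_style(names))
```

To check a real file, pass the contents of model.pmml to `output_field_names`. A result of `probability_<category>` would suggest the file did not come from a JPMML conversion library.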
@vruusmann I apologize, it was indeed my mistake. As I said, I used nyoka before, and had to migrate to sklearn2pmml and pmml-evaluator due to other issues.
The problem was that I am using MLflow to register models. I have a script that trains a model, saves it to PMML, and then registers the model together with the model.pmml artifact. Ever since I modified the script to use sklearn2pmml instead of nyoka, the connection to MLflow wasn't working properly, so even though the local PMML file created by the training script was indeed generated by sklearn2pmml, the "latest" MLflow model I was fetching in Java was still the one relying on a nyoka PMML.
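One way to catch this kind of stale-artifact mix-up early is to read the Header of the file that was actually downloaded, rather than the local copy. The following is a minimal sketch with the Python standard library. `pmml_provenance` is a hypothetical helper name, and the sample header mirrors the one quoted earlier in this thread:

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pmml_provenance(pmml_text: str) -> dict:
    """Hypothetical helper: extract generator name, version and timestamp
    from the Header element of a PMML document."""
    info = {"application": None, "version": None, "timestamp": None}
    root = ET.fromstring(pmml_text)
    for child in root:
        if _local(child.tag) != "Header":
            continue
        for elem in child:
            if _local(elem.tag) == "Application":
                info["application"] = elem.get("name")
                info["version"] = elem.get("version")
            elif _local(elem.tag) == "Timestamp":
                info["timestamp"] = (elem.text or "").strip()
    return info


if __name__ == "__main__":
    sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
      <Header>
        <Application name="SkLearn2PMML package" version="0.91.0"/>
        <Timestamp>2023-03-31T09:57:19Z</Timestamp>
      </Header>
    </PMML>"""
    info = pmml_provenance(sample)
    # Fail fast if the artifact was produced by an unexpected tool.
    if info["application"] != "SkLearn2PMML package":
        raise SystemExit(f"Unexpected PMML generator: {info['application']}")
    print(info)
```

Running this check on the artifact fetched from the registry, and comparing the timestamp against the training run, would show whether the "latest" model is really the one that was just exported.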
Once again, I apologize for wasting your time and insisting I was correct, I was completely unaware of this problem. Nevertheless, I appreciate that you in the end explained why and how you know the file was Nyoka-generated - it was very helpful. Thank you for your time and effort!
Apology accepted!
The problem was that I am using MLflow to register models.
Do you have this MLflow integration project available somewhere? Pure Java, or Java-wrapped-into-Python?
I've meant to provide such integration myself, but haven't started yet.
Unfortunately, the project is not available publicly as it's a company project. However, it's not really an integration in the strict sense of the word, it's more of a bypass. I am normally registering the model as a Python sklearn model, and then additionally logging the PMML file as an artifact:
import mlflow
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def train_model(model: LGBMClassifier, params: dict, X_train: pd.DataFrame, Y_train: np.ndarray):
    # Note: the `model` argument is unused; the classifier is built from `params`
    pipe = PMMLPipeline([('classifier', LGBMClassifier(**params))])
    pipe.fit(X_train, Y_train)
    # Embed a sample of the training data as verification data in the PMML document
    X_sample = X_train.sample(n=100, random_state=42)
    pipe.verify(X_sample)
    sklearn2pmml(pipe, "model.pmml", with_repr=True)
    return pipe['classifier']

def log_model(model: LGBMClassifier, X_data: pd.DataFrame):
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="",
        registered_model_name=MODEL_NAME,
        signature=mlflow.models.signature.infer_signature(X_data)
    )
    mlflow.log_artifact("model.pmml")  # referring to the local file generated in train_model

X_train, Y_train = load_dataset(...)
# `params` is the hyperparameter dict; it was missing from the original call
model = train_model(LGBMClassifier(), params, X_train, Y_train)
log_model(model, X_train)
And then in Java, instead of loading the model, I just load the PMML artifact as a file and initialize the Evaluator class with it:
private Evaluator loadModel(String modelName) throws Exception {
    try (MlflowClient client = new MlflowClient(MLFLOW_URI)) {
        ModelRegistry.ModelVersion version = client.getRegisteredModel(modelName).getLatestVersions(0);
        File artifactDir = client.downloadArtifacts(version.getRunId());
        File[] files = artifactDir.listFiles(f -> f.getName().equals("model.pmml"));
        Evaluator evaluator = new LoadingModelEvaluatorBuilder().load(files[0]).build();
        evaluator.verify();
        return evaluator;
    }
}
It's a pretty simple process, all things considered, and surprisingly easy to use Python models in Java this way, while also leveraging MLflow.
Training and exporting in Python
I am training a LightGBM model via its scikit-learn interface lightgbm.LGBMClassifier and then trying to export the model into PMML using sklearn2pmml. This is the code used to train the PMMLPipeline:
Versions:
scikit-learn: 1.2.0
sklearn2pmml: 0.91.0
The generated PMML file
Due to confidentiality, I cannot share the whole PMML file here, but I can describe its general structure:
As you can see, there are three <OutputField> tags in total in the whole file:
<OutputField name="lgbmValue" optype="continuous" dataType="double" isFinalResult="false"/>
<OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
<OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
Loading into Java
I use the org.jpmml:jpmml-evaluator-metro:1.6.4 library to load and use PMML models in Java. This is the code I'm using:
With this snippet, I get the following output:
As you can see, all the output fields are duplicated - once for depth=1 and once for depth=0. These OutputField definitions are obviously not present in the PMML file itself, so I am wondering where they are coming from and how to get rid of them?
Problem: Cannot evaluate due to DuplicateFieldValueException
The problem with this is that I cannot call evaluator.evaluate(features), because I get the following error:
I tried everything I could find in previous related issues (https://github.com/jpmml/jpmml-sparkml/issues/92, https://github.com/jpmml/jpmml-sparkml-xgboost/issues/13, https://github.com/jpmml/jpmml-sparkml-xgboost/issues/15) and the documentation, but it didn't help, so I am beginning to think this is an issue with the library, since phantom OutputFields are being created.
I apologize in advance if this is due to me using the library in a wrong way, but I would appreciate any and all help you could provide.