Results from the LightGBM predict method are not the same as those from the evaluate method.

jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML

GNU Affero General Public License v3.0


Results from the LightGBM predict method are not the same as those from the evaluate method. #33

Closed. ogrygorian-clgx closed this issue 4 years ago

ogrygorian-clgx commented 4 years ago

I'm creating a lightgbm.basic.Booster from a previously trained LightGBM model that was saved to a file. After that, I make predictions using the "predict" method.
From the same saved model file, I create a PMML document using the "jpmml-lightgbm" library and make predictions using the "evaluate" method.
When I compare the results, they are not the same. What can be the reason of this behavior?

vruusmann commented 4 years ago

What can be the reason of this behavior?

Most likely, an incomplete and/or invalid feature specification.

I understand that the original LightGBM model was trained using the low-level "Training API", and it was later wrapped manually to be compatible with "Scikit-Learn API"? Did you pay attention to the representation of missing values (NaN, -999, -1, 0, something else), the encoding of categorical features (as integers, as strings), etc.?

For comparison, train a LightGBM model using "Scikit-Learn API" from scratch, and everything should work as advertised.

If you're able to provide a reproducible example where the JPMML stack is unable to reproduce LightGBM predictions for a dense dataset of continuous features only (i.e. no missing values, no categorical features), then it would be something that I'd be interested in looking into. Right now it's an obscure human operator error.

ogrygorian-clgx commented 4 years ago

@vruusmann Thank you for the fast response. I'm sorry, I didn't train the model and can't answer all your questions. My part was to create a PMML document from the model and use it in the Java app. What I can see from the incoming data is that they have "NaN" instead of missing values. Do you have any suggestions about how the null values should be treated to work better with jpmml?

vruusmann commented 4 years ago

they have "NaN" instead of missing values.

According to (J)PMML conventions, NaN is an invalid value, not a missing value.

Do you have any suggestions about how the null values should be treated to work better with jpmml?

The representation of missing values is PMML engine specific. For example, the JPMML-Evaluator library treats a null reference as a missing value:

Map<FieldName, Object> arguments = new LinkedHashMap<>();
arguments.put(FieldName.create("MyField"), null);

As a first step, you should filter your input data and replace all Double.NaN and Float.NaN values with null references.
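The filtering step described above can be sketched as follows. This is a minimal illustration in Python rather than in the Java app from this thread: Python's None stands in for a Java null reference, and the field names and the helper name nan_to_none are hypothetical. In Java the equivalent check would rely on Double.isNaN and Float.isNaN.

```python
import math


def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        name: None if isinstance(value, float) and math.isnan(value) else value
        for name, value in record.items()
    }


# Hypothetical input record; the field names are made up for illustration.
record = {"MyField": float("nan"), "OtherField": 1.5, "Category": "A"}
cleaned = nan_to_none(record)
print(cleaned)
```

The helper builds a new dictionary instead of mutating its argument, so the original record stays available for comparing predictions before and after the replacement.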

I assume that right now all predictions are wrong for data records that contain NaN values. How about the other data records? Are they correct or not?
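One way to answer that question is to count mismatches separately for records with and without NaN values. The sketch below assumes the two sets of predictions are already available as plain lists of numbers; the function names, the tolerance, and the sample data are all made up for illustration.

```python
import math


def has_nan(record):
    """Tell whether any value in the record is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in record.values())


def mismatch_report(records, expected, actual, tol=1e-6):
    """Count [mismatches, total] for records with and without NaN values."""
    report = {"with_nan": [0, 0], "without_nan": [0, 0]}
    for record, e, a in zip(records, expected, actual):
        key = "with_nan" if has_nan(record) else "without_nan"
        report[key][1] += 1
        if not math.isclose(e, a, rel_tol=tol, abs_tol=tol):
            report[key][0] += 1
    return report


# Made-up sample: expected = LightGBM predictions, actual = PMML predictions.
records = [{"x": 1.0}, {"x": float("nan")}, {"x": 2.0}]
expected = [0.2, 0.7, 0.9]
actual = [0.2, 0.4, 0.9]
report = mismatch_report(records, expected, actual)
print(report)
```

If mismatches show up only under "with_nan", the missing value representation is the likely cause. Mismatches under "without_nan" would point at something else, such as the encoding of categorical features.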

ogrygorian-clgx commented 4 years ago

@vruusmann Thank you very much for the advice. I'll try that. I believe that all data records contain NaN values, so hopefully it'll help.

vruusmann commented 4 years ago

If you were using a Scikit-Learn based LightGBM-to-PMML conversion workflow, then you could declare that all invalid values (here NaN values) should be automatically converted to missing values using the sklearn2pmml.decoration.(Categorical|Continuous)Domain decorators:

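from lightgbm import LGBMClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    # Treat invalid values (here NaN values) in this column as missing values
    ("MyContinuousField", ContinuousDomain(invalid_value_treatment = "as_missing"))
  ])),
  ("classifier", LGBMClassifier())
])
# Fit parameters are routed to a pipeline step using the "<step name>__" prefix
pipeline.fit(X, y, classifier__categorical_feature = ..)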

This way, the PMML document would contain the following markup:

<MiningModel>
  <MiningSchema>
    <MiningField name="MyContinuousField" invalidValueTreatment="asMissing"/>
  </MiningSchema>
</MiningModel>

This MiningField@invalidValueTreatment attribute value would cause the model to perform this NaN -> null replacement automatically for you. In other words, you could send the original unfiltered dataset into the model, and there should be correct predictions coming out.

Reference: http://dmg.org/pmml/v4-3/MiningSchema.html
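To check whether a generated PMML document actually declares this attribute, the mining schema can be inspected with a few lines of Python. This is only a sketch: the embedded document is a trimmed, hand-written fragment rather than real converter output, the second field and the function name invalid_value_treatments are hypothetical, and the fallback value "returnInvalid" is the PMML default for fields that do not set the attribute.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written fragment; a real PMML document contains more elements.
PMML_SNIPPET = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <MiningModel functionName="classification">
    <MiningSchema>
      <MiningField name="MyContinuousField" invalidValueTreatment="asMissing"/>
      <MiningField name="OtherField"/>
    </MiningSchema>
  </MiningModel>
</PMML>"""


def invalid_value_treatments(pmml_text):
    """Map each MiningField name to its invalidValueTreatment setting."""
    root = ET.fromstring(pmml_text)
    result = {}
    for element in root.iter():
        # Drop the "{namespace}" prefix so that any PMML schema version matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "MiningField":
            result[element.get("name")] = element.get(
                "invalidValueTreatment", "returnInvalid"
            )
    return result


treatments = invalid_value_treatments(PMML_SNIPPET)
print(treatments)
```

Any field reported as "returnInvalid" would still need the manual NaN -> null replacement on the application side.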

ogrygorian-clgx commented 4 years ago

@vruusmann I've changed all NaN values to null and results after evaluation are the same as after SparkMl prediction. Thank you very much for the suggestion. One more question, not related to jpmml-lightgbm: I was trying to use jpmml-transpiler library, but it didn't work for me. I think it was because of the size of my models' file. It is 2-3 MB. Is there any way to use jpmml-transpiler in this case? Or should I create an issue in the jpmml-transpiler repo?

vruusmann commented 4 years ago

I've changed all NaN values to null and results after evaluation are the same as after SparkMl prediction.

That's what I suspected. Anyway, this is a rather common problem (with no good default workaround mechanism on the (J)PMML side), so it's nice to have everything documented here - can point other people to this thread in the future.

I was trying to use jpmml-transpiler library, but it didn't work for me. I think it was because of the size of my models' file.

Yes, you should open a separate issue on the JPMML-Transpiler project about that. The command-line application doesn't have any restrictions (it was only at the REST web service level). Anyway, paste your exception stack trace there, and I'll explain what's going on.