jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

Inconsistent predicted results between Python and Java #28

Closed: ZhejunWu closed this issue 4 years ago

ZhejunWu commented 4 years ago

We built a binary classification model in Python using lightgbm.train() (https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/engine.py#L18) and saved the .pmml file for the Java side to use. Our model includes both continuous and categorical features. It seems the predicted score is different for the same feature input in Python and Java.
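For reference, a minimal sketch of this kind of workflow (parameter values, variable names, file names and the converter invocation are illustrative, not our actual ones; X, y and cat_cols stand for the training matrix, labels and categorical column names):

import lightgbm

params = {"objective" : "binary"}
dataset = lightgbm.Dataset(X, label = y, categorical_feature = cat_cols)
booster = lightgbm.train(params, dataset)
# Save the model in LightGBM's native text format
booster.save_model("lightgbm.txt")

The text dump is then converted to PMML with the JPMML-LightGBM command-line converter (jar name and version are placeholders):

$ java -jar jpmml-lightgbm-executable-1.X-SNAPSHOT.jar --lgbm-input lightgbm.txt --pmml-output model.pmml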

On the Python side, predictions come from model.predict() (https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L473). On the Java side, the predicted score comes from getProbability() (https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/ProbabilityDistribution.java#L35).

The predicted probabilities for the same target label are very different. For example, the Python score is 0.003362 and the Java score is 0.09655096497772561.

However, if we only include numerical features, the predicted scores are exactly the same.

This is how we cast the categorical features in Python:

df['col'] = df['col'].astype('category')

And this is how a categorical feature looks in the PMML file:

<DataField name="col" optype="categorical" dataType="string">
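For context, a small self-contained example of what that cast does with missing values (column name and values are illustrative):

import pandas

df = pandas.DataFrame({"col" : ["a", None, "b", "a"]})
df["col"] = df["col"].astype("category")
# pandas stores a missing value in a categorical column as code -1
print(df["col"].cat.codes)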

Please advise if there's anything we can do to make the predicted scores consistent when including the categorical features. Thanks!

vruusmann commented 4 years ago

Does your dataset include missing values? If it does, then you should double-check that they are correctly/consistently encoded.

You appear to be using the low-level LightGBM API (lightgbm.train(..)). Have you tested the same dataset with the high-level API (LGBMClassifier.fit(..) or LGBMRegressor.fit(..))? The high-level API captures feature definitions automatically.
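For example, a minimal sketch of the high-level API (assuming X is a pandas DataFrame whose categorical columns already have the category dtype):

from lightgbm import LGBMClassifier

classifier = LGBMClassifier(objective = "binary")
# With the default categorical_feature = "auto", category-typed
# DataFrame columns are treated as categorical features automatically
classifier.fit(X, y)
proba = classifier.predict_proba(X)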

There's a longer write-up available at: https://openscoring.io/blog/2019/04/07/converting_sklearn_lightgbm_pipeline_pmml/

vruusmann commented 4 years ago

Probably related to https://github.com/jpmml/jpmml-lightgbm/issues/27

I really can't assist/advise in either case, because there's not enough information - I need to see the data and the model object to pinpoint the exact issue/misunderstanding.

Both JPMML-LightGBM and JPMML-SkLearn projects include extensive integration tests in this area, so I'm very, very sure that my software is doing the correct job. You're just giving it the wrong instructions.

ZhejunWu commented 4 years ago

Our dataset includes missing values in both continuous and categorical features.

We've tried to use LGBMClassifier.fit(..) with the same model parameter settings, but the predicted probability seems different from lightgbm.train(..). Is that expected?

Our dataset and the model pmml were sent via email.

Any suggestions would be greatly appreciated!

vruusmann commented 4 years ago

> Here's our dataset and the model pmml:

Doesn't seem to be accessible for me.

You may send it to my e-mail (a 100-row dataset would be totally sufficient for testing out the mechanics of feature specification).

vruusmann commented 4 years ago

@ZhejunWu Got your e-mail with sample data. Here's what I found/did.

Your dataset contains three types of columns - int/categorical, bool/categorical and float/continuous. You should encode the former two using [CategoricalDomain(), PMMLLabelEncoder()] and the latter one using [ContinuousDomain()].

While composing the DataFrameMapper instance, move those two categorical column groups "to the front" so that their indices fall in the range [0, len(int_columns) + len(bool_columns)). This way it's easy to pass their indices to the Pipeline.fit(X, y, **fit_params) method as the categorical_feature fit parameter.

My full Python 3.7 code:

from lightgbm import LGBMClassifier
from pandas import DataFrame
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder

import pandas

df = pandas.read_csv("training_data_subset.csv")
print(df.dtypes)

int_columns = list()
bool_columns = list()
float_columns = list()

columns = list(df.columns)
columns.remove("label")

for column in columns:
    dtype = df[column].dtype
    if dtype == int:
        int_columns.append(column)
    elif dtype == bool:
        bool_columns.append(column)
    elif dtype == float:
        float_columns.append(column)
    else:
        raise ValueError(dtype)

print(int_columns)
print(bool_columns)
print(float_columns)

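# Categorical column groups (int, bool) first, then continuous (float) columns;
# this ordering determines the feature indices that LightGBM sees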
mapper = DataFrameMapper(
    [([int_column], [CategoricalDomain(with_data = False, missing_values = None), PMMLLabelEncoder()]) for int_column in int_columns] +
    [([bool_column], [CategoricalDomain(with_data = False, missing_values = None), PMMLLabelEncoder()]) for bool_column in bool_columns] +
    [([float_column], [ContinuousDomain(with_data = False, missing_values = float("NaN"))]) for float_column in float_columns]
)
classifier = LGBMClassifier(objective = "binary", n_estimators = 71)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
])
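# The first len(int_columns) + len(bool_columns) features are categorical;
# their positional indices are passed below as the categorical_feature fit parameter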
pipeline.fit(df, df["label"], classifier__categorical_feature = range(0, len(int_columns) + len(bool_columns)))

label = DataFrame(pipeline.predict(df), columns = ["label"])
label_proba = DataFrame(pipeline.predict_proba(df), columns = ["probability(0)", "probability(1)"])

out = pandas.concat((label, label_proba), axis = 1)
out.to_csv("training_data_subset_predictions.csv", index = False)

sklearn2pmml(pipeline, "pipeline.pmml")

Later on I make predictions using the JPMML-Evaluator command-line application:

$ java -jar ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/pmml-evaluator-example-executable-1.4-SNAPSHOT.jar --model pipeline.pmml --input training_data_subset.csv --output output.csv --copy-columns false --missing-values ""

The predictions between Python (file "training_data_subset_predictions.csv") and Java (file "output.csv") are 100% in agreement (to the full precision of the 64-bit floating point number data type) for the entire training set.
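A quick way to check the agreement programmatically (a sketch; the column names in the Java output file are assumed here to match the PMML output field names):

import pandas

py_pred = pandas.read_csv("training_data_subset_predictions.csv")
java_pred = pandas.read_csv("output.csv")

# Bitwise-exact comparison of the probability columns
for column in ["probability(0)", "probability(1)"]:
    assert (py_pred[column] == java_pred[column]).all()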