Closed ZhejunWu closed 4 years ago
Does your dataset include missing values? If it does, then you should double-check that they are correctly/consistently encoded.
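For instance, a quick way to normalize missing-value encodings with pandas before training (the column values and missing-value markers here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical raw column where missing values arrive under several encodings
s = pd.Series(["1.5", "", "NA", "-999", "2.0"])

# Normalize every missing-value marker to a single NaN representation
s = s.replace({"": np.nan, "NA": np.nan, "-999": np.nan}).astype(float)
print(s.isna().sum())  # 3
```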
You appear to be using the low-level LightGBM API (`lightgbm.train(..)`). Have you tested the same dataset with the high-level API (`LGBMClassifier.fit(..)` and `LGBMRegressor.fit(..)`)? The high-level API captures feature definitions automatically.
There's a longer write-up available at: https://openscoring.io/blog/2019/04/07/converting_sklearn_lightgbm_pipeline_pmml/
Probably related to https://github.com/jpmml/jpmml-lightgbm/issues/27
I really can't assist/advise in either case, because there's not enough information - I need to see the data and the model object to pinpoint the exact issue/misunderstanding.
Both JPMML-LightGBM and JPMML-SkLearn projects include extensive integration tests in this area, so I'm very-very sure that my software is doing the correct job. You're just giving it wrong instructions.
Our dataset includes missing values in both continuous and categorical features.
We've tried `LGBMClassifier.fit(..)` with the same model parameter settings, but the predicted probabilities seem to differ from `lightgbm.train(..)`. Is that expected?
Our dataset and the model pmml were sent via email.
Any suggestions would be greatly appreciated!
Here's our dataset and the model pmml:
Doesn't seem to be accessible for me.
You may send it to my e-mail (a 100 row dataset would be totally sufficient for testing out the mechanics of feature specification).
@ZhejunWu Got your e-mail with sample data. Here's what I found/did.
Your dataset contains three types of columns - int/categorical, bool/categorical and float/continuous. You should encode the former two using `[CategoricalDomain(), PMMLLabelEncoder()]`, and the latter using `[ContinuousDomain()]`.
While composing the `DataFrameMapper` instance, move those two categorical column groups "to the front" so that their indices occupy the range `[0, len(int_columns) + len(bool_columns))`. This way it's easy to pass their indices to the `Pipeline.fit(X, y, **fit_params)` method as the `categorical_feature` fit parameter.
My full Python 3.7 code:
from lightgbm import LGBMClassifier
from pandas import DataFrame
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder

import pandas

df = pandas.read_csv("training_data_subset.csv")
print(df.dtypes)

# Partition feature columns by dtype:
# int and bool columns are categorical, float columns are continuous
int_columns = list()
bool_columns = list()
float_columns = list()

columns = list(df.columns)
columns.remove("label")

for column in columns:
    dtype = df[column].dtype
    if dtype == int:
        int_columns.append(column)
    elif dtype == bool:
        bool_columns.append(column)
    elif dtype == float:
        float_columns.append(column)
    else:
        raise ValueError(dtype)

print(int_columns)
print(bool_columns)
print(float_columns)

# Categorical column groups come first, so that their indices run
# from 0 to len(int_columns) + len(bool_columns) - 1
mapper = DataFrameMapper(
    [([int_column], [CategoricalDomain(with_data = False, missing_values = None), PMMLLabelEncoder()]) for int_column in int_columns] +
    [([bool_column], [CategoricalDomain(with_data = False, missing_values = None), PMMLLabelEncoder()]) for bool_column in bool_columns] +
    [([float_column], [ContinuousDomain(with_data = False, missing_values = float("NaN"))]) for float_column in float_columns]
)

classifier = LGBMClassifier(objective = "binary", n_estimators = 71)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
])
# Pass the indices of the categorical columns as the `categorical_feature` fit parameter
pipeline.fit(df, df["label"], classifier__categorical_feature = range(0, len(int_columns) + len(bool_columns)))

label = DataFrame(pipeline.predict(df), columns = ["label"])
label_proba = DataFrame(pipeline.predict_proba(df), columns = ["probability(0)", "probability(1)"])

out = pandas.concat((label, label_proba), axis = 1)
out.to_csv("training_data_subset_predictions.csv", index = False)

sklearn2pmml(pipeline, "pipeline.pmml")
Later on I make predictions using the JPMML-Evaluator command-line application:
$ java -jar ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/pmml-evaluator-example-executable-1.4-SNAPSHOT.jar --model pipeline.pmml --input training_data_subset.csv --output output.csv --copy-columns false --missing-values ""
The predictions between Python (file "training_data_subset_predictions.csv") and Java (file "output.csv") are 100% in agreement (to the full precision of the 64-bit floating point number data type) for the entire training set.
We built a binary classification model in Python using `lightgbm.train()` (https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/engine.py#L18) and saved the `.pmml` file for the Java side to use. Our model includes both continuous and categorical features. It seems the predicted score is different for the same feature input in Python and Java. The Python-side prediction uses `model.predict()` (https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L473), and the Java-side predicted score comes from `getProbability()` (https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/ProbabilityDistribution.java#L35). The predicted probabilities for the same target label are very different. For example, the Python score is 0.003362 and the Java score is 0.09655096497772561.
But if we only include numerical features, the predicted scores are exactly the same.
Here's how we cast the categorical features in Python:
df['col'] = df['col'].astype('category')
And here's how a categorical feature looks in the PMML:
<DataField name="col" optype="categorical" dataType="string">
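For reference, pandas' `category` dtype stores integer codes internally, and LightGBM trains on those codes, while the PMML side sees the original string values. A minimal sketch (with made-up data) of what LightGBM actually receives:

```python
import pandas as pd

# Hypothetical column with a missing value, cast the same way as above
s = pd.Series(["a", "b", None, "a"]).astype("category")

# LightGBM consumes the integer category codes, not the original strings;
# missing values are encoded as code -1
print(list(s.cat.codes))  # [0, 1, -1, 0]
```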
Please advise if there's anything we can do to make the predicted scores consistent when including the categorical features. Thanks!