Support lightgbm (boosting_type="rf") ?

OnlyFor commented 4 years ago

Here is my code,

import numpy as np
import pandas as pd
import lightgbm as lgb  # version 2.3.1
from sklearn2pmml import sklearn2pmml, make_pmml_pipeline # 0.52.0

df_ = pd.DataFrame({"aaaaaaaaaaaaaaaaaa": np.random.rand(10000)})
for i in range(20):
    df_["var_" + str(i)] = np.random.rand(10000)
for i in range(30, 100):
    df_["var_" + str(i)] = np.random.randint(0, 20, 10000)

df_.iloc[-2000:] = np.nan  # blank out the last 2000 rows to exercise missing-value handling
df_["target"] = np.random.randint(0, 2, 10000)

y = df_["target"]
X = df_.drop("target", axis=1)

model1 = lgb.sklearn.LGBMClassifier(
    **{
        "boosting_type": "gbdt",
        "max_depth": 3,
        "learning_rate": 0.05,
        "n_estimators": 10,
        # "bagging_fraction": 0.8,
        # "bagging_freq": 1,
        # "subsample": 0.8,
        # "subsample_freq": 1,
    }
)
model2 = lgb.sklearn.LGBMClassifier(
    **{
        "boosting_type": "rf",
        "max_depth": 3,
        "learning_rate": 0.05,
        "n_estimators": 10,
        "bagging_fraction": 0.8,
        "bagging_freq": 1,
        "subsample": 0.8,
        "subsample_freq": 1,
    }
)

model1.fit(X, y)
model2.fit(X, y)

df_["model1_p1"] = model1.predict_proba(X)[:, 1]
df_["model2_p1"] = model2.predict_proba(X)[:, 1]

df_.to_csv("input.csv", index=False, encoding="utf-8")

sklearn2pmml(make_pmml_pipeline(
    model1, active_fields=X.columns.tolist(), target_fields="target"), "model1.pmml")
sklearn2pmml(make_pmml_pipeline(
    model2, active_fields=X.columns.tolist(), target_fields="target"), "model2.pmml")

java -cp pmml-evaluator-example-executable-1.4.12.jar org.jpmml.evaluator.EvaluationExample --model model1.pmml --input input.csv --output output1.csv --missing-values "" --separator ","

probability(1) == model1_p1

java -cp pmml-evaluator-example-executable-1.4.12.jar org.jpmml.evaluator.EvaluationExample --model model2.pmml --input input.csv --output output2.csv --missing-values "" --separator ","

probability(1) != model2_p1 :( ???
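
The comparison above can also be scripted. This is a minimal sketch, assuming the evaluator writes its result to a column named probability(1) (as reported above) and keeps the rows in input order. The helper name count_mismatches and the tolerance are illustrative and not part of any library:

```python
import csv
import math


def count_mismatches(expected_lines, actual_lines, expected_col, actual_col, tol=1e-6):
    """Count rows whose two probability columns differ by more than tol.

    Both arguments are iterables of CSV lines, e.g. open file objects.
    """
    expected = list(csv.DictReader(expected_lines))
    actual = list(csv.DictReader(actual_lines))
    if len(expected) != len(actual):
        raise ValueError("row counts differ: %d vs %d" % (len(expected), len(actual)))
    mismatches = 0
    for exp_row, act_row in zip(expected, actual):
        exp_p = float(exp_row[expected_col])
        act_p = float(act_row[actual_col])
        # Absolute tolerance only, since probabilities live in [0, 1].
        if not math.isclose(exp_p, act_p, rel_tol=0.0, abs_tol=tol):
            mismatches += 1
    return mismatches
```

Calling it with input.csv / "model2_p1" and output2.csv / "probability(1)" returns the number of rows where the PMML prediction disagrees with the Python one.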

vruusmann commented 4 years ago

IIRC, the JPMML-LightGBM library does not check the value of the boosting_type attribute.

Therefore, it encodes the "gbdt" and "rf" boosting types identically, following the "gbdt" procedure. Based on the evidence above, the converter needs to detect the "rf" boosting type and encode it differently.
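
To illustrate why identical encoding can give different probabilities, here is a toy sketch. It assumes that a boosted ensemble sums the per-tree raw scores while a random forest averages them, and that a sigmoid link is applied afterwards. The tree scores are made up, and the function names are hypothetical rather than part of LightGBM or JPMML:

```python
import math


def sigmoid(x):
    """Logistic link mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_sum(tree_scores):
    """Assumed "gbdt"-style aggregation: add up the per-tree raw scores."""
    return sigmoid(sum(tree_scores))


def predict_average(tree_scores):
    """Assumed "rf"-style aggregation: average the per-tree raw scores."""
    return sigmoid(sum(tree_scores) / len(tree_scores))


# Made-up raw scores from three trees, for illustration only.
tree_scores = [0.2, -0.1, 0.4]
print(predict_sum(tree_scores), predict_average(tree_scores))
```

With the same trees, the two aggregations generally produce different probabilities, which is consistent with the mismatch reported for model2.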

OnlyFor commented 4 years ago

Thanks.

The available boosting_type values are documented at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html

boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.

vruusmann commented 4 years ago

@OnlyFor Open model2.pmml in a text editor, and on line 143 change the value of the Segmentation@multipleModelMethod attribute from sum (gbdt) to average (rf).

Then you will get correct RF predictions.
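
The manual edit above could also be scripted. This is a minimal sketch using only the Python standard library. The function name patch_rf_aggregation is hypothetical, and the sketch assumes that every Segmentation element with multipleModelMethod="sum" in the file belongs to the tree ensemble and should be switched to "average":

```python
import xml.etree.ElementTree as ET


def patch_rf_aggregation(pmml_text):
    """Return PMML text with Segmentation@multipleModelMethod changed from sum to average."""
    root = ET.fromstring(pmml_text)
    # Keep the PMML default namespace unprefixed when serializing.
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
        ET.register_namespace("", namespace)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "Segmentation" and elem.get("multipleModelMethod") == "sum":
            elem.set("multipleModelMethod", "average")
    return ET.tostring(root, encoding="unicode")
```

To use it, read model2.pmml into a string, pass it through the function, and write the result to a new file. Re-serializing through ElementTree drops the XML declaration and any comments, so check the output before feeding it to the evaluator.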

OnlyFor commented 4 years ago

@vruusmann it works ! thx !

vruusmann commented 4 years ago

it works!

I just made this comment to show that the fix for the "rf" boosting type is really simple. I will probably implement it in code later this week.

jpmml / jpmml-lightgbm

Support lightgbm (boosting_type="rf") ? #32