microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Some values are missing when saving a LightGBM model to a text file #1983

Closed sakura7621 closed 5 years ago

sakura7621 commented 5 years ago

Hi, I have a problem with saving a LightGBM model to a text file.

My goal: after training a model in Python, I want to save it to a text file and then use JPMML to convert the text file to PMML for a Java application.

When saving the model to a text file, I noticed that for some categorical features the text file is missing some values. For example, a feature in the training data can take the values {1, 2, 3, 4, 5, 6}, but in the feature_infos section of the text file it appears as {1, 2, 3, 5, 6} (missing {4}). I don't know why this happens. I tried two LightGBM models, the sklearn wrapper and the standalone one, but neither text file contains {4}.

My questions are: 1. Is this a bug in LightGBM? 2. Is feature_infos important? Does it have any impact on the model if I change it manually?

I have provided the code (Jupyter notebook) and the data here.

Thank you.

guolinke commented 5 years ago

To improve the generalization ability, some categories with little data will be ignored in the model. This is a feature, not a bug.

sakura7621 commented 5 years ago

@guolinke Thanks for the quick response. Is there any way to save the text file with all of the values? I need the full information before converting the text file to PMML. As for the second question: if there is no official way to keep all the values, is it safe to manually change the values in the feature_infos section of the text file, for example manually adding the value "4" to the occupation feature in the example? Thanks.

vruusmann commented 5 years ago

This issue describes a workflow Python/Scikit-Learn -> LGBM text file -> JPMML-LightGBM, which relies on model "data schema" information as stored in the LGBM text file.

A much better workflow would be Python/Scikit-Learn -> SkLearn2PMML, which gets the model "data schema" straight from the Scikit-Learn pipeline. So, even if the LightGBM algorithm decides that some category levels are insignificant and does not store them in the intermediate LGBM text file, the SkLearn2PMML converter still knows about them!

Example workflow:

from lightgbm import LGBMRegressor
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    ("my_categorical_column", [CategoricalDomain(), LabelBinarizer()])
  ])),
  ("regressor", LGBMRegressor())
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "pipeline.pmml")

sakura7621 commented 5 years ago

@vruusmann: Thank you so much for your advice. It really helped me a lot. However, I still have a problem with the new method. I applied your suggestion to the example (at the end of the file).

I have copied some pieces of the code below. When I calculate the log_loss error for the pipeline model, it returns a different result than the normal way. I'm wondering how we can check that the model created by the pipeline is the same as LightGBM's. The second thing I need your help with is why the two models return different log_loss errors. What am I missing here? Thanks a lot.

import lightgbm as lgb
from sklearn.metrics import log_loss
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('Employment', CategoricalDomain()),
    ('Education', CategoricalDomain()),
    ('Marital', CategoricalDomain()),
    ('Occupation', CategoricalDomain()),
    ('Gender', CategoricalDomain()),
    ('Deductions', CategoricalDomain()),
    (['Hours', 'Income', 'Age'], ContinuousDomain(with_data = False))
])

classifier = lgb.LGBMClassifier(n_estimators=2, learning_rate=0.1, num_leaves=10, max_depth=2)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
])

pipeline.fit(X = X_train, y = y_train)
sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")

## compare the log_loss:
log_loss(y_train, pipeline.predict_proba(X_train))    # 0.509684592242956
log_loss(y_train, lgb_sklearn_model.predict_proba(X_train))    # 0.4975041326312271

vruusmann commented 5 years ago

@chaupmcs I had forgotten about this, but you also need to suppress LGBM's default "categorical feature auto-detection" algorithm by supplying the list of categorical column indices as the categorical_feature fit parameter.

So, my above Python code example should really look like this:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([ .. ])),
  ("regressor", LGBMRegressor())
])
# THIS: specify '<estimator step name>__categorical_feature' kwarg
pipeline.fit(X, y, regressor__categorical_feature = [0])

Complete example here: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L206-L226

Pay attention to this line: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L226
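The regressor__categorical_feature syntax is scikit-learn's generic fit-parameter routing, where a <step name>__<param name> keyword passed to Pipeline.fit is forwarded to that step's fit method; it is not LightGBM-specific. A minimal sketch with a plain scikit-learn Pipeline (the step names and data here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", SGDClassifier(random_state=0))])

# 'clf__sample_weight' is routed to SGDClassifier.fit(sample_weight=...);
# the same mechanism forwards 'regressor__categorical_feature' to
# LGBMRegressor.fit in the example above.
pipe.fit(X, y, clf__sample_weight=np.ones(len(y)))
print(pipe.predict(X))
```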

sakura7621 commented 5 years ago

@vruusmann Thanks for pointing out the problem. Now the pipeline model returns the same log_loss error as the LightGBM model does. But I cannot save the model to PMML format anymore.

params = {'classifier__categorical_feature': [2, 3, 4, 5]}    # add this line
pipeline.fit(X = X_train, y = y_train, **params)              # pass the params into fit()
pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name)    # add this line
pipeline.predict_proba(X_train)    # ok, no errors, returns the same as the LightGBM model

sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")    # Error here

In the link you gave me, I see you're using store_pkl to dump to pickle files; there is no example for the PMML format. I don't know where the error comes from. Please help me fix it. Thank you!

---- The error is below (the full example code and result are here)

Standard output is empty
Standard error:
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 30 ms.
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.jpmml.lightgbm.Tree.selectValues(Tree.java:240)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:151)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:186)
	at org.jpmml.lightgbm.Tree.encodeTreeModel(Tree.java:94)
	at org.jpmml.lightgbm.ObjectiveFunction.createMiningModel(ObjectiveFunction.java:66)
	at org.jpmml.lightgbm.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:49)
	at org.jpmml.lightgbm.GBDT.encodeMiningModel(GBDT.java:287)
	at lightgbm.sklearn.BoosterUtil.encodeModel(BoosterUtil.java:58)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:39)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:26)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.jpmml.lightgbm.Tree.selectValues(Tree.java:240)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:151)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:186)
	at org.jpmml.lightgbm.Tree.encodeTreeModel(Tree.java:94)
	at org.jpmml.lightgbm.ObjectiveFunction.createMiningModel(ObjectiveFunction.java:66)
	at org.jpmml.lightgbm.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:49)
	at org.jpmml.lightgbm.GBDT.encodeMiningModel(GBDT.java:287)
	at lightgbm.sklearn.BoosterUtil.encodeModel(BoosterUtil.java:58)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:39)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:26)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)


RuntimeError                              Traceback (most recent call last)
 in ()
     28 pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name)
     29
---> 30 sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")
     31
     32 pipeline.predict_proba(X_train)

~/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
    244         print("Standard error is empty")
    245     if retcode:
--> 246         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    247 finally:
    248     if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

vruusmann commented 5 years ago

@chaupmcs It's probably this line, which is completely unnecessary: pipeline = make_pmml_pipeline(pipeline)

Anyway, this discussion is getting off-topic, and we should move it to JPMML's "namespace" instead.

sakura7621 commented 5 years ago

@vruusmann Thank you for the advice. I tried commenting out the line, but the error still occurred. I will close this topic and open a new one in JPMML. Thanks again 👍