To improve generalization ability, categories with very little data are ignored by the model. This is a feature, not a bug.
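For what it's worth, a minimal sketch of the tuning knobs that appear to control this behaviour (min_data_per_group and cat_smooth are LightGBM parameters; their exact effect as described here is an assumption to verify against the docs):

import lightgbm as lgb

# Assumption: lowering min_data_per_group (default 100) lets LightGBM keep
# rarer category levels as split candidates; cat_smooth regularizes them.
classifier = lgb.LGBMClassifier(min_data_per_group = 1, cat_smooth = 1.0)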
@guolinke Thanks for the quick response.
Is there any way to save the text file with all of the values? I need the full information before converting the text file to pmml.
As for the second question: if there is no official way to keep all the values, is it safe to manually change the values in the feature_infos section of the text file, e.g. to manually add the value "4" to the occupation feature in the example?
Thanks.
This issue describes a workflow Python/Scikit-Learn -> LGBM text file -> JPMML-LightGBM, which relies on the model "data schema" information as stored in the LGBM text file.
A much better workflow would be Python/Scikit-Learn -> SkLearn2PMML, which gets the model "data schema" straight from the Scikit-Learn pipeline. So, even if the LightGBM algorithm decides that some category levels are insignificant and does not store them in the intermediate LGBM text file, the SkLearn2PMML converter still knows about them!
Example workflow:
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from lightgbm import LGBMRegressor

pipeline = PMMLPipeline([
	("mapper", DataFrameMapper([
		("my_categorical_column", [CategoricalDomain(), LabelBinarizer()])
	])),
	("regressor", LGBMRegressor())
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "pipeline.pmml")
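The reason this works: CategoricalDomain records the complete set of category levels it sees during fit(), and the converter writes them all into the PMML data dictionary. A small sketch (the invalid_value_treatment argument is recalled from the SkLearn2PMML documentation, so verify it against your installed version):

from sklearn2pmml.decoration import CategoricalDomain

# Capture all levels seen at fit time; levels that are unseen at prediction
# time are then flagged as invalid rather than silently dropped.
domain = CategoricalDomain(invalid_value_treatment = "return_invalid")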
@vruusmann: Thank you so much for your advice. It really helped me a lot. However, I still have a problem with the new method. I applied your suggestion to the example (at the end of the file).
I copy some pieces of code below. When I calculate the log_loss error for the pipeline model, it returns a different result compared to the normal way. I'm wondering how we can check the model created by the pipeline to make sure it yields the same model as lightgbm's? The second thing I need your help with is why the two models return different log_loss errors. What am I missing here? Thanks a lot.
import lightgbm as lgb

from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

mapper = DataFrameMapper([
	('Employment', CategoricalDomain()),
	('Education', CategoricalDomain()),
	('Marital', CategoricalDomain()),
	('Occupation', CategoricalDomain()),
	('Gender', CategoricalDomain()),
	('Deductions', CategoricalDomain()),
	(['Hours', 'Income', 'Age'], ContinuousDomain(with_data = False))
])

classifier = lgb.LGBMClassifier(n_estimators = 2, learning_rate = 0.1, num_leaves = 10, max_depth = 2)

pipeline = PMMLPipeline([
	("mapper", mapper),
	("classifier", classifier)
])
pipeline.fit(X = X_train, y = y_train)

sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")
## compare the log_loss:
log_loss(y_train, pipeline.predict_proba(X_train)) # 0.509684592242956
log_loss(y_train, lgb_sklearn_model.predict_proba(X_train)) # 0.4975041326312271
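A direct way to check whether the two models agree would be something like this (a sketch; it assumes pipeline, lgb_sklearn_model and X_train from the snippets above are in scope):

import numpy

# If the two objects wrap equivalent boosters, the probability matrices should match
print(numpy.allclose(pipeline.predict_proba(X_train), lgb_sklearn_model.predict_proba(X_train)))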
@chaupmcs Had forgotten about this, but you also need to suppress LGBM's default "categorical feature auto-detection" algorithm by supplying the list of categorical column indices via the categorical_feature fit parameter.
So, my above Python code example should really look like this:
pipeline = PMMLPipeline([
	("mapper", DataFrameMapper([ .. ])),
	("regressor", LGBMRegressor())
])
# THIS: specify the '<estimator step name>__categorical_feature' kwarg
pipeline.fit(X, y, regressor__categorical_feature = [0])
Complete example here: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L206-L226
Pay attention to this line: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L226
@vruusmann Thanks for pointing out the problem.
Now the pipeline model returns the same log_loss error as the lightgbm model does.
However, I can no longer save the model to pmml format.
params = {'classifier__categorical_feature': [2, 3, 4, 5]} # add this line
pipeline.fit(X = X_train, y = y_train, **params) # pass the params into fit()
pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name) # add this line
pipeline.predict_proba(X_train) # OK, no errors; returns the same as the lightgbm model
sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml") # error here
In the link you gave me, I see you're using store_pkl to dump pickle files; there is no example for the pmml format. I don't know where the error comes from. Please help me fix it. Thank you!
---- The error is below (the full example code and result are here)
Standard output is empty
Standard error:
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 30 ms.
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jan 31, 2019 5:58:04 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at org.jpmml.lightgbm.Tree.selectValues(Tree.java:240)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:151)
	at org.jpmml.lightgbm.Tree.encodeNode(Tree.java:186)
	at org.jpmml.lightgbm.Tree.encodeTreeModel(Tree.java:94)
	at org.jpmml.lightgbm.ObjectiveFunction.createMiningModel(ObjectiveFunction.java:66)
	at org.jpmml.lightgbm.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:49)
	at org.jpmml.lightgbm.GBDT.encodeMiningModel(GBDT.java:287)
	at lightgbm.sklearn.BoosterUtil.encodeModel(BoosterUtil.java:58)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:39)
	at lightgbm.sklearn.LGBMClassifier.encodeModel(LGBMClassifier.java:26)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
	at org.jpmml.sklearn.Main.run(Main.java:145)
	at org.jpmml.sklearn.Main.main(Main.java:94)
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 13, Size: 13
	(same stack trace as above)
RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>()
     28 pipeline = make_pmml_pipeline(pipeline, X_train.columns.values, y_train.name)
     29
---> 30 sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")
     31
     32 pipeline.predict_proba(X_train)

~/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
    244         print("Standard error is empty")
    245     if retcode:
--> 246         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    247 finally:
    248     if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams
@chaupmcs It's probably this line, which is completely unnecessary: pipeline = make_pmml_pipeline(pipeline)
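In other words, something like this should suffice (a sketch; since pipeline was already constructed as a PMMLPipeline, no re-wrapping is needed):

# 'pipeline' is already a PMMLPipeline, so it can be converted directly,
# without passing it through make_pmml_pipeline() first.
pipeline.fit(X = X_train, y = y_train, classifier__categorical_feature = [2, 3, 4, 5])
sklearn2pmml(pipeline, base_link + "/pipeline_2.pmml")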
Anyway, this discussion is getting off-topic, and we should move it to JPMML's "namespace" instead.
@vruusmann Thank you for the advice. I tried commenting out the line, but the error still occurred. I will close this topic and open a new one in JPMML. Thanks again 👍
Hi, I have a problem with saving an lgb model to a text file.
My purpose: after using Python to train a model, I want to save it to a text file, and then use jpmml to convert the text file to pmml (for a Java application). When saving the model to a text file, I realized that for some categorical features the text file lacks some values. For example, a feature in the training data can take the values {1, 2, 3, 4, 5, 6}, but in the text file's feature_infos section it is just {1, 2, 3, 5, 6} (missing {4}). I don't know why this happens. I use 2 lgb models, the sklearn and standalone ones, but neither txt file seems to contain {4}. A sketch of the two save paths I used is below.
My questions are: 1. Is it a bug in lgb? 2. Is feature_infos important? Does it have any impact on the model if I manually change it?
I have provided the code (a Jupyter notebook) and the data here.
Thank you.