jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

Error converting model output txt to PMML #27

Closed TGalaxy closed 4 years ago

TGalaxy commented 4 years ago

I got the following error when converting the model txt file to PMML:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 122, Size: 1
    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.jpmml.lightgbm.GBDT.encodeSchema(GBDT.java:233)
    at org.jpmml.lightgbm.GBDT.encodePMML(GBDT.java:384)
    at org.jpmml.lightgbm.Main.run(Main.java:132)
    at org.jpmml.lightgbm.Main.main(Main.java:118)

For security reasons, I couldn't attach the model txt file. But could you explain what the error means? Trying to see if I can give you a toy example.

vruusmann commented 4 years ago

But could you explain what the error means?

It means that your LightGBM model text file is internally inconsistent - there is a hint that some attribute should contain at least 123 elements, but the parser only finds a single element.

As the exception happens during schema parsing, I believe there's something wrong with the specification of categorical columns.
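For local debugging, one place to start is the header of the model text file. The following is a minimal sketch, assuming the header consists of `key=value` lines in which `feature_names` and `feature_infos` are space-separated and `max_feature_idx` is the highest zero-based feature index. The function name `check_header` is hypothetical and is not part of LightGBM or JPMML-LightGBM; it only compares the declared counts and does not reproduce what the converter does at `GBDT.java:233`.

```python
def check_header(model_text):
    """Compare the feature counts declared in a LightGBM model text header."""
    header = {}
    for line in model_text.splitlines():
        # Assumption: the header ends where the first tree definition starts.
        if line.startswith("Tree="):
            break
        key, sep, value = line.partition("=")
        if sep:
            header[key.strip()] = value.strip()

    # max_feature_idx is zero-based, so the expected count is one more.
    expected = int(header["max_feature_idx"]) + 1
    names = header.get("feature_names", "").split()
    infos = header.get("feature_infos", "").split()
    return {
        "expected": expected,
        "feature_names": len(names),
        "feature_infos": len(infos),
        "consistent": expected == len(names) == len(infos),
    }
```

Passing the contents of `lgbm.txt` to this function gives three counts to compare. If they disagree, the header itself is the inconsistent part; if they agree, the mismatch probably lies elsewhere in the file.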

For security reasons, I couldn't attach the model txt file

Then you need to debug this issue locally.

Trying to see if I can give you a toy example

Keeping this issue open for a couple of days. If I don't see a reproducible example during that timeframe, then I'll close it as "invalid".

vruusmann commented 4 years ago

As the exception happens during schema parsing, I believe there's something wrong with the specification of categorical columns.

One shouldn't be working with LightGBM model text files directly.

I believe this exception would be avoided if you interacted with LightGBM using some high-level framework such as Scikit-Learn, which takes care of feature engineering and specification needs.

See https://openscoring.io/blog/2019/04/07/converting_sklearn_lightgbm_pipeline_pmml/

TGalaxy commented 4 years ago

Thanks for your reply. I tried to create a toy example by selecting a few features from the original data, including the categorical feature. It converts smoothly. However, the conversion still fails with all the features.

Here is my code:

import lightgbm as lgb

# Wrap the training and validation frames, declaring which columns are categorical.
d_train = lgb.Dataset(train[feature_list], label=train.tag, categorical_feature=categorical_feature)
d_validation = lgb.Dataset(validation[feature_list], label=validation.tag, categorical_feature=categorical_feature)

# Train with early stopping, then save the best iteration as a model text file.
model = lgb.train(params, d_train, valid_sets=d_validation, early_stopping_rounds=50, verbose_eval=100)
model.save_model('lgbm.txt', num_iteration=model.best_iteration)

I will force all the other features (other than categorical) to be float and run it again.

TGalaxy commented 4 years ago

I forced the categorical features to be of type category and the others to be float64. However, I still got the same error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 119, Size: 69
    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.jpmml.lightgbm.GBDT.encodeSchema(GBDT.java:233)
    at org.jpmml.lightgbm.GBDT.encodePMML(GBDT.java:384)
    at org.jpmml.lightgbm.Main.run(Main.java:132)
    at org.jpmml.lightgbm.Main.main(Main.java:118)

vruusmann commented 4 years ago

Your feature specification code is wrong. However, it's impossible for me to be any more specific, because the posted exception stack traces do not contain enough actionable information.

Closing as invalid/not reproducible.

etveritas commented 4 years ago

@vruusmann Hello, I am also encountering this problem. I counted the pandas_categorical entries and the number is correct, but the conversion still fails with an out-of-bounds error. Where could the extra index come from?
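One way to cross-check that count is to compare the `pandas_categorical` line with the `feature_infos` line of the same model text file. The sketch below assumes that numeric features appear in `feature_infos` as `[min:max]`, that unused features appear as `none`, that any other entry is a colon-separated list of category codes, and that `pandas_categorical:` is followed by a JSON array (or `null`) on a single line. The function name `count_categoricals` is hypothetical.

```python
import json


def count_categoricals(model_text):
    """Return (indices of categorical features, number of pandas_categorical lists)."""
    infos = []
    pandas_categorical = None
    for line in model_text.splitlines():
        if line.startswith("feature_infos="):
            infos = line.split("=", 1)[1].split()
        elif line.startswith("pandas_categorical:"):
            pandas_categorical = json.loads(line.split(":", 1)[1])

    # Assumption: anything that is neither "[min:max]" nor "none" is categorical.
    categorical = [
        index
        for index, info in enumerate(infos)
        if info != "none" and not info.startswith("[")
    ]
    n_pandas = len(pandas_categorical) if pandas_categorical else 0
    return categorical, n_pandas
```

If the two numbers differ, that is only a hint about where to look. A feature listed as `none` cannot be classified as numeric or categorical by this check, so such features are worth inspecting by hand.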