jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML

The `--fmap-input` command-line option may not have effect #74

Closed solutionjh closed 1 month ago

solutionjh commented 4 months ago

jpmml-xgboost version: 1.8.5

XGBoost Version and Model

Use Case

Error Stack Trace

Failed to convert learner to PMML
java.lang.IndexOutOfBoundsException: Index: 40, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
  at java.util.ArrayList.get(ArrayList.java:433)
  at org.jpmml.converter.Schema.getFeature(Schema.java:141)
  at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:285)
  at org.jpmml.xgboost.RegTree.encodeTreeNode(RegTree.java:267)
  at org.jpmml.xgboost.ObjFunction.createMiningModel(ObjFunction.java:155)
  at org.jpmml.xgboost.BinomialLogisticRegression.encodeModel(BinomialLogisticRegression.java:46)
  at org.jpmml.xgboost.Learner.encodeModel(Learner.java:454)
  at org.jpmml.xgboost.Learner.encodeModel(Learner.java:446)
  at org.jpmml.xgboost.Learner.encodePMML(Learner.java:434)
...(My Custom Code)

Debugging result

vruusmann commented 4 months ago

Use Case

  • I did not use a feature map when training the model, so the model.json file contains "feature_names":[],"feature_types":[].
  • The model uses 40 features.

These are two XGBoost model object states:

  1. The model does not have feature_names and feature_types fields defined at all. In JPMML-XGBoost, this state is mapped to feature_names = null and feature_types = null. This state happens with XGBoost 1.0 -- 1.5, if I remember correctly.
  2. These fields are defined, but they do not have any contents. In JPMML-XGBoost, this state is mapped to feature_names = [] and feature_types = [].
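The difference between the two states can be sketched in a few lines of Java. This is only an illustration: the class, enum and method names are hypothetical and are not taken from the JPMML-XGBoost source, and a `List<String>` stands in for whatever type the library actually uses for `feature_names`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FeatureStateDemo {

    // Hypothetical labels for the two states above, plus the fully populated case.
    enum SchemaState { ABSENT, EMPTY, POPULATED }

    static SchemaState classify(List<String> featureNames) {
        if (featureNames == null) {
            return SchemaState.ABSENT;   // field not present in the model file
        }
        if (featureNames.isEmpty()) {
            return SchemaState.EMPTY;    // field present, but "feature_names":[]
        }
        return SchemaState.POPULATED;
    }

    // A null-only test cannot tell EMPTY apart from POPULATED.
    static boolean passesNullOnlyCheck(List<String> featureNames) {
        return featureNames != null;
    }

    public static void main(String[] args) {
        List<String> empty = Collections.emptyList();
        System.out.println(classify(null));
        System.out.println(classify(empty));
        System.out.println(classify(Arrays.asList("f0", "f1")));
        System.out.println(passesNullOnlyCheck(empty));
    }
}
```

An empty list passes the null-only test even though it carries no schema information, which is consistent with the `Size: 0` reported in the stack trace.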

Debugging result

  • Learner.java:355 only checks for null
  • feature_names and feature_types are empty arrays, so they pass this check
  • the --fmap-input command-line option is omitted

In case of incomplete embedded XGBoost model schema information, it is your responsibility to provide it externally, using the --fmap-input command-line option.
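For a model with 40 unnamed features, such an external feature map could be generated along the following lines. This is a minimal sketch under stated assumptions: it assumes the usual three-column, tab-separated feature map layout (index, name, type, with `q` marking a quantitative feature), and the names `f0` to `f39`, the type `q` and the default file name `model.fmap` are placeholders that must be replaced with the real schema. The `FmapWriter` class is hypothetical and not part of JPMML-XGBoost.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class FmapWriter {

    // Builds one line per feature in the form "<index>\t<name>\t<type>".
    static List<String> buildLines(List<String> names, String type) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            lines.add(i + "\t" + names.get(i) + "\t" + type);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder names; the real feature names are not known from the thread.
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            names.add("f" + i);
        }
        // Output path is an assumption; pass a different one as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "model.fmap");
        Files.write(out, buildLines(names, "q"), StandardCharsets.UTF_8);
    }
}
```

The resulting file would then be passed to the converter with the `--fmap-input` option.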

Alternatively, you may edit the XGBoost model file programmatically, and set the values of the feature_names and feature_types fields to a non-null/non-empty state. If I understand you correctly, then there are supposed to be 40 elements in each of them.

JPMML-XGBoost does not make any attempts to "guess" the model schema for you.

vruusmann commented 4 months ago

@solutionjh Please elaborate: what do you expect the JPMML-XGBoost converter to do instead of throwing an IndexOutOfBoundsException (IOOBE)?

How do you know that there are supposed to be 40 features? Why is this information not included in the XGBoost model file, why is it kept separate?

solutionjh commented 4 months ago

@vruusmann Thank you for your response. I think Learner.java:355 needs an additional check, such as this.feature_names.length == 0 || this.feature_types.length == 0, then the user can use the --fmap-input option with an fmap file to update the information. In my case, I deleted feature_names=[] and feature_types=[] from the model JSON and used the --fmap-input option to update the information.
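The suggested check could look roughly like the sketch below. It is not the actual code at Learner.java:355: the `FeatureNameResolver` class, its methods and the error message are hypothetical, lists are used in place of the arrays implied by `.length`, and the precedence (embedded schema first, user-provided feature map second) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FeatureNameResolver {

    // Treats both "field absent" (null) and "field empty" ([]) as missing.
    static boolean isMissing(List<String> values) {
        return values == null || values.isEmpty();
    }

    // Prefers the embedded names; falls back to the user-provided feature map.
    static List<String> resolve(List<String> embedded, List<String> userProvided) {
        if (!isMissing(embedded)) {
            return embedded;
        }
        if (!isMissing(userProvided)) {
            return userProvided;
        }
        // Fail early with a descriptive message instead of a later IndexOutOfBoundsException.
        throw new IllegalArgumentException(
            "feature_names is null or empty, and no feature map was provided");
    }

    public static void main(String[] args) {
        List<String> embedded = Collections.emptyList();   // "feature_names":[]
        List<String> fromFmap = Arrays.asList("f0", "f1"); // placeholder names
        System.out.println(resolve(embedded, fromFmap));
    }
}
```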

Thanks for a great bridge module for python ML to java application!

vruusmann commented 4 months ago

then the user can use the --fmap-input option with an fmap file to update the information.

Updated the title of this issue accordingly - the problem is that the --fmap-input does not have any effect (when feature_names = [] and feature_types = [])?

Perhaps there should be an additional command-line flag for stating "ignore the embedded FMap, only use the user-provided FMap".

solutionjh commented 4 months ago

Updated the title of this issue accordingly - the problem is that the --fmap-input does not have any effect (when feature_names = [] and feature_types = [])?

Sure, XGBoost works well when feature_names and feature_types are empty. An additional flag or an empty check would give a good user experience.

Moreover, it would be helpful to show the user a message when feature_names and feature_types are empty.

Have a nice time~~