jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML
GNU Affero General Public License v3.0

How to add recordCount and algorithm name to xgboost-pmml ? #25

Closed USCYuandaDu closed 6 years ago

USCYuandaDu commented 6 years ago

Hi, jpmml-xgboost is very useful in my project because I want to share my trained model. The question is I do not know how to add recordCount into Tree Node attribute and I also want to add algorithm name.

Thank you!

Bests, Yuanda

vruusmann commented 6 years ago

The question is I do not know how to add recordCount into Tree Node attribute

This information should be available during the conversion via org.jpmml.xgboost.GBTree -> List<RegTree> -> List<NodeStat> -> NodeStat#leaf_child_cnt. Just need to fetch the right NodeStat object, and set the Node@recordCount attribute (inside the RegTree#encodeNode(...) method).

I also want to add algorithm name.

You mean setting the MiningModel@algorithmName attribute for the top-level model element?

The name of the JPMML conversion library is generally stored in the /PMML/Header/Application element. If you use the JPMML-XGBoost library or command-line application directly, then it should report "JPMML-XGBoost 1.3.1" at the moment.

However, if you use JPMML-XGBoost via the JPMML-R or JPMML-SkLearn wrapper libraries, then the original XGBoost-generated Application element is replaced with an R- or SkLearn-generated Application element. The solution would be to make the wrapper libraries smarter, so that they preserve the original converter library name. For example, something like JPMML-XGBoost 1.3.1 via JPMML-R 1.3.10 would be quite nice.

vruusmann commented 6 years ago

Was just blurting out my initial thoughts above.

As a solution, it should be possible to enable/disable the generation of Node@recordCount (and Node@id) attributes via conversion options.

Also, it would be trivial to set MiningModel@algorithmName="XGBoost". But something also needs to be done at JPMML-R and JPMML-SkLearn library level to preserve the contents of the original /PMML/Header/Application element.

USCYuandaDu commented 6 years ago

Thank you for your reply. I got your point. And I was thinking can I add the attribute with using xml directly. Just a joke, haha.

vruusmann commented 6 years ago

And I was thinking can I add the attribute with using xml directly.

Some Java programming will definitely be necessary.

However, if you have an existing PMML document (generated by whatever tree or tree ensemble learner algorithm), then it's possible to generate Node@recordCount programmatically for the dataset at hand.

Pseudo-code workflow:

org.jpmml.evaluator.Evaluator evaluator = ...
org.jpmml.evaluator.TargetField targetField = Iterables.getOnlyElement(evaluator.getTargetFields());

List<Map<FieldName, ?>> arguments = ...
for(Map<FieldName, ?> argument : arguments){
  // Evaluate one data record at a time (the single map, not the whole list)
  Map<FieldName, ?> result = evaluator.evaluate(argument);
  Object targetValue = result.get(targetField.getName());
  // Marker interface org.jpmml.evaluator.tree.HasDecisionPath was introduced in JPMML-Evaluator version 1.4.2
  if(targetValue instanceof HasDecisionPath){
    HasDecisionPath hasDecisionPath = (HasDecisionPath)targetValue;

    List<Node> pathNodes = hasDecisionPath.getDecisionPath();
    for(Node pathNode : pathNodes){
      // Increment record counts by one from the root node to the winning node;
      // a Node without a recordCount attribute yet is treated as having a count of zero
      Number recordCount = pathNode.getRecordCount();
      pathNode.setRecordCount(recordCount != null ? recordCount.doubleValue() + 1d : 1d);
    }
  }
}
USCYuandaDu commented 6 years ago

Hi vruusmann, it's very kind of you to help me with the pseudo-code; it's really helpful. I have been playing with XGBoost. In PMML, each tree is represented as <Segment id=""> and each node as <Node id="">. Suppose I work out that the node with id = 10 in the segment with id = 0 has a record count of 100. How can I set recordCount on that Node so that the resulting PMML line reads <Node id="10" recordCount="100" score=".....">? It would be perfect if you could show me how to do this in Scala.

Thank you! Bests, Yuanda

vruusmann commented 6 years ago

The above pseudo-code works with individual tree models (eg. DecisionTreeClassifier/DecisionTreeRegressor). There's no JPMML-Evaluator API yet for dealing with tree model ensembles (eg. RandomForestClassifier/RandomForestRegressor and XGBoostClassifier/XGBoostRegressor). Maybe it will be possible to introduce a new interface org.jpmml.evaluator.tree.HasDecisionPathEnsemble or something like that in the next JPMML-Evaluator version. Worth considering.

For getting record counts today, the simplest solution would be to modify the JPMML-XGBoost library as suggested in my first comment.
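
If the count for a particular node is already known, as in the question above (node id 10 in segment id 0, count 100), one alternative is to post-process the generated PMML file with the DOM and XPath APIs that ship with the JDK. What follows is only a minimal sketch of that idea, not something JPMML-XGBoost provides: the class name RecordCountPatcher and the method patch are made up for illustration, the input is a toy fragment rather than a complete PMML document, and it assumes the document still carries Segment@id and Node@id attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any JPMML library
public class RecordCountPatcher {

  // Sets the recordCount attribute on the Node with the given id inside the
  // Segment with the given id, and returns the patched document as a string.
  static String patch(String xml, String segmentId, String nodeId, long recordCount) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document document = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));

      // local-name() sidesteps the PMML default namespace, whose version may vary.
      // The ids are pasted into the expression, so they must not contain quotes.
      String expression = "//*[local-name()='Segment'][@id='" + segmentId + "']"
        + "//*[local-name()='Node'][@id='" + nodeId + "']";
      Element node = (Element) XPathFactory.newInstance().newXPath()
        .evaluate(expression, document, XPathConstants.NODE);
      if (node == null) {
        throw new IllegalArgumentException(
          "No Node " + nodeId + " in Segment " + segmentId);
      }
      node.setAttribute("recordCount", String.valueOf(recordCount));

      StringWriter writer = new StringWriter();
      TransformerFactory.newInstance().newTransformer()
        .transform(new DOMSource(document), new StreamResult(writer));
      return writer.toString();
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      throw new IllegalStateException("Could not patch the PMML document", e);
    }
  }

  public static void main(String[] args) {
    // Toy fragment for illustration only
    String xml = "<PMML xmlns=\"http://www.dmg.org/PMML-4_3\">"
      + "<Segment id=\"0\"><TreeModel><Node id=\"1\">"
      + "<Node id=\"10\" score=\"0.5\"/>"
      + "</Node></TreeModel></Segment></PMML>";
    System.out.println(patch(xml, "0", "10", 100L));
  }
}
```

The same calls are usable from Scala without changes, since they are plain JDK classes. This approach only writes a number that has been obtained elsewhere; it does not compute record counts.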

vruusmann commented 6 years ago

So my question is that is the Node's id same between your code and xgboost?

My code assigns 1-based node indices. I have no idea if XGBoost uses the same indexing approach, or something else (eg. 0-based node indices): https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/java/org/jpmml/xgboost/RegTree.java#L106

However, please note that when the XGBoost model is compacted, then nodes are moved to different locations (eg. pulled up one or two levels), and their indices are set to null: https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/java/org/jpmml/xgboost/visitors/TreeModelCompactor.java#L118

Such XGBoost model compaction would also make the original record counts meaningless. So, if you want to have node identifiers and record counts, then you must first disable XGBoost model compaction: https://github.com/jpmml/jpmml-xgboost/blob/master/src/main/java/org/jpmml/xgboost/Main.java#L89-L93

USCYuandaDu commented 6 years ago

Thank you for your fast reply. Your answer was helpful and awesome. Looking through your code, I found that the JPMML XGBoost model gets the trained booster information from XGBoostDataInput. The idea is awesome, and I wonder how you know the format of the data input (I cannot find any document that describes the output format of Booster.toArray()). I also wonder if we could found recordCount in XGBoostDataInput ?

Bests, Yuanda

USCYuandaDu commented 6 years ago

I also wonder how you match the feature schema. It seems complex. Thank you for your time.

vruusmann commented 6 years ago

I also wonder if we could found recordCount in XGBoostDataInput?

I'm sorry to inform you that the binary XGBoost model object does not contain "record counts" information. Therefore, the only way to annotate the PMML data structure with this information (eg. setting the Node@recordCount attribute) appears to be programmatic, similar to what is proposed in https://github.com/jpmml/jpmml-xgboost/issues/25#issuecomment-400937506
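
The programmatic approach boils down to pushing every record of the dataset through every tree and incrementing a counter on each node along the path taken. The sketch below illustrates that idea in self-contained Java. It is not JPMML or XGBoost code: PathRecordCounter and TreeNode are hypothetical names, and it assumes numeric features with no missing values and a split rule where a value below the threshold goes to the left child.

```java
import java.util.List;

// Hypothetical illustration, not part of any JPMML or XGBoost library
public class PathRecordCounter {

  // Minimal binary tree node: a split on one feature, or a leaf
  static final class TreeNode {
    final int featureIndex;  // feature tested at this node (unused for leaves)
    final double threshold;  // values below the threshold go left (assumption)
    final TreeNode left;
    final TreeNode right;
    long recordCount = 0;

    TreeNode(int featureIndex, double threshold, TreeNode left, TreeNode right) {
      this.featureIndex = featureIndex;
      this.threshold = threshold;
      this.left = left;
      this.right = right;
    }

    static TreeNode leaf() {
      return new TreeNode(-1, Double.NaN, null, null);
    }

    boolean isLeaf() {
      return left == null && right == null;
    }
  }

  // Walks one record from the root to a leaf, counting every node on the path
  static void count(TreeNode root, double[] record) {
    TreeNode node = root;
    while (node != null) {
      node.recordCount++;
      if (node.isLeaf()) {
        break;
      }
      node = (record[node.featureIndex] < node.threshold) ? node.left : node.right;
    }
  }

  // In an ensemble, every record passes through every tree exactly once
  static void countAll(List<TreeNode> trees, List<double[]> records) {
    for (double[] record : records) {
      for (TreeNode tree : trees) {
        count(tree, record);
      }
    }
  }

  public static void main(String[] args) {
    TreeNode left = TreeNode.leaf();
    TreeNode right = TreeNode.leaf();
    TreeNode root = new TreeNode(0, 0.5, left, right);

    countAll(List.of(root),
      List.of(new double[]{0.1}, new double[]{0.9}, new double[]{0.3}));

    // prints "3 2 1": three records at the root, two went left, one went right
    System.out.println(root.recordCount + " " + left.recordCount + " " + right.recordCount);
  }
}
```

The resulting counts depend on the dataset used for counting, so they are only meaningful as Node@recordCount values if that dataset is the one the model was trained on.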