jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

Detect and normalize poorly encoded categorical splits #35

Closed vruusmann closed 3 years ago

vruusmann commented 4 years ago

When categorical features are binarized (e.g. using LabelBinarizer) instead of being integer-encoded (e.g. using LabelEncoder), splits on the resulting 0/1 indicator columns are encoded as double comparisons against the reference value 1.0000000180025095E-35. This is not the smallest positive 64-bit value; it is the 32-bit float 1e-35 (LightGBM's zero threshold) widened to a 64-bit double, i.e. a tiny value that is still greater than 0:

<Node id="8" score="0.0745191789134865" recordCount="39">
            <SimplePredicate field="lookup(Employment)" operator="lessOrEqual" value="1.0000000180025095E-35"/>
</Node>
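To see where the odd-looking reference value comes from, here is a minimal sketch. It assumes the threshold originates from a 32-bit float constant 1e-35 (as LightGBM's `kZeroThreshold` appears to be) that is later stored as a 64-bit double; the variable names are illustrative only.

```python
import struct
import sys

# Round-trip 1e-35 through a 32-bit float, then read it back as a
# Python float (which is a 64-bit double).
threshold = struct.unpack("<f", struct.pack("<f", 1e-35))[0]

# The widened value is what ends up in the PMML document.
print(repr(threshold))

# It is tiny, but still many orders of magnitude above the smallest
# positive normal double.
print(threshold > sys.float_info.min)

# For a 0/1 indicator column, "x <= threshold" is true only for x == 0.
print([x <= threshold for x in (0, 1)])
```

For a binarized feature the comparison therefore carries no more information than a test for 0.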

It would be much more transparent and space-efficient to encode the same splits as integer comparisons against the reference values 0 and 1:

<Node id="8" score="0.0745191789134865" recordCount="39">
            <SimplePredicate field="lookup(Employment)" operator="equal" value="0"/>
</Node>

and/or:

<Node id="8" score="0.0745191789134865" recordCount="39">
            <SimplePredicate field="lookup(Employment)" operator="notEqual" value="1"/>
</Node>
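The proposed normalization could be sketched roughly as follows. This is not the converter's actual implementation: `normalize_binary_split` and `binary_fields` are hypothetical names, the set of fields known to hold only 0 and 1 is assumed to be supplied by the caller, and the XML is parsed without the PMML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed zero threshold: the 32-bit float 1e-35 widened to a double.
ZERO_THRESHOLD = 1.0000000180025095e-35


def normalize_binary_split(predicate, binary_fields):
    """Rewrite a zero-threshold comparison on a 0/1 field as an integer equality test."""
    # Only touch fields that are known to hold nothing but 0 and 1.
    if predicate.get("field") not in binary_fields:
        return predicate
    try:
        value = float(predicate.get("value"))
    except (TypeError, ValueError):
        return predicate
    if value != ZERO_THRESHOLD:
        return predicate
    operator = predicate.get("operator")
    if operator == "lessOrEqual":
        # x <= tiny positive value  <=>  x == 0 for a 0/1 field
        predicate.set("operator", "equal")
        predicate.set("value", "0")
    elif operator == "greaterThan":
        # x > tiny positive value  <=>  x == 1 for a 0/1 field
        predicate.set("operator", "equal")
        predicate.set("value", "1")
    return predicate


node = ET.fromstring(
    '<Node id="8" score="0.0745191789134865" recordCount="39">'
    '<SimplePredicate field="lookup(Employment)" operator="lessOrEqual" '
    'value="1.0000000180025095E-35"/>'
    "</Node>"
)
predicate = node.find("SimplePredicate")
normalize_binary_split(predicate, {"lookup(Employment)"})
print(ET.tostring(node, encoding="unicode"))
```

The rewrite is only safe when the field really is restricted to 0 and 1; for a general continuous field the original threshold comparison has to stay as it is.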