jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

IllegalArgumentException: Out of range: 12045138254372 #25

Closed 77QingLiu closed 5 years ago

77QingLiu commented 5 years ago

The following error appears when trying to convert a model.txt file to a PMML file:

Exception in thread "main" java.lang.IllegalArgumentException: Out of range: 12045138254372
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:202)
    at com.google.common.primitives.Ints.checkedCast(Ints.java:88)
    at org.jpmml.converter.ValueUtil.asInt(ValueUtil.java:80)
    at org.jpmml.converter.ValueUtil.asInteger(ValueUtil.java:88)
    at org.jpmml.lightgbm.LightGBMUtil$2.apply(LightGBMUtil.java:332)
    at org.jpmml.lightgbm.LightGBMUtil$2.apply(LightGBMUtil.java:324)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.jpmml.lightgbm.GBDT.encodePMML(GBDT.java:226)
    at org.jpmml.lightgbm.Main.run(Main.java:131)
    at org.jpmml.lightgbm.Main.main(Main.java:117)

The following is the model file: model.txt

After I deleted the BRANCHID field, which contains the value '12045138254372', the model file was converted successfully.

vruusmann commented 5 years ago

JPMML-LightGBM assumes that all category indices fit into the 32-bit (aka integer) value space. Your value, 12045138254372, doesn't.
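To illustrate the limit: a signed 32-bit integer tops out at 2147483647, so 12045138254372 cannot be cast to one. Below is a minimal Python sketch of that range check. The helper name `fits_in_int32` and the small sample values are made up for illustration; the library itself performs the check in Java (the stack trace shows Guava's `Ints.checkedCast`).

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647

def fits_in_int32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX

# The last value is the one from the error message; the others are illustrative.
labels = [1, 10, 2324, 12045138254372]
too_large = [v for v in labels if not fits_in_int32(v)]
print(too_large)  # [12045138254372]
```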

For a quick workaround, you could consider reindexing your categories (do you really have 12045138254372 unique category levels?). For a true fix, the JPMML-LightGBM library could switch from 32-bit indexes to 64-bit indexes.

77QingLiu commented 5 years ago

Do you mean that this category value is stored as an integer in JPMML-LightGBM? But that value is a string and has only dozens of unique category levels.

vruusmann commented 5 years ago

But that value is a string and has only dozens of unique category levels.

Currently, the value space of your BRANCHID is defined like this:

<DataField name="BRANCHID" optype="categorical" dataType="integer">
    <Value value="1"/>
    <!-- Omitted other single-digit category levels -->
    <Value value="10"/>
    <!-- Omitted other two-digit category levels -->
    <Value value="2324"/>
    <!-- Omitted other four-digit category levels -->
    <Value value="12045138254372"/>
    <Value value="12045192433901"/>
    <Value value="12977508116706"/>
</DataField>

This field has integer data type. However, the last three category values don't fit into 32-bit integer value space.

You should re-label them.
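One possible way to re-label, sketched in Python on the assumption that the labels can be rewritten before the model is trained: map each distinct original label to a small consecutive code and keep the mapping, so that scoring-time inputs can be translated the same way. The function name `build_relabel_map` is hypothetical; the sample values are taken from the DataField listing above.

```python
def build_relabel_map(labels):
    """Map each distinct label to a consecutive integer code (0, 1, 2, ...).

    Codes are assigned in sorted order of the original labels, so the
    mapping is reproducible across runs.
    """
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Sample BRANCHID values; the three large ones overflow a 32-bit integer.
branch_ids = [1, 10, 2324, 12045138254372, 12045192433901, 12977508116706, 10]

relabel = build_relabel_map(branch_ids)
encoded = [relabel[b] for b in branch_ids]

print(relabel[12045138254372])  # 3
print(encoded)                  # [0, 1, 2, 3, 4, 5, 1]
```

A model retrained on the encoded column then only sees small codes. The same `relabel` dictionary would have to be applied to any data that is later scored against the converted model.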

vruusmann commented 5 years ago

Attached is a patchfile against JPMML-LightGBM version 1.2.9 that switches the conversion-time representation of "direct category indices" from 32-bit integers to 64-bit integers.

issue_25.patch.txt

When this patchfile is applied, your model.txt file can be converted. However, I find this switch from 32-bit to 64-bit "hackish", so I won't apply it to the master branch for now.

77QingLiu commented 5 years ago

Issue solved. Thanks very much for your support. But I'm wondering why this feature, BRANCHID, has an integer data type when it is stored with a categorical data type in the pandas DataFrame. Shouldn't it be a character type?

vruusmann commented 5 years ago

But I'm wondering why this feature, BRANCHID, has an integer data type when it is stored with a categorical data type in the pandas DataFrame.

The most likely explanation is that LightGBM is performing some sort of "data type detection", and since all category levels are parseable as integers, assumes that the intended data type of this column is integer (not string aka character).
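As a rough illustration of what such detection could look like (purely a sketch of the idea, not LightGBM's or JPMML-LightGBM's actual code; the function name `infer_category_dtype` is hypothetical):

```python
def infer_category_dtype(levels):
    """Guess 'integer' if every level parses as an int, otherwise 'string'."""
    try:
        for level in levels:
            int(str(level))
    except ValueError:
        return "string"
    return "integer"

# All levels look like integers, so the column would be treated as integer.
print(infer_category_dtype(["1", "10", "2324", "12045138254372"]))  # integer

# A single non-numeric level is enough to fall back to string.
print(infer_category_dtype(["B1", "B10"]))  # string
```

Under a rule like this, the data type is decided by how the category levels look rather than by how the column was declared in pandas.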