jpmml / jpmml-converter

Java library for authoring PMML
GNU Affero General Public License v3.0

Controlling scientific notation in PMML document #19

rjphofmann closed this issue 3 years ago

rjphofmann commented 3 years ago

Hello,

I've been actively using the PySpark2PMML package to write Spark random forest (RF) models to PMML documents, and I noticed that the output sometimes contains scientific notation:

<ScoreDistribution value="0" recordCount="2.3252954E7"/>

Is there a way to control whether or not scientific notation is used in the output? I'd prefer that it isn't used, as my C++ parser isn't written to accept it. Thanks!

Patrick Hofmann

vruusmann commented 3 years ago

The original question was asked with Apache Spark ML in mind, but the same functionality would come in handy across all JPMML-family conversion libraries (R, Scikit-Learn, etc).

At minimum, the JPMML-Converter library could provide reusable Visitor classes for transforming PMML attribute values between java.lang.Number types (e.g. from java.lang.Double to java.lang.Long or java.math.BigDecimal).
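To illustrate the number-type transformation described above, here is a minimal sketch in plain Java, independent of the JPMML class model. The class name `PlainNumbers` and both method names are hypothetical and are not part of JPMML-Converter; a real Visitor would apply this kind of conversion to each numeric attribute it visits.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of JPMML-Converter.
final class PlainNumbers {

    private PlainNumbers() {
    }

    // Returns a Long when the Double holds an exactly representable whole number,
    // otherwise returns the input unchanged. Long.toString() never uses an exponent.
    static Number toIntegralIfExact(Number value) {
        if (value instanceof Double) {
            double d = value.doubleValue();
            // 2^53 is the largest magnitude below which every whole number is exact in a double
            if (d == Math.rint(d) && Math.abs(d) <= 9007199254740992d) {
                return Long.valueOf((long) d);
            }
        }
        return value;
    }

    // Formats a number without an exponent. Going through Double.toString() keeps
    // the shortest decimal representation instead of the full binary expansion.
    static String toPlainString(Number value) {
        if (value instanceof Double || value instanceof Float) {
            double d = value.doubleValue();
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                return value.toString();
            }
            return new BigDecimal(value.toString()).stripTrailingZeros().toPlainString();
        }
        if (value instanceof BigDecimal) {
            return ((BigDecimal) value).toPlainString();
        }
        return value.toString();
    }

    public static void main(String[] args) {
        Double recordCount = 2.3252954E7;
        System.out.println(recordCount);                    // prints 2.3252954E7
        System.out.println(toIntegralIfExact(recordCount)); // prints 23252954
        System.out.println(toPlainString(recordCount));     // prints 23252954
    }
}
```

Note that `BigDecimal.toString()` can itself emit an exponent for some values, which is why the sketch calls `toPlainString()` rather than relying on the type change alone.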

Related discussion in the JPMML mailing list: https://groups.google.com/forum/#!topic/jpmml/-YKzSnWkN78

vruusmann commented 3 years ago

Alternative view: this Visitor class could instead "optimize the type of java.lang.Number attribute values". For example, in order to save memory, small integer values could be transformed from java.lang.Integer (or java.lang.Long) to java.lang.Byte or java.lang.Short.
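The narrowing idea can be sketched in plain Java as follows. The class `IntegerNarrower` and its `narrow` method are hypothetical names used only for illustration; they are not part of any JPMML library, and the sketch leaves non-integer types untouched.

```java
// Hypothetical helper, not part of JPMML-Converter.
final class IntegerNarrower {

    private IntegerNarrower() {
    }

    // Returns the smallest of Byte or Short that can hold an Integer or Long value,
    // or the original object if the value does not fit (or is another Number type).
    static Number narrow(Number value) {
        if (!(value instanceof Integer || value instanceof Long)) {
            return value;
        }
        long v = value.longValue();
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            return Byte.valueOf((byte) v);
        }
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return Short.valueOf((short) v);
        }
        return value;
    }

    public static void main(String[] args) {
        Number[] samples = {100, 1000L, 100000, 2.5};
        for (Number sample : samples) {
            Number narrowed = narrow(sample);
            // The textual value is unchanged; only the boxed type differs
            System.out.println(narrowed + " -> " + narrowed.getClass().getSimpleName());
        }
    }
}
```

Because `toString()` yields the same digits before and after narrowing, a change like this would not alter how the value is written out, which matches the expectation that the class model stays indifferent to it.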

The PMML class model should be "indifferent" to such value type changes.

Once the Visitor class is ready, it could be made default by inserting it into the org.jpmml.converter.visitors.PMMLCleanerBattery: https://github.com/jpmml/jpmml-converter/blob/1.4.2/src/main/java/org/jpmml/converter/ModelEncoder.java#L96-L97