jpmml / jpmml-transpiler

Java Transpiler (Translator + Compiler) API for PMML
GNU Affero General Public License v3.0
28 stars 2 forks source link

Double datatype is used within switch-case statement in generated java source #21

Open denmase opened 1 year ago

denmase commented 1 year ago

Hi @vruusmann,

While testing for your release after #20, I tried to transpile a folder which contains many pmml files, and accidentally, one of them was from Extending Scikit-Learn with CHAID model type, and it was the only file which wasn't compiled properly with following error:

Transpiling: CHAIDIris.pmml
META-INF/services\org.dmg.pmml.PMML
com\mycompany\MyModel.java
com\mycompany\PMML$1041109062.data
/com/mycompany/MyModel.java:188: error: incompatible types: java.lang.Double cannot be converted to int
/com/mycompany/MyModel.java:293: error: incompatible types: java.lang.Double cannot be converted to int
 ==> FAILED.
java.io.IOException
    at org.jpmml.codemodel.CompilerUtil.compile(CompilerUtil.java:89)
    at org.jpmml.transpiler.TranspilerUtil.compile(TranspilerUtil.java:103)
    at org.jpmml.transpiler.TranspilerUtil.compile(TranspilerUtil.java:81)
    at com.mycompany.mavenproject1.Main.transpilePmml(Main.java:242)
    at com.mycompany.mavenproject1.Main.main(Main.java:130)

Lines 188 & 293 are both a switch-case statement

      Double javaValue = value.asDouble();
      switch (javaValue) { // <== this is line 188, 293

I don't know whether it is changed in Java 9 or later, but at least AFAIK in Java 8, we cannot use double in switch-case statement. Consider using if instead?

Thank you.

vruusmann commented 1 year ago

one pmml was from Extending Scikit-Learn with CHAID model type and it was the only file which wasn't compiled properly

Can you attach this problematic PMML file here? The referenced blog post deals with many CHAID model configurations, so it would be extra work to generate and test them all.

Very interesting that you're going through my work so systematically.

Lines 188 & 293 are both a switch-case statement

Yes, the switch statement does not support floating-point values.

Right now I would say that it's more like a PMML converter problem - it should be automatically "downcasting" categorical double and float fields to int fields.

Consider using if instead?

If it's a one-to-one mapping, then it might be a better idea to generate a plain dictionary lookup in that place. Something like Map<Double, ?>. The JPMML-Transpiler library already has some "infrastructure" in place when it comes to moving such stuff into external resource files (in order to avoid blowing up Java class file constant tables).

The Java compiler will also be translating (smaller-) switch statements to dictionary lookups.

denmase commented 1 year ago

Please find attached the generated PMML and the java file. Also, a Decision Tree-based PMML (of which, I believe, shares the same structure, but transpiled to very different java source). Hence, I was wondering why it was "treated" differently by the translator. So, it does make sense that it might be problem of PMML converter instead.

The problematic PMML was generated using following code:

from sklearn_pandas import DataFrameMapper
from sklearn.datasets import load_iris
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.tree.chaid import CHAIDClassifier

def make_passthrough_mapper(cols):
    return DataFrameMapper(
        [([col], CategoricalDomain()) for col in cols]
    )

def make_classifier(max_depth = 5):
    config = {
        "max_depth" : max_depth
    }
    return CHAIDClassifier(config = config)

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

pipeline = PMMLPipeline([
    ("mapper", make_passthrough_mapper(iris_X.columns.values)),
    ("classifier", make_classifier(max_depth = 3))
])
pipeline.fit(iris_X, iris_y)

sklearn2pmml(pipeline, "pmmlIris/CHAIDIris.pmml")

I like your project very much and it should gain more attention, but to contribute on coding is way out of my league, because of the complexity, coding style and my lack of coding skill. So, testing it is the very least that I can do to contribute.

pmmlIris.zip

vruusmann commented 1 year ago

Got it - this issue manifests itself, because the CHAID model is trained on categorical double fields. Casting to int is not an option here, because the original input values are fractional (eg. 0.8, 1.2), for which there is no integer equivalent available.

The quick workaround would be to cast categorical doubles to categorical strings - note the use of the CategoricalDomain.dtype attribute for this purpose:

def make_passthrough_mapper(cols):
    return DataFrameMapper(
        [([col], CategoricalDomain(dtype = str)) for col in cols]
    )

The actual fix must therefore also go in the direction of a dictionary lookup.

vruusmann commented 1 year ago

@denmase I've seen you inquiring several GitHub scorecard projects about PMML support.

Do you have any specific workflows in mind? If you do, then feel free to shoot me an e-mail, and maybe we can join our forces.

denmase commented 1 year ago

Great that you found the problem straight away.

I don't have any clear and specific workflows in mind yet (at least for now). However, my broad wish is to have easy and seamless pipeline in model development, implementation, and monitoring, for both traditional model (e.g., scorecard, segmentation trees, BR, etc.) and ML model. I found your project about a year ago and then found out later that jpmml-evaluator is used in a product (the vendor must be your client) which we use. I somewhat "dislike" the way they used your library, in term of upgradability (to use more recent version of your library is almost impossible without recompile), so I am trying to find other way to implement models in faster and efficient way without sacrificing upgradability. My work is about decisioning automation, it must be something that very familiar to you and you know what I'm talking about.

vruusmann commented 1 year ago

I am trying to find other way to implement models in faster and efficient way without sacrificing upgradability.

The Java code that is generated by JPMML-Transpiler is still hard-coupled to the underlying JPMML-Evaluator library platform version. For example, the latest JPMML-Transpiler 1.3.1 assumes that it can freely use JPMML-Evaluator 1.6.4 API classes and methods.

The upgradeability of Java libraries is production systems is a serious matter - falls to the category of "don't fix it unless it's (completely-) broken".

Anyway, your comment just sparked a few ideas in my head. It should be possible to create an extra layer into the JPMML-Evaluator library hierarchy, which would allow referencing a specific JPMML-Evaluator JAR file as "use this JAR version for running this model". Very easy to pull it off using the builder pattern!

Something like this:

org.jpmml.evaluator.Evaluator evaluator = new LoadingEvaluatorBuilder()
  # THIS!
  .setRuntimeJar("/path/to/jpmml-evaluator-${version}")
  .load("MyModel.pmml.xml")
  .build();

The org.jpmml.evaluator.Evaluator interface has been very stable over the years; the only breaking change was between 1.5.X and 1.6.X, where I got rid of the org.dmg.pmml.FieldName class (in arguments and results Map type declarations).

denmase commented 1 year ago

Anyway, your comment just sparked a few ideas in my head. It should be possible to create an extra layer into the JPMML-Evaluator library hierarchy, which would allow referencing a specific JPMML-Evaluator JAR file as "use this JAR version for running this model". Very easy to pull it off using the builder pattern!

Wow.. that's cool. Please make it configurable in a way so third party software, which uses jpmml-evaluator, can expose a way to enable end user to configure which JAR file to be used, or something like that.

The upgradeability of Java libraries is production systems is a serious matter - falls to the category of "don't fix it unless it's (completely-) broken".

I know, and this is actually what happen in my case, it isn't broken, but it isn't working either, because it uses older version of jpmml-evaluator (supports up to PMML v4.3). Downgrading to v4.3 does work for some cases, but in other cases it just doesn't work, while asking for upgrade from vendor isn't an option either. So I am locked and stuck. It's a real pain in the rear for model implementer (like me). Modeller only says "Hey, I've built the model, I don't care about the implementation, that's YOUR problem.".

Btw, I'll probably drop you an email or two, to brainstorm about my wish I spoke about in more detail, if it's okay to you.