Open wendycwong opened 1 month ago
According to @narasimhard : Customer already has written a library to translate H2O-3 mojo to PMML:
They currently use a Java environment for the conversion. For reference, see: https://github.com/jpmml/jpmml-h2o?tab=readme-ov-file#the-java-side-of-operations
java -jar pmml-h2o-example/target/pmml-h2o-example-executable-1.2-SNAPSHOT.jar --mojo-input mojo.zip --pmml-output mojo.pmml
This uses the JAR pmml-h2o-example-executable-1.2-SNAPSHOT.jar.
Using IntelliJ, I was able to generate PMML from an H2O-3 mojo using their org.jpmml.h2o.example.Main.java.
My idea here is to add one more argument to Main.java: if --fill-missing-values is present, we generate the PMML file with the preprocessing enabled.
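A minimal sketch of that opt-in flag, using only the standard library. ConverterOptions is a hypothetical class made up for illustration; it is not how the existing Main.java parses its arguments, and only the option names are taken from this thread.

```java
// Hypothetical stand-in for the options read by the converter's entry point.
final class ConverterOptions {
    String mojoInput;
    String pmmlOutput;
    // Preprocessing stays off unless --fill-missing-values is given.
    boolean fillMissingValues = false;

    static ConverterOptions parse(String[] args) {
        ConverterOptions options = new ConverterOptions();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "--mojo-input":
                    options.mojoInput = requireValue(args, ++i, "--mojo-input");
                    break;
                case "--pmml-output":
                    options.pmmlOutput = requireValue(args, ++i, "--pmml-output");
                    break;
                case "--fill-missing-values":
                    // A bare switch: its presence alone enables the preprocessing.
                    options.fillMissingValues = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown argument: " + args[i]);
            }
        }
        return options;
    }

    private static String requireValue(String[] args, int index, String name) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + name);
        }
        return args[index];
    }
}
```

With this shape, the existing command line keeps its current behaviour, and appending --fill-missing-values is the only change a user needs to make.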
From my reading on PMML, it is very easy to add missing value replacement. You need to add it to the mining schema.
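To make the mining-schema idea concrete, here is a rough sketch that builds a MiningSchema element with the JDK's DOM API. The PMML attribute names missingValueReplacement and missingValueTreatment come from the PMML MiningField definition. The class MiningSchemaSketch, its methods, and the column names used in the usage note are assumptions for illustration only; this is not how jpmml-h2o produces its output.

```java
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
final class MiningSchemaSketch {

    // Builds <MiningSchema> with one <MiningField> per entry in the map.
    static Element miningSchema(Map<String, Object> replacements) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element schema = doc.createElement("MiningSchema");
        doc.appendChild(schema);
        for (Map.Entry<String, Object> entry : replacements.entrySet()) {
            Element field = doc.createElement("MiningField");
            field.setAttribute("name", entry.getKey());
            field.setAttribute("missingValueTreatment", "asValue");
            // XML attribute values are text, so a double and a category
            // level both end up as strings here.
            field.setAttribute("missingValueReplacement", String.valueOf(entry.getValue()));
            schema.appendChild(field);
        }
        return schema;
    }

    // Looks up the replacement recorded for one field, or null if absent.
    static String replacementFor(Element schema, String fieldName) {
        NodeList fields = schema.getElementsByTagName("MiningField");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (fieldName.equals(field.getAttribute("name"))) {
                return field.getAttribute("missingValueReplacement");
            }
        }
        return null;
    }
}
```

For example, passing a map with a made-up numeric column "age" mapped to 35.5 and a made-up categorical column "gender" mapped to "F" yields two MiningField elements whose replacement attributes are both plain text.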
You can also look at the overview of variable scoping in PMML:
This really can be done!!!
New info:
In GLMMojoModelBaseConverter.java, lines 86-111 already ask for missing value treatment.
ImputerUtil.java is the place to add missing value treatment!!! It is called at line 103.
Inside ImputerUtil.java, line 40 is where the replacement value for missing values is set. This adds content to the decorator field. It looks like this is all we need to do.
Line 85 of converter.java is where the missing value treatment is added to the encoder as a decorator.
schema = toMojoModelSchema(schema); // goto XGBoost....
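The decorator idea described above can be sketched in plain Java as follows. Every type here (FieldSpec, FieldDecorator, MissingValueReplacementDecorator, EncoderSketch) is a simplified, hypothetical stand-in written for this comment. They are not the JPMML classes and do not mirror ImputerUtil.encodeFeature; they only show the shape of "register a decorator on the encoder, apply it when the field is written out".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mining field and its XML attributes.
final class FieldSpec {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();

    FieldSpec(String name) {
        this.name = name;
    }
}

// Hypothetical decorator contract: adjust a field before it is emitted.
interface FieldDecorator {
    void decorate(FieldSpec field);
}

// Records a fixed replacement value for missing inputs.
final class MissingValueReplacementDecorator implements FieldDecorator {
    private final Object replacement;

    MissingValueReplacementDecorator(Object replacement) {
        this.replacement = replacement;
    }

    @Override
    public void decorate(FieldSpec field) {
        field.attributes.put("missingValueTreatment", "asValue");
        field.attributes.put("missingValueReplacement", String.valueOf(replacement));
    }
}

// Hypothetical encoder: keeps decorators per field and applies them on encode.
final class EncoderSketch {
    private final Map<String, List<FieldDecorator>> decorators = new LinkedHashMap<>();

    void addDecorator(String fieldName, FieldDecorator decorator) {
        decorators.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(decorator);
    }

    FieldSpec encode(String fieldName) {
        FieldSpec field = new FieldSpec(fieldName);
        List<FieldDecorator> registered =
            decorators.getOrDefault(fieldName, Collections.<FieldDecorator>emptyList());
        for (FieldDecorator decorator : registered) {
            decorator.decorate(field);
        }
        return field;
    }
}
```

A field with no registered decorator comes out with no missing-value attributes, which matches the intent that the treatment is only present when it was explicitly requested.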
With GLM mojo, the missing values are incorporated as a decorator:
Note that the enum values are strings but the numerical columns are doubles.
However, in the .xml file, the replacement values are treated as strings no matter what the column type is:
I used what @narasimhard has and made some changes so that the missing values are in the decorator:
Here is the code comparison between my code and that from @narasimhard . As you can see, I basically just copied what he has and added the missing values to the decorator pattern:
Instead of having a missingDict member, I just read it in and then call ImputerUtil.encodeFeature to add the missing values into the decorator pattern.
Since the two .xml files are very similar, I don't have an opinion on which one to use.
A customer wants to add simple pre-processing to XGBoost mojo. However here is the trick:
What I know is that we can add preprocessing to the current model and use a flag to enable it, as it would be disabled by default.
This is doable. See comments below.