h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

How to add pre-processing to PMML #16199

Open wendycwong opened 1 month ago

wendycwong commented 1 month ago

A customer wants to add simple pre-processing to an XGBoost mojo. However, there is a catch:

  1. The customer has an old mojo built with an earlier H2O-3 version;
  2. The customer has converted the mojo to PMML.

What I know is that we can add the preprocessing to the current model behind a flag, so that it stays disabled by default and is enabled only on request.

This is doable. See the comments below.

wendycwong commented 1 month ago

According to @narasimhard, the customer has already written a library to translate an H2O-3 mojo to PMML:

They currently use a Java environment for the conversion. Here is a reference: https://github.com/jpmml/jpmml-h2o?tab=readme-ov-file#the-java-side-of-operations

java -jar pmml-h2o-example/target/pmml-h2o-example-executable-1.2-SNAPSHOT.jar --mojo-input mojo.zip --pmml-output mojo.pmml

It uses the JAR pmml-h2o-example-executable-1.2-SNAPSHOT.jar.

Using IntelliJ, I was able to generate PMML from an H2O-3 mojo using their org.jpmml.h2o.example.Main.java.

wendycwong commented 1 month ago

My idea is to add more arguments to Main.java: if a specific argument, --fill-missing-values, is present, we generate the PMML file with the preprocessing enabled.
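To make the argument idea concrete, here is a minimal sketch of how such a flag could be read. Only the flag name --fill-missing-values comes from this thread; the `column=value,column=value` format, the class name FillMissingArgs, and the column names are assumptions for illustration, and this is not how the jpmml-h2o Main class parses its options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: reads --fill-missing-values from the command line. */
final class FillMissingArgs {

    /** Returns column -> replacement value; empty when the flag is absent (preprocessing off). */
    static Map<String, String> parse(String[] args) {
        Map<String, String> replacements = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (!"--fill-missing-values".equals(args[i])) {
                continue; // other options (e.g. --mojo-input) are ignored in this sketch
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("--fill-missing-values needs a value");
            }
            // Assumed format: col1=value1,col2=value2
            for (String pair : args[++i].split(",")) {
                int eq = pair.indexOf('=');
                if (eq <= 0) {
                    throw new IllegalArgumentException("Expected column=value, got: " + pair);
                }
                replacements.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return replacements;
    }

    public static void main(String[] args) {
        Map<String, String> replacements = parse(args);
        System.out.println(replacements.isEmpty()
                ? "preprocessing disabled"
                : "replacements: " + replacements);
    }
}
```

Keeping the result empty when the flag is absent matches the requirement that the preprocessing is disabled by default.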

From my reading on PMML, it is very easy to add missing value replacement. You need to add it to the mining schema.

Screenshot 2024-06-03 at 7 26 17 AM
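For reference, the mining schema entry in question is a MiningField element carrying the missingValueReplacement and missingValueTreatment attributes. The sketch below only builds that fragment as text so the target output is visible; the class name MiningFieldSketch and the column "age" are made up for illustration, and a real converter would go through the PMML object model rather than string concatenation.

```java
/** Hypothetical sketch: renders one PMML MiningField with a fixed replacement value. */
final class MiningFieldSketch {

    /** Builds the MiningField fragment; asValue marks the replacement as a user-chosen value. */
    static String miningField(String name, String replacement) {
        return "<MiningField name=\"" + escape(name) + "\""
                + " missingValueReplacement=\"" + escape(replacement) + "\""
                + " missingValueTreatment=\"asValue\"/>";
    }

    /** Minimal XML attribute escaping. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(miningField("age", "-1"));
    }
}
```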

wendycwong commented 1 month ago

You can also look at the overview of variable scoping in PMML:

Screenshot 2024-06-03 at 7 27 16 AM

wendycwong commented 4 weeks ago

This really can be done!!!

New info:

In GLMMojoModelBaseConverter.java, lines 86-111 request missing value treatment.

ImputerUtil.java is the place to add missing value treatment!!! It is called on line 103.

Inside ImputerUtil.java, line 40 shows how to set the replacement value used for missing values. This adds content to the decorator field. It looks like this is all we need to do:

  1. Add a new argument to Main.java that, if enabled, gathers the specific values to replace missing values with;
  2. We may need to add a field (boolean specialMissingValueProcessing) to XGBoostMojoModelConverter to indicate whether special missing value processing is needed;
  3. Inside XGBoostMojoModelConverter.java, add missing value treatments that replace with the special values instead of the mean/mode if specialMissingValueProcessing is set;
  4. This basically means that, inside XGBoostMojoModelConverter.java, we need to add the missing value treatment to toMojoModelSchema as in GLM. However, you need to use MissingValueTreatmentMethod.AS_VALUE since you are doing something special and not using the mean or mode.
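The decision in steps 2 to 4 can be sketched as follows. This is only an illustration under assumptions: the Treatment enum is a local stand-in for MissingValueTreatmentMethod, and the class SpecialMissingValues, its nested Imputation type, and the forColumn helper are hypothetical names, not part of jpmml-h2o. Only the specialMissingValueProcessing flag and AS_VALUE come from the list above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the per-column choice the converter would make. */
final class SpecialMissingValues {

    /** Local stand-in for the PMML treatment methods named in this thread. */
    enum Treatment { AS_MEAN, AS_MODE, AS_VALUE }

    /** A treatment plus the value substituted for a missing cell. */
    static final class Imputation {
        final Treatment treatment;
        final Object replacement;

        Imputation(Treatment treatment, Object replacement) {
            this.treatment = treatment;
            this.replacement = replacement;
        }
    }

    /** Returns AS_VALUE with the user's value when the flag is set; otherwise no treatment. */
    static Optional<Imputation> forColumn(boolean specialMissingValueProcessing,
                                          Map<String, Object> userValues,
                                          String column) {
        if (!specialMissingValueProcessing) {
            return Optional.empty(); // disabled by default: the schema is left untouched
        }
        return Optional.ofNullable(userValues.get(column))
                .map(value -> new Imputation(Treatment.AS_VALUE, value));
    }

    public static void main(String[] args) {
        Map<String, Object> userValues = Collections.singletonMap("age", -1.0);
        forColumn(true, userValues, "age")
                .ifPresent(i -> System.out.println(i.treatment + " " + i.replacement));
    }
}
```

In the real converter, the returned value would be handed to the same imputation helper that the GLM converter uses, so that it ends up in the decorator.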

wendycwong commented 4 weeks ago

Line 85 of converter.java is where the missing value treatment is added to the encoder as a decorator.

schema = toMojoModelSchema(schema); // goto XGBoost....

wendycwong commented 3 weeks ago

With GLM mojo, the missing values are incorporated as a decorator:

Screenshot 2024-06-11 at 2 18 40 PM

Note that the enum values are strings, but the values for numerical columns are doubles.

wendycwong commented 3 weeks ago

However, in the .xml file, the replacement values are written as strings no matter what the column type is:

Screenshot 2024-06-11 at 2 21 54 PM
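The string form is expected, since an XML attribute is always text. Below is a minimal sketch of how a reader of the file could recover a typed value, assuming the column's declared PMML data type (for example "double" or "string") is known from elsewhere in the document. The class name ReplacementTypes, the helper typedReplacement, and the sample field are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

/** Hypothetical sketch: the attribute is text; the declared data type decides how to read it. */
final class ReplacementTypes {

    /** Parses one MiningField fragment and converts its replacement value by data type. */
    static Object typedReplacement(String miningFieldXml, String dataType) {
        Element field;
        try {
            field = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(miningFieldXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XML fragment", e);
        }
        // getAttribute always returns a String, whatever the column type is
        String raw = field.getAttribute("missingValueReplacement");
        if ("double".equals(dataType)) {
            return Double.valueOf(raw);
        }
        return raw; // enum/string columns keep the text as-is
    }

    public static void main(String[] args) {
        String xml = "<MiningField name=\"age\" missingValueReplacement=\"-1\"/>";
        Object value = typedReplacement(xml, "double");
        System.out.println(value.getClass().getSimpleName() + " " + value);
    }
}
```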

wendycwong commented 3 weeks ago

I took what @narasimhard has and made some changes so that the missing values are in the decorator:

Screenshot 2024-06-11 at 2 26 02 PM

wendycwong commented 3 weeks ago

Here is a comparison between my code and the code from @narasimhard. As you can see, I basically copied what he has and added the missing values to the decorator:

Screenshot 2024-06-11 at 2 52 04 PM

wendycwong commented 3 weeks ago

Instead of keeping a missingDict member, I just read the values in and then call ImputerUtil.encodeFeature to add the missing values to the decorator.

Since the two output .xml files are very similar, I don't have an opinion on which one to use.