jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Migrating XGBoost workflow from R2PMML version 0.24 to 0.26 #72

Closed upoi closed 2 years ago

upoi commented 2 years ago

Dear vruusmann,

Thank you for providing the r2pmml package. I have used it a lot with logistic regression and it worked perfectly! Now I'm trying to use it to encode an xgboost model that uses some categorical features as well. However, the categorical features are defined as optype="continuous" and dataType="float" in the resulting PMML. With an older version of r2pmml I had this working before. How can I achieve that categoricals are declared as such in the PMML with the newer version? I have a reproducible example.

Using r2pmml 0.24.0 with xgboost 0.90.0.1:

R version 3.6.0 (2019-04-26) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux

Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] dplyr_1.0.2 r2pmml_0.24.0 xgboost_0.90.0.1

library(xgboost)
library(r2pmml)
library(dplyr)
data(iris)

iris <- iris %>% mutate(Sepal.Length = if_else(Sepal.Length > 5, "high", "low"),
                        Sepal.Width = if_else(Sepal.Width > 2.2, "high", "low"),
                        Petal.Length = if_else(Petal.Length > 4, "high", "low"),
                        Petal.Width = if_else(Petal.Width > 1.8, "high", "low"),
                        Species = if_else(Species == "virginica", "versicolor", as.character(Species)))

iris <- iris %>% mutate_if(is.character, as.factor)

iris_X = iris[, -ncol(iris)]
iris_y = iris[, ncol(iris)]
iris_y = (as.integer(iris_y) - 1)
iris.matrix = model.matrix(~ . - 1, data = iris_X)
iris.DMatrix = xgb.DMatrix(iris.matrix, label = iris_y)
iris.fmap = r2pmml::genFMap(iris_X)
iris.xgboost = xgboost(data = iris.DMatrix,
                       objective = "multi:softprob", num_class = 3, nrounds = 11)
iris.xgboost = decorate(iris.xgboost, iris.fmap, 
                        response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"))
r2pmml(iris.xgboost, "/some/path/pmml/iris.pmml", compact = FALSE)

produces the following data dictionary in the resulting iris.pmml:

<DataDictionary>
    <DataField name="Species" optype="categorical" dataType="string">
        <Value value="setosa"/>
        <Value value="versicolor"/>
        <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
    <DataField name="Sepal.Width" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
</DataDictionary>

That looks good. However using r2pmml 0.26.0 and xgboost 1.4.1.1:

R version 4.0.5 (2021-03-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux

Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

attached base packages: [1] stats graphics grDevices datasets utils methods base

other attached packages: [1] dplyr_1.0.7 r2pmml_0.26.0 xgboost_1.4.1.1

library(xgboost)
library(r2pmml)
library(dplyr)
data(iris)

iris <- iris %>% mutate(Sepal.Length = if_else(Sepal.Length > 5, "high", "low"),
                        Sepal.Width = if_else(Sepal.Width > 2.2, "high", "low"),
                        Petal.Length = if_else(Petal.Length > 4, "high", "low"),
                        Petal.Width = if_else(Petal.Width > 1.8, "high", "low"))
iris <- iris %>% mutate_if(is.character, as.factor)

iris_X = iris[, -ncol(iris)]
iris_y = iris[, ncol(iris)]
iris_y = (as.integer(iris_y) - 1)
iris.matrix = model.matrix(~ . - 1, data = iris_X)
iris.DMatrix = xgb.DMatrix(iris.matrix, label = iris_y)
iris.fmap = as.fmap(iris.matrix)
iris.xgboost = xgboost(data = iris.DMatrix,
                       objective = "multi:softprob", num_class = 3, nrounds = 11)
iris.xgboost = decorate(iris.xgboost, iris.fmap, 
                        response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"))
r2pmml(iris.xgboost, "/some/path/iris.pmml", compact = FALSE)

produces a data dictionary where the features are defined as continuous:

<DataDictionary>
    <DataField name="Species" optype="categorical" dataType="string">
        <Value value="setosa"/>
        <Value value="versicolor"/>
        <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Lengthhigh" optype="continuous" dataType="float"/>
    <DataField name="Sepal.Widthlow" optype="continuous" dataType="float"/>
    <DataField name="Petal.Lengthlow" optype="continuous" dataType="float"/>
    <DataField name="Petal.Widthlow" optype="continuous" dataType="float"/>
</DataDictionary>

The only difference between the two code snippets is the creation of the fmap, since older versions had the function genFMap, which does not take a model.matrix. I tried giving as.fmap the iris_X data frame, but then there are too many features in iris.fmap, since it does not recognise the reference levels of the factors. I don't see what I'm missing and could not find much online, so I wanted to ask you about that.

Thanks and BR, Lukas

vruusmann commented 2 years ago

The only difference in the two code snippets is the creation of the fmap

The R2PMML package gets feature type information from the fmap object.

Between your two examples, something must have happened to the feature map generation algorithm already on the R side. This feature map now contains invalid type information, which leads to the generation of invalid model schema.

I don't have time to experiment with parallel R versions right now, therefore I'm giving you some quick pointers that you could try out locally (and report back to here for more feedback from me).

As a first step, save both iris.fmap objects to a text file, and compare them line-by-line. In the 0.24 case you should have feature type as i (stands for "indicator") whereas in 0.26 you should have them as q (stands for "quantity"), right?
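
The comparison could be sketched with base R alone, roughly as below. The two data frames are hand-built stand-ins (hypothetical values, not the real objects from this thread), and the id/name/type column layout is an assumption about what the iris.fmap objects look like.

```r
# Hypothetical stand-ins for the two feature maps; replace them with the real
# iris.fmap objects from each session. The id/name/type layout is assumed.
fmap_old = data.frame(id = 0:1, name = c("Sepal.Length=high", "Sepal.Length=low"),
                      type = c("i", "i"), stringsAsFactors = FALSE)
fmap_new = data.frame(id = 0:1, name = c("Sepal.Lengthhigh", "Sepal.Lengthlow"),
                      type = c("q", "q"), stringsAsFactors = FALSE)

# Dump a feature map as a tab-separated text file and return the file path
dump_fmap = function(fmap, path){
  write.table(fmap, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  path
}
old_lines = readLines(dump_fmap(fmap_old, tempfile(fileext = ".txt")))
new_lines = readLines(dump_fmap(fmap_new, tempfile(fileext = ".txt")))

# Line-by-line comparison; pad the shorter file with NA so the lengths match
n = max(length(old_lines), length(new_lines))
length(old_lines) = n
length(new_lines) = n
differs = is.na(old_lines) | is.na(new_lines) | old_lines != new_lines
print(data.frame(old = old_lines, new = new_lines, differs = differs))
```

Rows flagged in the differs column are the ones worth inspecting first.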

Try to tweak the definition of the iris.matrix object, or the r2pmml::as.fmap(x) function until these iris.fmap column types change to i for 0.26 also.

Please note that you're using R 3.6 with 0.24 and R 4.0 with 0.26. IIRC, there were some breaking changes to the matrix data structure and/or its behaviour in R 4.0. You might compare iris.matrix objects between examples as well - are they any different in terms of the data type (eg. before integer, now float)?
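
A quick way to inspect the matrix in each session might look like this. The data frame is a small hypothetical stand-in for iris_X; only base R functions are used.

```r
# Small stand-in data frame with one factor column (hypothetical data)
df = data.frame(f = factor(c("high", "low", "high")))
m = model.matrix(~ . - 1, data = df)

# Storage type, shape and column names of the design matrix
print(typeof(m))
print(dim(m))
print(colnames(m))
```

Running the same three print calls on iris.matrix under both R versions shows whether the matrix itself differs.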

upoi commented 2 years ago

Here are the fmaps as created above:

[image: r2pmml 0.24.0 iris.fmap]

[image: r2pmml 0.26.0 iris.fmap]

I can create an fmap with r2pmml 0.26.0 such that the feature type is i, using iris.fmap = as.fmap(iris_X). This results in:

[image: iris.fmap]

However, using this fmap to write the PMML results in an error:

> iris.fmap = as.fmap(iris_X)
> iris.xgboost = decorate(iris.xgboost, iris.fmap, 
+                         response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"))
> r2pmml(iris.xgboost, "/some/path/iris.pmml", compact = FALSE)
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
INFO: Parsing RDS..
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
INFO: Parsed RDS in 9 ms.
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
INFO: Initializing default Converter
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
INFO: Initialized org.jpmml.rexp.XGBoostConverter
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
INFO: Converting RDS to PMML..
Mar 03, 2022 1:52:38 PM org.jpmml.rexp.Main run
SEVERE: Failed to convert RDS to PMML
java.lang.IllegalArgumentException: Invalid 'fmap' element. Expected 5 features, got 8 features
    at org.jpmml.rexp.XGBoostConverter.checkFeatureMap(XGBoostConverter.java:249)
    at org.jpmml.rexp.XGBoostConverter.encodeSchema(XGBoostConverter.java:70)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)

Exception in thread "main" java.lang.IllegalArgumentException: Invalid 'fmap' element. Expected 5 features, got 8 features
    at org.jpmml.rexp.XGBoostConverter.checkFeatureMap(XGBoostConverter.java:249)
    at org.jpmml.rexp.XGBoostConverter.encodeSchema(XGBoostConverter.java:70)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, converter, converter_classpath, verbose) : 
  The JPMML-R conversion application has failed (error code 1). The Java executable should have printed more information about the failure into its standard output and/or standard error streams

The iris.matrix is exactly the same in both situations, and its type is double.

vruusmann commented 2 years ago

r2pmml 0.24.0 iris.fmap vs. r2pmml 0.26.0 iris.fmap

Seeing those two images side-by-side tells me that the behaviour of the dplyr::if_else function has changed materially between R version 3.6 and 4.0. Before it was emitting 8 binary indicators, now it's emitting 5 quantities.

I'm especially concerned about the fact that in R 4.0 there are only five entries in the feature map. Why are three feature definitions (eg. Sepal.Width == "low") missing?

I can create a fmap with r2pmml 0.26.0 s.t. the type of features is i

Did you create this feature map manually, or did you fix any of your R code/my R2PMML code to make this happen?

java.lang.IllegalArgumentException: Invalid 'fmap' element. Expected 5 features, got 8 features

The feature map and the XGBoost model file are in conflict with each other (the XGBoost model knows that it was trained using five features, but the provided feature map supplies eight feature definitions).

This is a sanity check. You don't want to suppress it, because if you do, you will get an XGBoost PMML file that encodes invalid business logic.
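
The same mismatch can be caught on the R side before calling the converter. The following is a minimal sketch: check_fmap is a hypothetical helper (not part of r2pmml), and the matrix and feature maps are stand-ins.

```r
# Hypothetical helper (not part of r2pmml): fail early in R when the feature map
# and the training matrix disagree on the number of features.
check_fmap = function(fmap, x){
  if(nrow(fmap) != ncol(x)){
    stop(sprintf("Feature map has %d entries, but the matrix has %d columns",
                 nrow(fmap), ncol(x)))
  }
  invisible(TRUE)
}

x = matrix(0, nrow = 3, ncol = 5)   # stand-in for a five-column iris.matrix
fmap_ok = data.frame(id = 0:4, name = paste0("x", 0:4), type = "q")
fmap_bad = data.frame(id = 0:7, name = paste0("x", 0:7), type = "i")

check_fmap(fmap_ok, x)              # passes silently
result = tryCatch(check_fmap(fmap_bad, x), error = function(e) conditionMessage(e))
print(result)
```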

upoi commented 2 years ago

Seeing those two images side-by-side tells me that the behaviour of the dplyr::if_else function has changed materially between R version 3.6 and 4.0. Before it was emitting 8 binary indicators, now it's emitting 5 quantities.

In r2pmml 0.24.0 I used genFMap(iris_X) and in r2pmml 0.26.0 as.fmap(iris.matrix) to generate the fmaps in the pictures. I don't know if that comes down to dplyr::if_else.

Why are three feature definitions (eg. Sepal.Width == "low") missing?

Those are omitted from iris.matrix because we already have a binary column for Sepal.Width == "high", from which Sepal.Width == "low" is inferred. The features of the xgb model (in both R/r2pmml/xgboost versions) are exactly the 5 features in the r2pmml 0.26.0 iris.fmap.

Did you create this feature map manually, or did you fix any of your R code/my R2PMML code to make this happen?

No, I generated it using iris.fmap = as.fmap(iris_X) instead of as.fmap(iris.matrix).

The feature map and the XGBoost model file are in conflict with each other (the XGBoost model knows that it was trained using five features, but the provided feature map supplies eight feature definitions).

Yes, that's clear. But in r2pmml 0.24.0 the fmap with 8 entries can be handled by r2pmml() without this error, and it results in the correct variable types in the PMML.

What I tried now is to manually set the type of the fmap in r2pmml 0.26.0 to i:

> iris.fmap = as.fmap(iris.matrix)
> iris.fmap$type <- as.factor("i")
> iris.xgboost = xgboost(data = iris.DMatrix,
+                        objective = "multi:softprob", num_class = 3, nrounds = 11)
> iris.xgboost = decorate(iris.xgboost, iris.fmap, 
+                         response_name = "Species", response_levels = c("setosa", "versicolor", "virginica"))
> r2pmml(iris.xgboost, "/data/dev/user/q392698/ESA/pmml/iris.pmml", compact = FALSE)
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
INFO: Parsing RDS..
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
INFO: Parsed RDS in 9 ms.
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
INFO: Initializing default Converter
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
INFO: Initialized org.jpmml.rexp.XGBoostConverter
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
INFO: Converting RDS to PMML..
Mar 04, 2022 11:10:16 AM org.jpmml.rexp.Main run
SEVERE: Failed to convert RDS to PMML
java.lang.IllegalArgumentException: Sepal.Lengthhigh
    at org.jpmml.xgboost.FeatureMap.addEntry(FeatureMap.java:123)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:290)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:261)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:226)
    at org.jpmml.rexp.XGBoostConverter.ensureFeatureMap(XGBoostConverter.java:205)
    at org.jpmml.rexp.XGBoostConverter.encodeSchema(XGBoostConverter.java:67)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)

Exception in thread "main" java.lang.IllegalArgumentException: Sepal.Lengthhigh
    at org.jpmml.xgboost.FeatureMap.addEntry(FeatureMap.java:123)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:290)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:261)
    at org.jpmml.rexp.XGBoostConverter.loadFeatureMap(XGBoostConverter.java:226)
    at org.jpmml.rexp.XGBoostConverter.ensureFeatureMap(XGBoostConverter.java:205)
    at org.jpmml.rexp.XGBoostConverter.encodeSchema(XGBoostConverter.java:67)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, converter, converter_classpath, verbose) : 
  The JPMML-R conversion application has failed (error code 1). The Java executable should have printed more information about the failure into its standard output and/or standard error streams

But that does not work either.

vruusmann commented 2 years ago

In r2pmml 0.24.0 I used genFMap(iris_X) and in r2pmml 0.26.0 as.fmap(iris.matrix) to generate the fmaps in the pictures.

OK - I completely missed this distinction, but it's a very important one.

The r2pmml::as.fmap(x) function is a generic function. It has a specialization for the data.frame type: https://github.com/jpmml/r2pmml/blob/0.26.1/R/xgboost.R#L16-L18

The data.frame data structure contains more type information (columns!) than the matrix data structure, so as.fmap(iris_X) is the suggested approach for generating feature maps.
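
The difference between the two input types can be illustrated with a toy S3 generic. This is only a sketch of the dispatch idea: toy_fmap and its methods are hypothetical and are not the r2pmml implementation.

```r
# Toy generic, for illustration only; this is NOT the r2pmml implementation.
toy_fmap = function(x) UseMethod("toy_fmap")

# data.frame method: factor columns expand to one indicator ("i") per level
toy_fmap.data.frame = function(x){
  rows = lapply(names(x), function(name){
    col = x[[name]]
    if(is.factor(col)){
      data.frame(name = paste(name, levels(col), sep = "="), type = "i")
    } else {
      data.frame(name = name, type = "q")
    }
  })
  do.call(rbind, rows)
}

# matrix method: only column names survive, so every column looks like a quantity
toy_fmap.matrix = function(x){
  data.frame(name = colnames(x), type = "q")
}

df = data.frame(size = factor(c("high", "low", "high")), weight = c(1.5, 2.0, 2.5))
fmap_df = toy_fmap(df)
fmap_mat = toy_fmap(model.matrix(~ . - 1, data = df))
print(fmap_df)
print(fmap_mat)
```

The matrix has already flattened the factor into numeric columns, so nothing in it tells a method which columns were indicators.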

vruusmann commented 2 years ago

OK, looks like we have something strange going on in this thread.

I will set up parallel R 3.6 and 4.0 workflows (based on the current Iris example), and investigate in detail locally. Hoping to have an explanation/workaround out in a day or two.

vruusmann commented 2 years ago

Exception in thread "main" java.lang.IllegalArgumentException: Invalid 'fmap' element. Expected 5 features, got 8 features

The XGBoost model has been trained using an xgboost.DMatrix data matrix that contains five data columns. This violates our expectation that a data matrix should have exactly eight columns (four continuous features, which have been transformed into two binary features (high/low) each).

Diagnosing the problem:

iris.matrix = model.matrix(~ . - 1, data = iris_X)
print(iris.matrix)

iris.DMatrix = xgb.DMatrix(iris.matrix, label = iris_y)
print(iris.DMatrix)

The above printout shows that the "error" already exists at the iris.matrix object stage, because it's a [150 x 5] matrix (not a [150 x 8] matrix). The five columns are named Sepal.Lengthhigh, Sepal.Lengthlow, Sepal.Widthlow, Petal.Lengthlow and Petal.Widthlow, which makes little sense.
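
The five-column result can be reproduced without dplyr or xgboost. The sketch below rebuilds iris_X with base R (an assumption made for self-containment; the thresholds are the ones from the example) and prints the shape and column names.

```r
data(iris)

# Base-R equivalent of the thread's dplyr preprocessing (same thresholds)
iris_X = data.frame(
  Sepal.Length = factor(ifelse(iris$Sepal.Length > 5, "high", "low")),
  Sepal.Width  = factor(ifelse(iris$Sepal.Width > 2.2, "high", "low")),
  Petal.Length = factor(ifelse(iris$Petal.Length > 4, "high", "low")),
  Petal.Width  = factor(ifelse(iris$Petal.Width > 1.8, "high", "low"))
)

# Default (treatment) contrasts: with the intercept removed, only the first
# factor keeps both levels; every later factor loses its first level ("high")
iris.matrix = model.matrix(~ . - 1, data = iris_X)
print(dim(iris.matrix))
print(colnames(iris.matrix))
```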

Fixing the problem:

iris.fmap = as.fmap(iris_X)

# THIS!
iris.contrasts = lapply(iris_X[sapply(iris_X, is.factor)], contrasts, contrasts = FALSE)
iris.matrix = model.matrix(~ . - 1, data = iris_X, contrasts.arg = iris.contrasts)
print(iris.matrix)

iris.DMatrix = xgb.DMatrix(iris.matrix, label = iris_y)
print(iris.DMatrix)

If the matrix object is generated for (partially-) categorical data, then you must supply the model.matrix function call with an appropriate contrasts.arg parameter.

In the original example this parameter was missing, and the model.matrix was using the default contrasting mechanism which is appropriate for linear models (ie. fighting collinearity by automatically dropping three category levels out of eight). This default contrasting mechanism is not applicable to decision tree-based models such as XGBoost.

After deriving and supplying the iris.contrasts argument, the iris.matrix contains eight data columns.
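
A quick check of that claim, as a sketch using base R only: iris_X is rebuilt here without dplyr (an assumption made for self-containment), and the row-sum test is an extra sanity check that is not part of the original workflow.

```r
data(iris)

# Base-R equivalent of the thread's dplyr preprocessing (same thresholds)
iris_X = data.frame(
  Sepal.Length = factor(ifelse(iris$Sepal.Length > 5, "high", "low")),
  Sepal.Width  = factor(ifelse(iris$Sepal.Width > 2.2, "high", "low")),
  Petal.Length = factor(ifelse(iris$Petal.Length > 4, "high", "low")),
  Petal.Width  = factor(ifelse(iris$Petal.Width > 1.8, "high", "low"))
)

# Full indicator coding: one column per factor level, no level dropped
iris.contrasts = lapply(iris_X[sapply(iris_X, is.factor)], contrasts, contrasts = FALSE)
iris.matrix = model.matrix(~ . - 1, data = iris_X, contrasts.arg = iris.contrasts)
print(colnames(iris.matrix))

# Each row activates exactly one indicator per original factor
one_hot_ok = all(rowSums(iris.matrix) == ncol(iris_X))
print(one_hot_ok)
```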

The r2pmml::r2pmml() function call is now also successful, and the resulting PMML file contains four categorical fields as expected:

<DataDictionary>
    <DataField name="Species" optype="categorical" dataType="string">
        <Value value="setosa"/>
        <Value value="versicolor"/>
        <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
    <DataField name="Sepal.Width" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
    <DataField name="Petal.Length" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
    <DataField name="Petal.Width" optype="categorical" dataType="string">
        <Value value="high"/>
        <Value value="low"/>
    </DataField>
</DataDictionary>

vruusmann commented 2 years ago

The JPMML-XGBoost project uses R-based integration tests: https://github.com/jpmml/jpmml-xgboost/blob/1.6.2/pmml-xgboost/src/test/resources/xgboost.R

There are some more examples about setting up contrasts when working with categorical data:

https://github.com/jpmml/jpmml-xgboost/blob/1.6.2/pmml-xgboost/src/test/resources/xgboost.R#L54-L58

https://github.com/jpmml/jpmml-xgboost/blob/1.6.2/pmml-xgboost/src/test/resources/xgboost.R#L101-L105

https://github.com/jpmml/jpmml-xgboost/blob/1.6.2/pmml-xgboost/src/test/resources/xgboost.R#L177-L181

Please note that in the "Audit" example, the initial matrix object is later transformed to a sparse Matrix::Matrix object in order to spare computer resources.
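
The dense-to-sparse step might look roughly like this. It is a sketch on a tiny hypothetical data frame, not the code from the linked test script, and it assumes the Matrix package is installed.

```r
library(Matrix)

# Small stand-in for a dense indicator matrix (hypothetical data)
df = data.frame(f = factor(c("a", "b", "c", "a")))
dense = model.matrix(~ . - 1, data = df,
                     contrasts.arg = list(f = contrasts(df$f, contrasts = FALSE)))

# Convert to a sparse representation; only the non-zero cells are stored
sparse = Matrix(dense, sparse = TRUE)
print(class(sparse))
print(nnzero(sparse))
```

Indicator matrices are mostly zeros, so the benefit grows with the number of category levels.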

upoi commented 2 years ago

Thank you very much!