jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Unexpected conversion of variable types when excluded from model building. #4

Status: Closed (warner121 closed 8 years ago)

warner121 commented 8 years ago

I can write 3 random forest models in PMML format as follows:

# import libraries
library(r2pmml)
library(randomForest)

# import iris dataset
data("iris")

# build random forest model and write PMML
model.rf <- randomForest(Species ~ ., data=iris, ntree=100)
r2pmml(model.rf, "iris.rf.pmml")

# change Sepal.Width to boolean, rebuild random forest model and convert to PMML
iris$Sepal.Width <- as.logical(iris$Sepal.Width > 3)
model.rf <- randomForest(Species ~ ., data=iris, ntree=100)
r2pmml(model.rf, "iris.rf.bool.pmml")

# rebuild random forest model excluding Sepal.Length and convert to PMML
model.rf <- randomForest(Species ~ . - Sepal.Length, data=iris, ntree=100)
r2pmml(model.rf, "iris.rf.bool.exc.pmml")

However, on reviewing the results, I find that excluding Sepal.Length from the model has changed its data type in the data dictionary from double to boolean:

  > grep 'DataField name="Sepal.Length"' iris.rf.*
  iris.rf.bool.exc.pmml:        <DataField name="Sepal.Length" optype="categorical" dataType="boolean"/>
  iris.rf.bool.pmml:        <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  iris.rf.pmml:        <DataField name="Sepal.Length" optype="continuous" dataType="double"/>

Valid datasets now fail validation: even though Sepal.Length is excluded from the model, the PMML file expects it to be a boolean, although only Sepal.Width was modified in the R code.

vruusmann commented 8 years ago

Class RandomForestConverter attempts to infer the data type of a field (i.e. "double" vs. "boolean") based on its split values. This behavior was explained earlier today on the JPMML mailing list: https://groups.google.com/d/msg/jpmml/H-fPXeOB-e8/tEclbfs5AgAJ
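
To make the idea of inferring a data type from split values concrete, here is a minimal R sketch. It is not the converter's actual code: the function name infer_data_type and the "all splits equal 0.5" rule are assumptions made for illustration, based on the observation that a logical predictor coded as 0/1 can only be split halfway between the two values.

```r
# Hypothetical helper, NOT part of r2pmml or JPMML.
# Assumption: a logical predictor is coded as 0/1, so every split on it
# falls at 0.5; any other split value implies a continuous field.
infer_data_type <- function(split_values) {
  if (length(split_values) > 0 && all(split_values == 0.5)) {
    "boolean"
  } else {
    "double"
  }
}

# Split values as they might be collected per field from the trees
# (made-up numbers, for illustration only).
splits_logical <- c(0.5, 0.5, 0.5)
splits_numeric <- c(2.45, 4.75, 0.5)

type_logical <- infer_data_type(splits_logical)
type_numeric <- infer_data_type(splits_numeric)

print(type_logical)
print(type_numeric)
```

A heuristic of this kind depends entirely on the split values being attributed to the right field, which is why a field mix-up shows up as a wrong data type.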

Here, however, the problem is that the "to boolean" conversion is applied to the field Sepal.Width, but appears to take effect on the field Sepal.Length instead?

warner121 commented 8 years ago

Exactly. The only conversion applied in the R code is to Sepal.Width, yet Sepal.Length is affected as well.

vruusmann commented 8 years ago

Turns out that the randomForest$terms field cannot be trusted when compiling the list of active fields for RF models that have been trained using the formula interface. One needs to deal with the randomForest$forest$xlevels field instead.

With Species ~ ., both fields have four elements. With Species ~ . - Sepal.Length, however, randomForest$terms still has four elements, whereas randomForest$forest$xlevels has only three.

So, this issue was really about the bad indexing of active fields. The removal of the Sepal.Length active field caused the positions of the remaining active fields to be wrongly shifted one place to the "left".
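
The index shift described above can be reproduced in a few lines of base R. This is only a sketch of the failure mode, not the converter's implementation: the vectors all_fields, active_fields and inferred_types are hypothetical stand-ins for the four-element and three-element field lists and for the data types inferred per active field.

```r
# Hypothetical illustration of the index shift; not the converter's code.

# All candidate fields in column order (the four-element list).
all_fields <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Fields actually used by Species ~ . - Sepal.Length (the three-element list),
# together with the data type inferred for each of them.
active_fields <- c("Sepal.Width", "Petal.Length", "Petal.Width")
inferred_types <- c("boolean", "double", "double")

# Buggy lookup: position i of the forest is resolved against the full list,
# so every type lands one place to the left of the field it belongs to.
buggy <- setNames(inferred_types, all_fields[seq_along(inferred_types)])

# Correct lookup: position i is resolved against the active field list.
fixed <- setNames(inferred_types, active_fields)

print(buggy)
print(fixed)
```

In the buggy mapping the "boolean" type ends up attached to Sepal.Length, which matches the symptom reported at the top of this issue; in the corrected mapping it stays with Sepal.Width.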