jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Raise a proper error about improper glm() target column type #67

Closed arodionoff closed 3 years ago

arodionoff commented 3 years ago

I use RStudio Cloud with R 4.0.3, but running your {r2pmml} example code

install.packages("readr")
install.packages("r2pmml")
library(readr)
audit.df <- readr::read_csv("https://raw.githubusercontent.com/vruusmann/blog/gh-pages/assets/data/audit.csv")
audit.terms = c("Adjusted ~ .")
# Feature engineering
audit.formula = as.formula(paste(audit.terms, collapse = " "))
audit.glm = glm(formula = audit.formula, family = binomial(link = "logit"), data = audit.df)
library("r2pmml")
r2pmml::r2pmml(audit.glm, "RExpAudit.pmml")

was interrupted with an error:

SEVERE: Failed to convert
java.lang.IllegalArgumentException: Invalid 'Adjusted' element. Expected integer, got numeric
    at org.jpmml.rexp.RGenericVector.getVectorElement(RGenericVector.java:127)
    at org.jpmml.rexp.RGenericVector.getIntegerElement(RGenericVector.java:94)
    at org.jpmml.rexp.RGenericVector.getFactorElement(RGenericVector.java:80)
    at org.jpmml.rexp.RGenericVector.getFactorElement(RGenericVector.java:76)
    at org.jpmml.rexp.GLMConverter.encodeSchema(GLMConverter.java:58)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:69)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)

Exception in thread "main" java.lang.IllegalArgumentException: Invalid 'Adjusted' element. Expected integer, got numeric
    at org.jpmml.rexp.RGenericVector.getVectorElement(RGenericVector.java:127)
    at org.jpmml.rexp.RGenericVector.getIntegerElement(RGenericVector.java:94)
    at org.jpmml.rexp.RGenericVector.getFactorElement(RGenericVector.java:80)
    at org.jpmml.rexp.RGenericVector.getFactorElement(RGenericVector.java:76)
    at org.jpmml.rexp.GLMConverter.encodeSchema(GLMConverter.java:58)
    at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:69)
    at org.jpmml.rexp.Converter.encodePMML(Converter.java:39)
    at org.jpmml.rexp.Main.run(Main.java:149)
    at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, converter, converter_classpath, verbose) : 
  The JPMML-R conversion application has failed (error code 1). The Java executable should have printed more information about the failure into its standard output and/or standard error streams

Checking showed that the error is caused by the argument family = binomial(link = "logit").

vruusmann commented 3 years ago

java.lang.IllegalArgumentException: Invalid 'Adjusted' element. Expected integer, got numeric

You can't train a binary classification model with a continuous label!

The audit.df$Adjusted column must be converted from integer to factor:

audit.df = read.csv("audit.csv")
audit.df$Adjusted = as.factor(audit.df$Adjusted)

I use RStudio Cloud with R 4.0.3, but running your {r2pmml} example code

My code does the above type conversion correctly. Yours doesn't.
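
For readers following along, here is a minimal sketch of that conversion on made-up data. The toy.df data frame and its Age and Hours columns are assumptions for illustration and merely stand in for audit.csv; the label is stored as a plain number to mimic the "got numeric" situation from the error message.

```r
# Hypothetical toy data standing in for audit.csv (not the real dataset)
set.seed(42)
toy.df <- data.frame(
  Age = round(runif(200, min = 18, max = 70)),
  Hours = round(runif(200, min = 10, max = 60))
)

# A 0/1 label stored as a plain number, which is what the converter complained about
toy.df$Adjusted <- as.numeric(
  rbinom(200, size = 1, prob = plogis(-4 + 0.05 * toy.df$Age + 0.03 * toy.df$Hours))
)
class.before <- class(toy.df$Adjusted)

# Convert the label to a factor before fitting the binomial model
toy.df$Adjusted <- as.factor(toy.df$Adjusted)
class.after <- class(toy.df$Adjusted)

toy.glm <- glm(Adjusted ~ ., family = binomial(link = "logit"), data = toy.df)

# The model frame kept inside the glm object now carries a factor response
print(c(before = class.before, after = class.after))
print(is.factor(toy.glm$model$Adjusted))
```

Checking class() on the target column right after loading the data is a cheap way to catch this before the model ever reaches the converter.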

vruusmann commented 3 years ago

Keeping this issue open - there needs to be a proper error message pointing out the improper glm() function usage.
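
Until the converter reports this more clearly, a pre-flight check on the R side can serve a similar purpose. The sketch below is only an illustration: check_glm_target is a hypothetical helper, not part of r2pmml or JPMML-R, and it is deliberately narrow (it would also reject logical or two-column matrix targets, which glm() itself accepts).

```r
# Hypothetical helper, not part of r2pmml: fail early with a readable message
# when a binomial glm was fitted against a non-factor target.
check_glm_target <- function(model) {
  stopifnot(inherits(model, "glm"))
  response <- model.response(model.frame(model))
  if (family(model)$family == "binomial" && !is.factor(response)) {
    stop(sprintf(
      "glm() target is of class '%s'; convert it with as.factor() before fitting a binomial model",
      class(response)[1]
    ))
  }
  invisible(model)
}

# Made-up data for the demonstration
set.seed(7)
demo.df <- data.frame(x = rnorm(100))
demo.df$y <- as.numeric(rbinom(100, size = 1, prob = plogis(demo.df$x)))

# Numeric target: the check raises an error, captured here as a message
bad.glm <- glm(y ~ x, family = binomial(link = "logit"), data = demo.df)
bad.msg <- tryCatch({ check_glm_target(bad.glm); "ok" },
                    error = function(e) conditionMessage(e))

# Factor target: the check passes silently
demo.df$y <- as.factor(demo.df$y)
good.glm <- glm(y ~ x, family = binomial(link = "logit"), data = demo.df)
good.msg <- tryCatch({ check_glm_target(good.glm); "ok" },
                     error = function(e) conditionMessage(e))

print(bad.msg)
print(good.msg)
```

Calling such a check immediately before r2pmml::r2pmml() would surface the problem in R terms instead of as a Java stack trace.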

arodionoff commented 3 years ago

Thanks a lot for the clarification about audit.df$Adjusted = as.factor(audit.df$Adjusted)

It would be very nice if your page also made these changes regarding the type of the outcome.

vruusmann commented 3 years ago

It would be very nice if your page also made these changes regarding the type of the outcome.

My page links to the train.R file, which does proper data loading.

The glm() function itself will issue a warning when a binary classification task is run against a continuous label. This is beginner-level stuff that is completely out of scope for PMML-oriented technical articles.

Next time, pay attention to the output of your own script first!

arodionoff commented 3 years ago

Thanks for the link, but it would be nice to include the code audit.df$Adjusted = as.factor(audit.df$Adjusted) on your page.

By the way, do you have an example of using nested pre-processing functions for handling missing values and discretizing a continuous variable for your package {r2pmml}?

For example, for a classic case:

library('smbinning')        # Scoring Modeling and Optimal Binning

data(smbsimdf1, package = 'smbinning')
result <- smbinning::smbinning(df = smbsimdf1, y="fgood", x="cbs1") # Run and save result

result$bands # Bins or bands
bands <- result$bands
cuts <- result$cuts

smbsimdf1$fgood  = as.factor(smbsimdf1$fgood)
smb.formula = as.formula( fgood ~ ifelse( is.na(cbs1), 'Missing', cut(x = cbs1, breaks = bands) ) )

( smb.glm <- glm( formula = smb.formula, family =  binomial(link = "logit"), data = smbsimdf1 ) )
r2pmml::r2pmml( smb.glm, 'smbsimdf1.xml')

So far, this case does not work, although it is possible to write the equivalent PMML by hand.

vruusmann commented 3 years ago

it would be nice to include the code audit.df$Adjusted = as.factor(audit.df$Adjusted) on your page.

My page is about PMML. I've provided a complete R training script in the resources section, which does everything correctly. If you choose to ignore my R training script and do everything your own way, then that's none of my business anymore.

do you have an example of using nested pre-processing functions for handling missing values and discretizing a continuous variable for your package

The JPMML-R conversion library can only see R feature engineering that is directly attached to the smb.glm object.

Right now, all your "smbinning work" is happening in a separate code block, and the result object is not reachable from the smb.glm object in any way.

Also, if you save the smb.glm object to a file using the saveRDS() function, and load it back into a fresh R session using the readRDS() function, you won't be able to use it for prediction for exactly the same reason - the "smbinning work" is missing/not defined.
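
The reachability problem can be reproduced in plain R, without the converter. In the sketch below, all data and names (bin.df, brk, glm.external, glm.inline) are made up for illustration, and removing brk stands in for starting a fresh session. It only shows what R itself can resolve at prediction time; whether the converter supports a particular in-formula expression is a separate question that this example does not settle.

```r
# Made-up data for the demonstration
set.seed(1)
bin.df <- data.frame(x = runif(300, min = 0, max = 100))
bin.df$y <- as.factor(rbinom(300, size = 1, prob = plogis((bin.df$x - 50) / 20)))

# Case 1: the break points live in a free-standing variable,
# so the formula only holds a reference to the name 'brk'
brk <- c(0, 25, 50, 75, 100)
glm.external <- glm(y ~ cut(x, breaks = brk),
                    family = binomial(link = "logit"), data = bin.df)

# Case 2: the break points are written into the formula itself
glm.inline <- glm(y ~ cut(x, breaks = c(0, 25, 50, 75, 100)),
                  family = binomial(link = "logit"), data = bin.df)

# Simulate a fresh session in which the helper variable no longer exists
rm(brk)
new.df <- data.frame(x = c(10, 60))

# Prediction with the first model fails because 'brk' cannot be found
external.result <- tryCatch(predict(glm.external, newdata = new.df),
                            error = function(e) conditionMessage(e))

# Prediction with the second model still works
inline.result <- tryCatch(predict(glm.inline, newdata = new.df),
                          error = function(e) conditionMessage(e))

print(external.result)
print(inline.result)
```

The second model is self-contained in the sense used above: everything needed to recompute the binned feature is stored with the model object rather than in the surrounding workspace.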

When using Scikit-Learn we don't have this problem, because the model step and all its prerequisite feature engineering steps are contained in a single pipeline object (which can be saved and loaded atomically). The R paradigm is painfully lacking in this space (and it's not (R2)PMML's fault).