arodionoff closed this issue 3 years ago
java.lang.IllegalArgumentException: Invalid 'Adjusted' element. Expected integer, got numeric
You can't train a binary classification model with a continuous label!
The Audit$Adjusted column must be converted from integer to factor:
audit.df = read.csv("audit.csv")
audit.df$Adjusted = as.factor(audit.df$Adjusted)
I use RStudio Cloud with R 4.0.3, but execution of your code about {r2pmml}
My code does the above type conversion correctly. Yours doesn't.
Keeping this issue open - there needs to be a proper error message pointing out the improper glm() function usage.
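For illustration, a check of this kind could be sketched in plain R as follows. The helper name check_binomial_label and the toy data frame are hypothetical and are not part of r2pmml or JPMML-R; the sketch only shows how the label type of a fitted binomial glm() can be inspected before conversion.

```r
# Hypothetical helper (not part of r2pmml): fail early when a binomial
# glm() was trained on a label that is not a factor.
check_binomial_label = function(model){
  # glm() keeps the model frame by default, so the label can be recovered
  y = model.response(model.frame(model))
  if(family(model)$family == "binomial" && !is.factor(y)){
    stop("Binomial glm() expects a factor label, got ", class(y)[1],
      ". Convert it first, e.g. df$label = as.factor(df$label)")
  }
  invisible(TRUE)
}

# Toy data (made up for this sketch)
df = data.frame(
  y = c(0L, 1L, 0L, 1L, 1L, 0L),
  x = c(1.2, 3.4, 0.7, 2.9, 1.1, 2.2)
)

# Integer label: the model fits, but the check rejects it
bad.glm = glm(y ~ x, family = binomial(link = "logit"), data = df)
bad.result = tryCatch(check_binomial_label(bad.glm), error = function(e) conditionMessage(e))
print(bad.result)

# Factor label: the check passes
df$y = as.factor(df$y)
good.glm = glm(y ~ x, family = binomial(link = "logit"), data = df)
good.result = check_binomial_label(good.glm)
```

Running the check right after training surfaces the problem in R, before the model object is handed to the converter.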
Thanks a lot for the clarification about
audit.df$Adjusted = as.factor(audit.df$Adjusted)
It would be very nice if your page also made these changes regarding the type of the outcome.
My page links to the train.R file, which does proper data loading.
The glm() function itself will issue a warning when the binary classification task is run against a continuous label. This is beginner-level stuff that is completely out of the scope of PMML-oriented technical articles.
Next time, pay attention to the output of your own script first!
Thanks for the link, but it would be nice to include the code audit.df$Adjusted = as.factor(audit.df$Adjusted) on your page.
By the way, do you have an example of using nested pre-processing functions for handling missing values and discretizing a continuous variable for your package {r2pmml}?
For example, for a classic case:
library('smbinning') # Scoring Modeling and Optimal Binning
data(smbsimdf1, package = 'smbinning')
result <- smbinning::smbinning(df = smbsimdf1, y="fgood", x="cbs1") # Run and save result
result$bands # Bins or bands
bands <- result$bands
cuts <- result$cuts
smbsimdf1$fgood = as.factor(smbsimdf1$fgood)
smb.formula = as.formula( fgood ~ ifelse( is.na(cbs1), 'Missing', cut(x = cbs1, breaks = bands) ) )
( smb.glm <- glm( formula = smb.formula, family = binomial(link = "logit"), data = smbsimdf1 ) )
r2pmml::r2pmml( smb.glm, 'smbsimdf1.xml')
So far, this case does not work, although it is possible to write the equivalent PMML by hand.
it would be nice to include the code audit.df$Adjusted = as.factor(audit.df$Adjusted) on your page.
My page is about PMML. I've provided a complete R training script in the resources section, which does everything correctly. If you choose to ignore my R training script and do everything your own way, then that's none of my business anymore.
do you have an example of using nested pre-processing functions for handling missing values and discretizing a continuous variable for your package
The JPMML-R conversion library can only see R feature engineering that is directly attached to the smb.glm object.
Right now, all your "smbinning work" is happening in a separate code block, and the result object is not reachable from the smb.glm object in any way.
Also, if you store the smb.glm object into a file using the saveRDS() function, and load it back into a fresh R session using the readRDS() function, you won't be able to use it for prediction for exactly the same reason - the "smbinning work" is missing/not defined.
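As a rough sketch of what "attached to the model object" can look like in plain R: when the bin boundaries are written into the formula as literals, the fitted object no longer depends on any variable from the training session, so it survives a saveRDS()/readRDS() round trip. The toy data and the breaks c(0, 5, 10) are made up for this example, and whether a particular converter recognizes a given formula expression is an assumption that should be verified separately.

```r
# Toy data (made up for this sketch): a factor label and one continuous feature
df = data.frame(
  y = factor(c(0, 1, 0, 1, 1, 0, 1, 0)),
  x = c(1, 9, 2, 8, 3, 7, 6, 4)
)

# The breaks are literals inside the formula, not a variable such as `bands`
# that lives only in the training session
model = glm(y ~ cut(x, breaks = c(0, 5, 10)), family = binomial(link = "logit"), data = df)

# Round trip through a file
path = tempfile(fileext = ".rds")
saveRDS(model, path)
restored = readRDS(path)

# Prediction re-applies cut() with the same literal breaks to the new data
newdata = data.frame(x = c(2.5, 7.5))
p = predict(restored, newdata = newdata, type = "response")
print(p)
```

Because each bin gets its own coefficient, the predicted probabilities here are simply the share of positive labels per bin in the toy data.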
When using Scikit-Learn we don't have this problem, because the model step and all its prerequisite feature engineering steps are contained in a single pipeline object (which can be saved and loaded atomically). The R paradigm is painfully lacking in this space (and it's not (R2)PMML's fault).
I use RStudio Cloud with R 4.0.3, but execution of your code about {r2pmml} was interrupted with an error:
The check showed that it is caused by the presence of "family = binomial(link = "logit")".