jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Any plan to support rms package linear models? #15

Open wei-wu-nyc opened 7 years ago

wei-wu-nyc commented 7 years ago

Hi, do you have plans to support the linear models that are included in the rms package? More specifically, the feature I am interested in is support for the restricted cubic spline transformation, via the rcs() function.

Thanks.

vruusmann commented 7 years ago

At first glance, the rms::lrm() model type and the rms::rcs() function type both seem doable.

Could you provide a small R code example of exactly what functionality you need? The example could be based on the Auto-MPG dataset (mpg ~ .).

wei-wu-nyc commented 7 years ago

Thanks for the quick reply. I will try to come up with an example when I get a chance and will let you know.

Wei

wei-wu-nyc commented 7 years ago

Hi Villu,

Attached is a simple script using the mpg data. The models used are:

  1. Simple linear model of mpg~weight
  2. Default rcs fit on mpg~rcs(weight)
  3. rcs fit specifying number of knots
  4. rcs fit specifying knot locations.

Hope this helps. Let me know if I can be of further help.

Wei

vruusmann commented 7 years ago

@wei-wu-nyc Can't find your R script.

Perhaps GitHub "ate" it because of bad file extension? You could rename it to myscript.R.txt, and re-attach via GitHub web UI.

wei-wu-nyc commented 7 years ago

Here is the attachment. Just to be safe, I also copied the code below.

library(rms)

mpgdata=read.csv('Auto.csv')

# 1. Simple linear model
model1=ols(mpg~weight,data=mpgdata)
# 2. Default rcs fit
model2=ols(mpg~rcs(weight), data=mpgdata)
# 3. rcs fit with the number of knots specified
model3=ols(mpg~rcs(weight, nk=5), data=mpgdata)
# 4. rcs fit with explicit knot locations
model4=ols(mpg~rcs(weight, knots=c(2000,2500,3000,3500,4000,4500,5000)), data=mpgdata)

plot(mpgdata$weight, mpgdata$mpg)
lines(mpgdata$weight, predict(model1, mpgdata), col='red')

# Sort by weight so that the spline fits plot as smooth curves
mpgdata=mpgdata[order(mpgdata$weight),]
lines(mpgdata$weight, predict(model2, mpgdata), col='green')
lines(mpgdata$weight, predict(model3, mpgdata), col='blue')
lines(mpgdata$weight, predict(model4, mpgdata), col='yellow')

I don't know how the strikethrough formatting got into the text, but I think you can still read it.

r2pmml_rcs_example.R.txt

vruusmann commented 7 years ago

@wei-wu-nyc Thanks for clarifying the "minimum viable product".

I hope to find time to work on this next week. GitHub should keep you notified about my progress as you've been auto-subscribed to this issue.

wei-wu-nyc commented 7 years ago

Thanks. Looking forward to testing it out. Currently, I have to save an rms model, which results in a 300+ MB RData file. I have to load that into R just to do predictions on new data.

One thing to note is that when I save an rms model file, all the parameters (explicit or implicit, such as the automatically determined number of knots and the knot points) are saved. I am not quite sure how PMML works exactly, but I suggest that these parameters be saved together in the generated PMML file.

vruusmann commented 7 years ago

Currently, I have to save a rms model which results to a 300+MB RData file.

R's lm model type and all its subtypes (such as lrm) include the full training dataset. It shouldn't affect the functionality of your model in any way if you simply nullify this attribute before RDS save. A typical lm model object shouldn't exceed 1 MB then.
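Taking model2 from your earlier script as an example, a minimal sketch of that trimming (the exact component names that hold the training data differ between lm, ols and lrm objects, so inspect the fitted object first):

sapply(model2, object.size)    # identify the heavy components
model2$x = NULL                # drop embedded data components, if present
model2$y = NULL
model2$model = NULL
saveRDS(model2, "model2.rds")  # the saved object should now be a small fraction of the original size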

I am not quite sure how pmml works exactly.

During the conversion to PMML, I need to determine which fields are passed through the rcs() function. Then I need to collect the corresponding knot counts and coefficients, and generate DerivedField elements that reproduce R's knot evaluation algorithm. Should be fully compliant with the PMML specification.
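For reference, the spline basis that these DerivedField elements would need to reproduce can be evaluated directly in R with Hmisc::rcspline.eval() (a sketch only, reusing the data and knot vector from the example script above):

library(Hmisc)

knots = c(2000, 2500, 3000, 3500, 4000, 4500, 5000)
# One column per spline term (the first column is weight itself); the linear
# predictor is then intercept + basis %*% coefficients
basis = rcspline.eval(mpgdata$weight, knots = knots, inclx = TRUE)
head(basis)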

wei-wu-nyc commented 7 years ago

Thank you for the tip on trimming the object.

Another rms-specific detail to keep in mind: in rms, the linear regression model is ols(), while lrm() is the classification model ("Logistic Regression Model", hence the name).

vruusmann commented 7 years ago

Spline interpolation is represented by mapping a value range to a function.

PMML 4.3 doesn't have a high-level transformation for "continuous" lookup tables. Something like that could be emulated using "if" functions, but it wouldn't be too human-friendly (as "if" functions would be very deeply nested).
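As a loose R analogy (not PMML) of what such nested "if" functions amount to, with purely illustrative breakpoints and per-segment stand-in functions:

f_low  = function(x) 0.10 * x
f_mid  = function(x) 0.05 * x + 125
f_high = function(x) 0.02 * x + 245

piecewise = function(x) {
  ifelse(x < 2500, f_low(x),
         ifelse(x < 4000, f_mid(x), f_high(x)))
}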

I've opened a feature request with the PMML working group to discuss good long-term solutions: http://mantis.dmg.org/view.php?id=176

My suggestion is to "relax" Discretize and MapValues transformations, so that they could compute the return value dynamically. Unfortunately, the PMML working group doesn't support this suggestion, and wants to come up with a completely new transformation.

So, we're going to have to wait for a couple of weeks to get a better understanding about which way to go.

wei-wu-nyc commented 7 years ago

Thanks.

vruusmann commented 7 years ago

@wei-wu-nyc What kind of PMML consumer software would you be using? If you intend to use something that is based on the JPMML-Evaluator library, then we could go on and "relax" Discretize and MapValues transformations for the time being.

In other words, would you be happy with a "proprietary extension" that will be available in early April?

wei-wu-nyc commented 7 years ago

@vruusmann To tell you the truth, I haven't explored much yet regarding what PMML software we are going to use. For my purposes, I would like to output the PMML file and later load it either into R (presumably, if I use the r2pmml package, I will be using the JPMML-Evaluator library?) or into our production system, which is on Spark and thus mostly written in Scala, so I assume it should be able to load the Java JPMML-Evaluator library. The main goal is to be able to call the prediction function without loading the original fitted R model object, and the output from either platform should be identical, or very close, given the same inputs. If the JPMML-Evaluator library is available on both platforms, that will work for me.

vruusmann commented 7 years ago

@wei-wu-nyc I was thinking that if you were going to deploy your models on some "big vendor" platform, then this proprietary extension wouldn't work (at least not within a reasonable time frame).

If you're looking to consume PMML models on Apache Spark ML, then you will be much more productive using the high-level JPMML-Spark library; it's a thin wrapper around the low-level JPMML-Evaluator library, and it takes care of plumbing the Apache Spark ML and PMML systems together in the best way possible. When your workflow runs on the JPMML family of libraries end-to-end, you don't need to worry about the reproducibility of predictions - I've already taken good care of that.

wei-wu-nyc commented 7 years ago

@vruusmann I actually am not using Spark ML, for various reasons. I am a quant/data scientist, not a programmer, as you can probably tell from the above conversation. The decision of which machine learning library/package to use is based on the model quality and feature set of the various packages in the model development cycle for this particular project. Not many linear modeling packages support spline fitting options, or variable interactions. Although it is possible to populate the spline terms manually and use a generic glm model that does not support spline fitting, there are probably many edge cases in the spline fitting that would need to be debugged carefully. My decision was to use the rms R package for part of my models.

The truth is that in the modeling stage I don't really use Spark (we are planning to migrate to Spark even for the modeling stage, for training and cross-validation etc.). Currently we only use Spark on the production and application side of the process, so I actually haven't investigated Spark ML enough. The last time I looked at Spark ML, it was relatively inefficient both in terms of memory usage and speed, compared to H2O.

This request is only a part of my whole model system. My problem involves a pretty large dataset (in the 10s-100 GB range), so I divided my models into different stages. One of the stages (for a smaller data set) uses the rms package to construct spline fitting models. (If it were done on the big dataset, it would run out of memory.) For the models in other stages, which are trained on the whole dataset, I use H2O's h2o.glm() model. I have portability issues over there too. I was going to ask at the H2O user forum about that, so I didn't mention it here. Now that it comes up, I will mention it here too.

I also have issues with the portability of H2O models for prediction. They don't currently support output of a PMML file. However, what they do support is output of a POJO file for the prediction function. Since our production platform is Spark, I thought it might work for my situation. One problem with this is that it won't work on the R side without reloading the H2O model in R, as there is no simple way to load the POJO file into R.

This is the first time I am dealing with PMML, so I am pretty new to PMML usage and deployment. Given the overall picture of my current project, do you have any suggestions for me?

Have you heard of any support of PMML for H2O models?

Thanks a lot for your help.

vruusmann commented 7 years ago

@wei-wu-nyc Thank you for explaining your data science process. This is very interesting/useful information, and very difficult for me to obtain otherwise.

It's also possible to "transpile" PMML documents into POJOs (currently a private project). It brings considerable performance improvements, but is technically more difficult to maintain. With PMML documents, model deployment and undeployment are a matter of uploading and deleting a text file, whereas with POJOs you have all sorts of Java class loading/unloading complexities, long-term storage issues (very likely, your H2O POJOs are tightly coupled to a particular H2O API version), etc. And PMML documents are much easier to parse/interpret if you want to understand (as a human) the computation that the model performs.

I think it shouldn't be too difficult to build my own H2O-to-PMML converter. It has been successfully achieved with R and Scikit-Learn, which are backed by non-Java languages/technologies. So, H2O, which is backed by Java (or at least designed to be heavily interoperable with it), should be a walk in the park.

What's the requirement behind (re-)loading external models into R? Ideologically, R and PMML use different concepts for representing transformations and models, so the "conversion event" should be regarded as a one-way street ("easy to go from pig to sausage, hard to go back"). If you simply want to use external models for prediction (e.g. executing a model against a data.frame object, something like a predict.pmml function), then it will be possible to invoke JPMML-Evaluator functionality straight from a running R session.

wei-wu-nyc commented 7 years ago

The main reason for me to have the ability to re-load a PMML model into R is for debugging or reporting purposes. What you described, "invoke JPMML-Evaluator functionality straight from a running R session", is exactly what I need. Basically, I want to be able to do the predictions in R and get the same results as the Java JPMML client/consumer function, so that in case of discrepancies I can easily replicate and compare against the original models. Also, as a faster-loading, standalone prediction function, loading PMML into R may be an option when I need to do some analysis/stats/graphs of the models' prediction output.

wei-wu-nyc commented 7 years ago

@vruusmann Which R package should I use for loading a PMML model and doing predictions with it in R, i.e. for "invoking JPMML-Evaluator" functionality in R as you described above?

vruusmann commented 7 years ago

@wei-wu-nyc Like many other projects/tools, it's not public yet. But it's fairly easy to achieve similar functionality on your own if you deploy a local Openscoring REST web service.

In that case, you could write a small helper R function that does the following (a rough sketch follows the list):

  1. Save input data data.frame object to a temporary CSV/TSV file.
  2. Send this temporary input CSV file (using HTTP POST method; you can use R's RCurl or httr packages for that) to Openscoring's http://localhost:8080/openscoring/model/${id}/csv endpoint. Capture its response to another temporary CSV/TSV file.
  3. Read this temporary output CSV file to results data.frame object, and return it to user.
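A minimal sketch of such a helper using httr, assuming a model has already been deployed under the identifier "MyModel" (the identifier, host/port and the content type expected by the /csv endpoint are assumptions to be verified against your Openscoring setup):

library(httr)

evaluate_csv = function(df, model_id = "MyModel") {
  # 1. Save the input data.frame to a temporary CSV file
  input_file = tempfile(fileext = ".csv")
  write.csv(df, input_file, row.names = FALSE)
  # 2. POST it to the Openscoring CSV endpoint
  response = POST(
    url = paste0("http://localhost:8080/openscoring/model/", model_id, "/csv"),
    body = upload_file(input_file, type = "text/plain")
  )
  # 3. Read the response back into a results data.frame and return it
  read.csv(text = content(response, as = "text"))
}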

vruusmann commented 7 years ago

@wei-wu-nyc Even simpler, you don't need to bring an Openscoring REST web service into play. Simply invoke the Java command-line application class org.jpmml.evaluator.EvaluationExample, which takes a PMML model file, an input CSV file and a results CSV file as arguments:

write.csv(in_df, "input.csv", row.names = FALSE)
system2("java", c("-cp", "example-1.3-SNAPSHOT.jar", "org.jpmml.evaluator.EvaluationExample", "--model", "/path/to/model.pmml", "--input", "input.csv", "--output", "output.csv"))
out_df = read.csv("output.csv")

You can obtain this example-1.3-SNAPSHOT.jar file by building the JPMML-Evaluator project from a source checkout. The build places it into the pmml-evaluator-example/target directory; further instructions are given in the README.md file.

guleatoma commented 7 years ago

Hello,

I'm also interested in support for the rms package, which, to me, is the best package for logistic regression. I wanted to add a couple of things to the discussion.