Add support for `stats::formula` objects

schlichtanders commented 6 years ago

The package's README (https://github.com/jpmml/r2pmml#model-formulae) says that one can use ordinary R syntax to define arithmetic processing of the data when using GLM and similar model types.

Are such formulae also supported independently of LM/GLM, that is, to create simple models that involve nothing but simple arithmetic?

If possible, can you provide example code? If not, can it be supported in general?

vruusmann commented 6 years ago

In R there are two approaches for declaring label and features:

"Matrix interface": model(x = features, y = label)
"Formula interface": model(label ~ features, data = data)

Arithmetic operations are supported only with the "formula interface", because this way they become part of the model object (eg. can be serialized/deserialized in RDS data format). However, the support for the "formula interface" varies considerably between R packages: it is best supported by several built-in packages (eg. the stats package, which provides the glm() and lm() functions), reasonably supported by several others (eg. the earth and randomForest packages), and not supported at all by many more.
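To illustrate the point about arithmetic becoming part of the model object, here is a minimal sketch that uses only the built-in stats package and the mtcars dataset. The choice of columns and the output file name "lm.pmml" are assumptions made for illustration; the r2pmml call is left commented out because it needs the r2pmml package and a Java runtime.

```r
data(mtcars)

# I() protects the division so that "/" means plain arithmetic,
# not a formula operator
fit <- lm(mpg ~ I(hp / wt) + log(disp), data = mtcars)

# The formula, including the arithmetic, is stored inside the model object
print(formula(fit))

# ... and therefore survives a round trip through the RDS data format
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)
print(attr(terms(restored), "term.labels"))

# Conversion step (assumes r2pmml is installed):
# library("r2pmml")
# r2pmml(fit, "lm.pmml")
```

With the "matrix interface" the same transformations would have to be applied to the data before fitting, so the model object would know nothing about them.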

You need to check the documentation of your target R package/function to see whether it supports the "formula interface" or not.

> If possible, can you provide example code?
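As a rough illustration of the two interfaces side by side, the built-in stats package happens to offer both for linear regression: lm() takes a formula, while lm.fit() takes a feature matrix and a label vector. This is only a sketch on the mtcars dataset; other packages name and structure their interfaces differently.

```r
data(mtcars)

# "Formula interface": label and features are declared in a formula
fit_formula <- lm(mpg ~ wt + hp, data = mtcars)

# "Matrix interface": features and label are passed as separate objects
# (lm.fit() does not add an intercept column by itself)
x <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg
fit_matrix <- lm.fit(x = x, y = y)

# Both fits yield the same coefficients ...
print(all.equal(unname(coef(fit_formula)), unname(fit_matrix$coefficients)))

# ... but only the formula-based object remembers how the features were declared
print(formula(fit_formula))
print(is.null(fit_matrix$terms))
```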

See the following presentation: https://www.slideshare.net/VilluRuusmann/converting-r-to-pmml-82182483

There are many in-formula feature engineering examples starting from slide 13.

schlichtanders commented 6 years ago

Thanks a lot for the many explanations, comments and the link. That is great.

Looking over it, you have many examples with "as.formula". These are exactly the things which I would like to have WITHOUT wrapping them into a linear model or anything else. Just these formulae, straight. No special R package. Is that really not possible?

I hoped for something like a plain "model" function provided by r2pmml, which would be a kind of identity wrapper around the formula, or something like that.


vruusmann commented 6 years ago

> These are exactly the things which I would like to have WITHOUT wrapping it into a linear model

You mean taking a stats::formula object, and converting it into a PMML fragment?

formula = as.formula(...)
r2pmml(formula, "formula.pmml")

What will happen to those PMML fragments afterwards? Do you want to copy-paste them manually to someplace else?

The PMML thinking is that formula objects cannot exist in isolation. They have to be associated with a model object or, alternatively, be converted to some sort of function definition (typically a DerivedField element).

However, it would be possible to teach the r2pmml package to take notice of stats::formula objects, and emit a partial result in this case (ie. the result wouldn't be a complete PMML document, but a fragment of it).

schlichtanders commented 6 years ago

Thank you very much for the explanations and for paraphrasing my thoughts. Now I really feel understood.

Thanks a lot!

vruusmann commented 6 years ago

Suppose you create a stats::formula object like this:

#library("r2pmml")
formula = as.formula(y ~ I(x1 + x2))
#r2pmml(formula, "formula.pmml")

A formula object could be translated to a singleton DerivedField element. However, this element cannot exist in isolation; there must be accompanying DataField elements that define its input and output fields (names, data and operational types, etc).
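For reference, the pieces that such a translation needs can already be read off a formula object with base R alone. The following sketch only inspects the formula; the variable names output_name, input_names and operation are made up for illustration, and nothing here is r2pmml functionality.

```r
formula <- as.formula(y ~ I(x1 + x2))

# A two-sided formula is a call of the form `~`(lhs, rhs)
lhs <- formula[[2]]
rhs <- formula[[3]]

# Field names on either side
output_name <- all.vars(lhs)
input_names <- all.vars(rhs)

# Strip the I() wrapper to reach the arithmetic expression itself
operation <- rhs[[2]]

print(output_name)
print(input_names)
print(as.character(operation[[1]]))
```

What the formula cannot tell is the data type and operational type of each field, which is why the DataField declarations have to come from somewhere else (eg. the training dataset).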

A corresponding PMML fragment might look like this:

<PMML>
  <DataDictionary>
    <DataField name="x1" dataType="double" optype="continuous"/>
    <DataField name="x2" dataType="double" optype="continuous"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="y" dataType="double" optype="continuous">
      <Apply function="+">
        <FieldRef field="x1"/>
        <FieldRef field="x2"/>
      </Apply>
    </DerivedField>
  </TransformationDictionary>
</PMML>
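Until something like this is supported, the DerivedField part of such a fragment could be approximated by hand. Below is a minimal sketch in plain R; to_pmml() and derived_field() are hypothetical helpers, not part of r2pmml. It assumes that every field is a continuous double and handles only field names, numeric constants, I(), parentheses and the four binary arithmetic operators.

```r
# Hypothetical helper: translate an R expression into PMML markup
to_pmml <- function(expr, indent = "") {
  if (is.name(expr)) {
    return(sprintf('%s<FieldRef field="%s"/>', indent, as.character(expr)))
  }
  if (is.numeric(expr)) {
    return(sprintf('%s<Constant dataType="double">%s</Constant>', indent, format(expr)))
  }
  fun <- as.character(expr[[1]])
  # I() and parentheses only group; they produce no PMML element
  if (fun %in% c("I", "(")) {
    return(to_pmml(expr[[2]], indent))
  }
  # Only binary arithmetic is covered by this sketch
  stopifnot(fun %in% c("+", "-", "*", "/"), length(expr) == 3)
  args <- vapply(as.list(expr)[-1], to_pmml, character(1),
                 indent = paste0(indent, "  "))
  paste(c(sprintf('%s<Apply function="%s">', indent, fun),
          args,
          sprintf("%s</Apply>", indent)),
        collapse = "\n")
}

# Hypothetical helper: wrap the right-hand side in a DerivedField element
# (assumption: the output field is a continuous double)
derived_field <- function(formula) {
  paste(c(sprintf('<DerivedField name="%s" dataType="double" optype="continuous">',
                  as.character(formula[[2]])),
          to_pmml(formula[[3]], indent = "  "),
          "</DerivedField>"),
        collapse = "\n")
}

fragment <- derived_field(y ~ I(x1 + x2))
cat(fragment, "\n")
```

The matching DataField elements are deliberately left out, because their data types cannot be inferred from the formula alone.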

This kind of "partial conversion" can be very helpful if you're trying to convert a piece of R (or Python) code into PMML. It will be very easy to copy the above DataField and DerivedField elements and paste them into some other PMML document (that needs to be enhanced with more feature engineering logic).

Add support for `stats::formula` objects #35