jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

XGBoost - Preprocessing Support #69

Open psxmc6 opened 3 years ago

psxmc6 commented 3 years ago

Hi Villu,

I would like to ask for advice on the best way to enrich r2pmml-generated XGBoost PMML with data preprocessing steps.

As you pointed out in this thread, the model formula interface can't be used in combination with an xgboost model.

So far, I've been leveraging the legacy pmml library to produce PMML snippets for the necessary transformations (using e.g. xform_function, xform_norm_discrete), and then merging the resulting transformation-only PMML with the model-only PMML. Ideally, though, I would like to rely on the r2pmml package exclusively.

The aforementioned pmml package does not support all PMML built-in functions, but it provides a way to define the missing functions' logic in the R environment so that they are recognised when called in xform_function (see section PMML functions not supported by xform_function).

I would see the following components:

  1. adding R -> PMML mappers for all supported built-in functions to r2pmml/JPMML-R

  2. adding an intermediate step that injects the applied transformations into the r2pmml() function, so that the converter incorporates them into the final PMML representation

Could you elaborate on how this could be solved?

Kind regards

vruusmann commented 3 years ago

Related to #35, #36

As you pointed out in this thread, the model formula interface can't be used in combination with an xgboost model.

It's an XGBoost limitation.

You could emulate formula interface like this:

xgb.formula = as.formula(..)
# Transform data.frame (apply_formula_to_data_frame is a placeholder name)
Xt = apply_formula_to_data_frame(X, xgb.formula)
# Train XGBoost using the transformed data frame (xgb stands for the training call)
xgb.model = xgb(x = Xt, y = label, ...)
# Attach formula to the model
xgb.model$formula = xgb.formula
# Convert to PMML
r2pmml(xgb.model, "xgboost.pmml")

This is the idea behind #36.
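
A minimal sketch of what the placeholder apply_formula_to_data_frame step could look like in base R, assuming a one-sided formula that is evaluated with stats::model.matrix(). The helper name is taken from the snippet above; its body, the iris dataset and the example formula are assumptions made for illustration. Nothing here shows how r2pmml itself would treat the attached formula.

```r
# Hypothetical helper: evaluates a one-sided formula against a data frame
# and returns a numeric matrix that can be used as XGBoost input.
apply_formula_to_data_frame <- function(X, formula) {
  # model.matrix() evaluates inline calls such as cut() and
  # expands factors into dummy (0/1) columns
  mm <- model.matrix(formula, data = X)
  # Drop the constant intercept column
  mm[, colnames(mm) != "(Intercept)", drop = FALSE]
}

# Illustrative formula: one numeric feature, one binned feature, one factor
xgb.formula <- ~ Sepal.Length + cut(Petal.Length, breaks = c(0, 2, 5, 10)) + Species

Xt <- apply_formula_to_data_frame(iris, xgb.formula)
print(dim(Xt))
print(colnames(Xt))
```

The resulting matrix can then be handed to the XGBoost training call, with the formula kept alongside the model as in the snippet above.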

So far, I've been leveraging the legacy pmml library to produce PMML snippets for the necessary transformations

You must be extremely sharp/skilled. I never managed to figure out how to use the legacy pmml package for feature transformations (for integration testing purposes).

Could you elaborate on how this could be solved?

There needs to be something that the R runtime environment can execute (apply to a data frame), and that can also be serialized to a file in RDS data format so that the R2PMML converter can see it.

If you do free-form feature engineering in an R script, then it cannot be dumped as a single R object.

However, if you do feature engineering using Tidyverse recipes (the recipes package), then that could be dumped in RDS data format.

vruusmann commented 3 years ago

@psxmc6 If the conversion to PMML weren't a problem, then how would you do feature engineering in R? Which package, which functions (for continuous and categorical features)?

The only "limitation" is that the solution must be dumpable into a file in RDS data format. When loaded back into a clean R environment from the RDS file, it must be "complete", that is, executable without much R scripting effort.
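
To make the RDS requirement concrete, here is a small base R sketch of the round trip: a transformation object is saved with saveRDS(), read back with readRDS(), and applied to a data frame without further scripting. Using a formula as the transformation object, the iris dataset and the specific breaks are all assumptions chosen for illustration. Whether r2pmml can interpret a given object type is a separate question.

```r
# Illustrative transformation object: a formula with an inline cut() call
xgb.formula <- ~ cut(Petal.Length, breaks = c(0, 2, 5, 10)) + Species

# Dump the object to a file in RDS data format
rds.file <- tempfile(fileext = ".rds")
saveRDS(xgb.formula, rds.file)

# Load it back, as a converter or a clean R session would
restored <- readRDS(rds.file)

# The restored object is "complete": it can be applied to a data frame directly
Xt <- model.matrix(restored, data = iris)
print(colnames(Xt))
```

A free-form script that mutates columns step by step has no such single object to save, which is the contrast drawn above.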

psxmc6 commented 3 years ago

I don't have anything specific in mind. Please correct me if I am wrong, but in the end, we are limited to what can be used by the list of PMML built-in functions? I found that substring, replace, isIn, matches, if, and, or allow you to express quite a broad range of transformations, and these will likely be available in many different packages.
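
For orientation, the built-in functions named above have rough base R analogues. The sketch below pairs them up on a toy data frame. The pairing is my own assumption about which R functions are closest in spirit, not a description of any mapping implemented in r2pmml or JPMML-R, and the column names and values are made up.

```r
df <- data.frame(code = c("AB-123", "CD-45", "AB-789"), stringsAsFactors = FALSE)

# substring -> substr(): first two characters
df$prefix <- substr(df$code, 1, 2)

# replace -> gsub(): remove the hyphen
df$clean <- gsub("-", "", df$code)

# isIn -> %in%: membership in a fixed set of values
df$is_ab <- df$prefix %in% c("AB")

# matches -> grepl(): regular expression test (ends with three digits)
df$three_digits <- grepl("[0-9]{3}$", df$code)

# if / and / or -> ifelse(), &, |
df$flag <- ifelse(df$is_ab & df$three_digits, "keep", "drop")
df$any_hit <- df$is_ab | df$three_digits

print(df)
```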

vruusmann commented 3 years ago

.. but in the end, we are limited to what can be used by the list of PMML built-in functions?

Not exactly.

PMML has three functionality/markup layers:

  1. Model
  2. Transformation
  3. Function

We should focus on the middle layer, which consists of elements dedicated to representing feature transformations (the classification is based on operational type):

Only when you cannot solve your problem in the middle layer using the above four elements should you fall back to the lowest layer and start using PMML built-in functions via the Apply element.

vruusmann commented 3 years ago

@psxmc6 Challenge rephrased - if you're starting with a raw dataset, and need to perform these middle-level transformations on your data (before sending it to XGBoost), how would you do it in the R language?

For example, you want to bin a continuous feature into a categorical one. The R2PMML package currently supports the base::cut() function via the formula interface. But as we know, the XGBoost package does not have formula support. What's the alternative?
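
For reference, this is roughly what the binning step looks like when done directly on the data frame rather than inside a formula. It is a base R sketch on the iris dataset with made-up break points, followed by one-hot encoding so that the result is a numeric matrix. It only illustrates the R side of the question. Because the cut() call lives in the script and not in a saved object, a converter has no way to see it.

```r
df <- iris

# Bin a continuous feature into a categorical one (break points are illustrative)
bin.breaks <- c(0, 2, 5, 10)
df$Petal.Length.bin <- cut(df$Petal.Length, breaks = bin.breaks)

# One-hot encode the binned feature; "- 1" removes the intercept
# so that every bin gets its own 0/1 column
Xt <- model.matrix(~ Petal.Length.bin - 1, data = df)

print(table(df$Petal.Length.bin))
print(head(Xt))
```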