jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

FunctionTransformer usage #40

Closed gediminaszylius closed 5 years ago

gediminaszylius commented 7 years ago

I want to understand the limitations of using FunctionTransformer in scikit-learn when creating a valid PMML pipeline. I have no experience in Java, so it is difficult to figure out its limitations myself, but I see in https://github.com/jpmml/jpmml-sklearn/blob/c0a3414c7486e880edf1aa79d4fd4b70d346cd2e/src/main/java/sklearn/preprocessing/FunctionTransformer.java that it supports various functions.

But my question is: what about combinations of those functions with simple +, -, *, / math operations (e.g. np.log(np.ceil(x/100)+1))? Is there any way to create similar static expressions via FunctionTransformer?

If not, is there a way to do it in Python and convert it to a valid PMML model using this library?

vruusmann commented 7 years ago

Currently you can use only 1-parameter NumPy ufuncs: https://docs.scipy.org/doc/numpy-1.12.0/reference/ufuncs.html

Here's a technical explanation for this limitation: https://groups.google.com/forum/#!topic/jpmml/MQ9ZA_6Xgt0
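To make the "1-parameter ufunc" restriction concrete, here is a minimal sketch that checks candidate functions locally with NumPy alone. The helper name `is_unary_ufunc` is hypothetical and not part of JPMML-SkLearn; passing this check only means the function has the right shape, not that the converter actually supports it.

```python
import numpy as np

def is_unary_ufunc(func):
    """Return True if func is a NumPy ufunc that takes exactly one input."""
    return isinstance(func, np.ufunc) and func.nin == 1

candidates = {
    "np.log": np.log,
    "np.ceil": np.ceil,
    "np.log1p": np.log1p,
    # x / 100 dispatches to np.divide, which takes two inputs
    "np.divide": np.divide,
    # A free-form expression wrapped in a lambda is not a ufunc at all
    "lambda": lambda x: np.log(np.ceil(x / 100) + 1),
}

# Map each candidate name to whether it is a 1-parameter ufunc
supported = {name: is_unary_ufunc(func) for name, func in candidates.items()}
print(supported)
```

This shows why np.log(np.ceil(x/100)+1) does not fit: the division step needs a second operand, so the expression as a whole cannot be expressed as a single 1-parameter ufunc.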

Free-form mathematical expressions are doable in many ways. In the long term, the JPMML-SkLearn library should provide a Python expression parser/translator (the JPMML-R library already has such a thing for R expressions). In the short term, you could try developing and deploying JPMML-SkLearn plugins using the SkLearn2PMML-Plugin approach. Unfortunately, it requires one to be fairly proficient with the Java/JVM platform.

Leaving this issue open now.

gediminaszylius commented 7 years ago

Understood. Are there any ways, using the current library version and without creating custom plugins, to: 1) combine model outputs internally, like clf1*clf2+clf3*clf4; 2) add a rule that activates a particular model if some input is present (a simple predicate) or, more generally, add if-like statements that activate part of the pipeline given a particular input?

vruusmann commented 7 years ago

How would you implement custom model ensembles in a Python script? Some Scikit-Learn class that I haven't heard of yet, or some 3rd party Python library?

Sure, you can implement both custom model and transformer types using the SkLearn2PMML-Plugin approach. The goal is to generate a segmentation-type MiningModel element, where the activation criteria for each member segment are defined by a specific predicate:

<MiningModel>
  <Segmentation>
    <Segment id="first">
      <SimplePredicate field="my_controller_field" operator="isNotMissing"/>
    </Segment>
    <Segment id="second">
      <SimplePredicate field="my_controller_field" operator="isMissing"/>
    </Segment>
  </Segmentation>
</MiningModel>

gediminaszylius commented 7 years ago

Yeah, this might be a bit off-topic, because it is outside scikit-learn. I build 4 different models, each with a different scikit-learn pipeline, and I want to combine them all in one PMML file, not in a scikit-learn supported way but with some general if statements and expressions (which are simple, but outside what scikit-learn objects support).

I don't use any ensembling library; I just want to include heterogeneous pipelines/models inside one PMML file and control which model/pipeline is used (for each input separately in a full batch) by adding a specific input feature that controls it, implementing as much custom logic as possible inside PMML in order to create a backend-independent model. I'm now also investigating the PMML standard in order to build a custom PMML model enricher (in Python); I see my need should be related to the MiningModel node, as you mentioned too.

I don't know Java, but I see that the raw PMML file can be modified, so I will try to do it the ugly way using Python alone.
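A rough sketch of what such Python-only post-processing could look like, using only the standard library. The function name `build_segmentation` and the placeholder model elements are hypothetical; in practice the model elements would be parsed out of the separately exported PMML files. XML namespaces, the MiningSchema, and the merged DataDictionary are deliberately left out, so the result is a skeleton rather than a valid PMML document.

```python
import xml.etree.ElementTree as ET

def build_segmentation(segments, method="selectFirst"):
    """Wrap model elements into a MiningModel/Segmentation skeleton.

    segments: list of (segment_id, predicate_attrs, model_element) tuples.
    """
    mining_model = ET.Element("MiningModel")
    segmentation = ET.SubElement(
        mining_model, "Segmentation", multipleModelMethod=method
    )
    for segment_id, predicate_attrs, model in segments:
        segment = ET.SubElement(segmentation, "Segment", id=segment_id)
        # The predicate decides whether this segment is active for a record
        ET.SubElement(segment, "SimplePredicate", predicate_attrs)
        segment.append(model)
    return mining_model

# Placeholder models (assumption): stand-ins for elements taken from real PMML files
first = ET.Element("RegressionModel", modelName="first_model")
second = ET.Element("TreeModel", modelName="second_model")

mining_model = build_segmentation([
    ("first", {"field": "my_controller_field", "operator": "isNotMissing"}, first),
    ("second", {"field": "my_controller_field", "operator": "isMissing"}, second),
])
xml_text = ET.tostring(mining_model, encoding="unicode")
print(xml_text)
```

The segment ids, the field name, and the predicate operators mirror the XML example earlier in the thread; everything else is illustrative.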

The idea is the same as ensembling: create additional segments that are activated/deactivated by a simple predicate (a special input feature) and are multiplied/added with other segments (a custom formula; weighted average and voting classifier could be seen as special cases).
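In plain Python, the intended evaluation logic might look roughly like the sketch below. It illustrates the idea only and does not claim to reproduce PMML's exact semantics; the `combine` function, the `controller` field, and the stand-in models are all hypothetical.

```python
def combine(row, segments):
    """Weighted average over the segments whose predicate accepts the row."""
    active = [
        (weight, model(row))
        for predicate, model, weight in segments
        if predicate(row)
    ]
    if not active:
        return None  # no segment was activated for this row
    total_weight = sum(weight for weight, _ in active)
    return sum(weight * value for weight, value in active) / total_weight

# Each segment: (predicate, stand-in model, weight)
segments = [
    (lambda row: row.get("controller") is not None, lambda row: 2.0 * row["x"], 1.0),
    (lambda row: row.get("controller") is None, lambda row: row["x"] + 10.0, 1.0),
    (lambda row: True, lambda row: 1.0, 3.0),  # always active
]

with_controller = combine({"x": 4.0, "controller": "a"}, segments)
without_controller = combine({"x": 4.0, "controller": None}, segments)
print(with_controller, without_controller)
```

A voting classifier or a custom formula would replace the weighted average in the last line of `combine`, while the predicate-based activation stays the same.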

vruusmann commented 5 years ago

Closing this issue, as it has been resolved in a piecewise manner over time.

First, it is possible to use NumPy ufuncs inside sklearn2pmml.preprocessing.ExpressionTransformer.

Second, it is possible to build conditionally executed model ensembles using sklearn2pmml.ensemble.SelectFirstEstimator.