jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Support for transformer-only pipelines #86

Closed AshwinSekar closed 5 years ago

AshwinSekar commented 6 years ago

I understand that a PMMLPipeline must end with an estimator to be convertible to PMML. However, I have use cases with preprocessing-only pipelines that I would like to convert to PMML for evaluation in Java.

If I append a DummyClassifier or DummyRegressor to the end of the pipeline, it can be written out as valid PMML. However, the target_fields information is lost, and I am unsure how to recover anything but the dummy prediction from the PMML.

Is there a recommended workflow in this situation? Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?

Thanks for your help!

vruusmann commented 6 years ago

> Is there a recommended workflow in this situation?

There was a similar situation with Apache Spark pipelines, and we managed to find a fairly elegant solution there. However, Apache Spark pipelines are far more flexible than Scikit-Learn pipelines (e.g. they can have multiple models in a pipeline, and there can be transformers following the last model), so the solution is probably not 1:1 transferable (and I really cannot recall its technical details).

> Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?

Probably the easiest solution to your problem:
1) Create a dummy-like estimator. Could very well be a subclass of DummyRegressor or DummyClassifier.
2) In its #encodeModel(Schema) method, create an empty Output element, and append an OutputField child element for every pre-processing step that you want to pass through. Be sure to use unique field names in order to avoid naming conflicts between DerivedField and OutputField elements.

Something like this should do:

<Output>
  <OutputField name="z" dataType=".." optype="..">
    <!-- refers to a DerivedField element whose name is "internal(y)" -->
    <FieldRef field="internal(y)"/>
  </OutputField>
</Output>
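
As a rough illustration of the Output element described above, here is a minimal Python sketch that builds the same structure with the standard library and enforces the "unique field names" advice. It is not JPMML-SkLearn code: the helper name build_output is hypothetical, and the "double" and "continuous" values are illustrative placeholders for the attributes left open in the snippet. It mirrors the snippet as posted and leaves out any other attributes (such as feature) that a complete PMML document may need.

```python
import xml.etree.ElementTree as ET


def build_output(passthrough):
    """Build an <Output> element that re-exposes derived fields.

    passthrough: list of (output_name, derived_field_name, data_type, op_type).
    All names and types are supplied by the caller; nothing is inferred.
    """
    output_names = [name for name, _, _, _ in passthrough]
    derived_names = {derived for _, derived, _, _ in passthrough}

    # OutputField names must be unique among themselves...
    if len(set(output_names)) != len(output_names):
        raise ValueError("duplicate OutputField name")
    # ...and must not collide with the DerivedField names they refer to.
    clashes = derived_names.intersection(output_names)
    if clashes:
        raise ValueError(f"names clash with DerivedField names: {sorted(clashes)}")

    output = ET.Element("Output")
    for name, derived, data_type, op_type in passthrough:
        field = ET.SubElement(
            output,
            "OutputField",
            {"name": name, "dataType": data_type, "optype": op_type},
        )
        # Each OutputField simply points back at an existing DerivedField.
        ET.SubElement(field, "FieldRef", {"field": derived})
    return output


# Illustrative values only: "z" and "internal(y)" come from the snippet above,
# the data type and operational type are assumptions.
output = build_output([("z", "internal(y)", "double", "continuous")])
print(ET.tostring(output, encoding="unicode"))
```

Passing an output name that equals one of the derived field names (for example ("internal(y)", "internal(y)", ...)) raises a ValueError, which is the naming conflict the comment warns about.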

vruusmann commented 6 years ago

Re-purposed this issue. Would like to provide a solution that wouldn't require defining custom estimator types and renaming fields.

Perhaps the sklearn2pmml.pipeline.PMMLPipeline class should have a marker attribute transformation_only (or similar), which would inform the JPMML-SkLearn backend that the final estimator step (if any) should be skipped.

AshwinSekar commented 6 years ago

Thanks for the suggestion, I will look into creating a dummy estimator.

I noticed that the TransformationDictionary actually has all of the transforms in my pipeline in the form of derived fields. Is there any way I can use these derived fields to extract the transformed values? Can I apply the expression from getExpression() in some way to the input fields?

vruusmann commented 6 years ago

> Is there any way I can use these derived fields to extract the transformed values?

See this comment, and the issue referenced therein: https://github.com/jpmml/jpmml-converter/issues/11#issuecomment-428587478
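
For readers who want a feel for what "applying the expression of a derived field to the input fields" means, the following is a toy Python sketch, independent of whatever the linked comment recommends. It walks a hand-written TransformationDictionary fragment and evaluates each DerivedField against a row of input values. The fragment, the field names, and the helper names evaluate and transform are all made up for illustration; the PMML namespace is omitted, and only FieldRef, Constant, and two-argument arithmetic Apply elements are handled. A real evaluator such as JPMML-Evaluator covers far more than this.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical fragment: the field names and the expression are invented,
# and the PMML namespace is left out to keep the sketch short.
PMML_FRAGMENT = """
<TransformationDictionary>
  <DerivedField name="scaled(x)" dataType="double" optype="continuous">
    <Apply function="/">
      <Apply function="-">
        <FieldRef field="x"/>
        <Constant dataType="double">10.0</Constant>
      </Apply>
      <Constant dataType="double">2.0</Constant>
    </Apply>
  </DerivedField>
</TransformationDictionary>
"""

# Only the four binary arithmetic functions are supported in this sketch.
ARITHMETIC = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(expression, row):
    """Recursively evaluate a small subset of PMML expressions."""
    if expression.tag == "FieldRef":
        return row[expression.get("field")]
    if expression.tag == "Constant":
        return float(expression.text)
    if expression.tag == "Apply":
        left, right = (evaluate(child, row) for child in expression)
        return ARITHMETIC[expression.get("function")](left, right)
    raise NotImplementedError(expression.tag)


def transform(dictionary, row):
    """Return the derived field values computed from one input row."""
    values = dict(row)
    for derived in dictionary.findall("DerivedField"):
        # The first child of the DerivedField is taken to be its expression;
        # later fields may refer to earlier derived fields.
        values[derived.get("name")] = evaluate(derived[0], values)
    return {name: value for name, value in values.items() if name not in row}


dictionary = ET.fromstring(PMML_FRAGMENT)
print(transform(dictionary, {"x": 14.0}))
```

With the input x = 14.0, the derived field evaluates to (14.0 - 10.0) / 2.0, so the returned dictionary maps "scaled(x)" to 2.0.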