Closed AshwinSekar closed 5 years ago
Is there a recommended workflow in this situation?
There was a similar situation with Apache Spark pipelines, and we managed to find a fairly elegant solution there. However, Apache Spark pipelines are far more flexible than Scikit-Learn pipelines (e.g. they can have multiple models in a pipeline, and there can be transformers following the last model), so the solution is probably not 1:1 transferable (and I really cannot recall its technical details).
Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?
Probably the easiest solution to your problem:
1) Create a dummy-like estimator. Could very well be a subclass of DummyRegressor or DummyClassifier.
2) In its #encodeModel(Schema) method, create an empty Output element, and append an OutputField child element for every pre-processing step that you want to pass through. Be sure to use unique field names in order to avoid naming conflicts between DerivedField and OutputField elements.
Something like this should do:
<Output>
  <OutputField name="z" dataType=".." optype="..">
    <!-- refers to a DerivedField element whose name is "internal(y)" -->
    <FieldRef field="internal(y)"/>
  </OutputField>
</Output>
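To make the shape of that Output element concrete, here is a minimal Python sketch that builds it with the standard library. It is not the JPMML-SkLearn implementation (the real #encodeModel(Schema) method lives on the Java side); the function name build_output, the out_ prefix, and the "double"/"continuous" values are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_output(passthrough, prefix="out_"):
    """Build a PMML-style <Output> element that re-exports derived fields.

    `passthrough` maps a DerivedField name to a (dataType, optype) pair.
    The function name and the `prefix` default are hypothetical.
    """
    output = ET.Element("Output")
    for derived_name, (data_type, optype) in passthrough.items():
        # Prefix the OutputField name so that it cannot collide with the
        # name of the DerivedField it refers to.
        field = ET.SubElement(
            output,
            "OutputField",
            name=prefix + derived_name,
            dataType=data_type,
            optype=optype,
        )
        # The FieldRef points back at the existing DerivedField.
        ET.SubElement(field, "FieldRef", field=derived_name)
    return output


# Illustrative values only; use the data type and operational type of
# your own derived field.
output = build_output({"internal(y)": ("double", "continuous")})
print(ET.tostring(output, encoding="unicode"))
```

Deriving the output name from the derived field name plus a fixed prefix is just one way to keep the two sets of names disjoint; any scheme that guarantees uniqueness would do.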
Re-purposed this issue. Would like to provide a solution that wouldn't require defining custom estimator types and renaming fields.
Perhaps the sklearn2pmml.pipeline.PMMLPipeline class should have a marker attribute transformation_only (or similar), which would inform the JPMML-SkLearn backend that the final estimator step (if any) should be skipped.
Thanks for the suggestion, I will look into creating a dummy estimator.
I noticed that the TransformationDictionary actually has all of the transforms in my pipeline in the form of derived fields. Is there any way I can use these derived fields to extract the transformed values? Can I apply the expression from getExpression() in some way to the input fields?
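As a rough aid for checking which derived fields a converted file actually contains, the following Python sketch lists the DerivedField names found under TransformationDictionary. It only reads names and does not evaluate any expression. The helper names and the hand-written SAMPLE document are assumptions for illustration, not output from a real conversion.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{uri}DerivedField' -> 'DerivedField'."""
    return tag.rsplit("}", 1)[-1]


def derived_field_names(pmml_text):
    """Return the names of DerivedField children of TransformationDictionary."""
    root = ET.fromstring(pmml_text)
    names = []
    for element in root.iter():
        if local_name(element.tag) != "TransformationDictionary":
            continue
        for child in element:
            if local_name(child.tag) == "DerivedField":
                names.append(child.get("name"))
    return names


# Hand-written stand-in for a converted pipeline; a real file will
# contain many more elements.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TransformationDictionary>
    <DerivedField name="internal(y)" optype="continuous" dataType="double">
      <FieldRef field="y"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""

print(derived_field_names(SAMPLE))
```

Matching on local names rather than a fixed namespace keeps the sketch independent of the PMML schema version declared in the file.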
Is there any way I can use these derived fields to extract the transformed values?
See this comment, and the issue referenced therein: https://github.com/jpmml/jpmml-converter/issues/11#issuecomment-428587478
I understand that a PMMLPipeline must end with an estimator to be valid for conversion to PMML. I have use cases in which I have useful preprocessing pipelines that I would like to convert to PMML for evaluation in Java.
If I stick a DummyClassifier or DummyRegressor at the end of the pipeline, it can be written to valid PMML; however, the target_fields information is lost, and I am unsure how to recover anything but the dummy prediction from the PMML.
Is there a recommended workflow in this situation? Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?
Thanks for your help!