jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

How to pass the result of one transformer into another in the PMMLPipeline? #87

Closed johnorillo closed 6 years ago

johnorillo commented 6 years ago

Hi,

I created a custom transformer that takes two input columns and outputs one column. It simply computes the difference in months between two dates in different formats (%Y-%m-%d and %Y/%m/%d). My sample pipeline is shown below:

column_preprocessor = DataFrameMapper([
    (["date1","date2"], [DateGap_Custom(function='DateDiff')]),
    (["feat3","feat4"], [ContinuousDomain()]),
])

clf = KNeighborsClassifier(n_neighbors=15)

pipeline = PMMLPipeline([
    ("mapper",column_preprocessor),
    ("classifier",clf)
])

pipeline.fit(X_train_p, y_train_p.ravel())

So far I have it working, and the correct values are reflected in the training instances of the exported PMML model. On the Python side, my experiments showed higher accuracy when I applied scaling after the DateGap_Custom() transformer.

Is there a way to pass the output of DateGap_Custom() to RobustScaler() inside the DataFrameMapper, so that the robust scaling is included in the transformation dictionary? For example, a hypothetical PMML fragment:

    <DerivedField name="robust_scaler(DateDiff(processdate_orig, date_nlp_extractval))" optype="continuous" dataType="double">
        <Apply function="/">
            <Apply function="-">
                <Apply function="DateDiff">
                    <FieldRef field="processdate_orig"/>
                    <FieldRef field="date_nlp_extractval"/>
                </Apply>
                <Constant dataType="double">0.9931044245377176</Constant>
            </Apply>
            <Constant dataType="double">0.022205928950840836</Constant>
        </Apply>
    </DerivedField>

That way, the KNN model would be built on the scaled output of my custom transformer DateGap_Custom().
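
For reference, the two constants in a fragment like the one above would correspond to what RobustScaler learns during fitting: a per-column center (the median) and scale (the interquartile range, with default settings). A minimal sketch of that arithmetic, using made-up month gaps rather than the data from this issue:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical month gaps in a single column; the values are invented for illustration.
gaps = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler().fit(gaps)
center = scaler.center_[0]  # median of the column
scale = scaler.scale_[0]    # 75th percentile minus 25th percentile

# The transform is (x - center) / scale, the same shape as the PMML expression.
manual = (gaps - center) / scale
assert np.allclose(manual, scaler.transform(gaps))
```

The outlier (100.0) barely affects the center and scale, which is the usual reason to pick RobustScaler ahead of a distance-based model such as KNN.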

Thank you!

johnorillo commented 6 years ago

Hi, closing this. I was able to solve it by including the scaler right after the column transform:

pipeline = PMMLPipeline([
    ("mapper",column_preprocessor),
    ('scaler',RobustScaler()),
    ("classifier",clf)
])

vruusmann commented 6 years ago

"was able to solve this by including the scaler right after the column transform"

A scaler as a second step in the top-level pipeline would apply to all columns that are coming out of the first DataFrameMapper step.

You can scale the column in place. Simply append RobustScaler to the list of transformers for that column:

column_preprocessor = DataFrameMapper([
    (["date1","date2"], [DateGap_Custom(function='DateDiff'), RobustScaler()]) 
])
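
The difference between the two placements can be sketched with plain NumPy arrays standing in for the mapper output. The column layout and values below are assumptions made up for illustration, not taken from the issue:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Assumed columns: [month_gap, feat3, feat4], standing in for the DataFrameMapper output.
mapped = np.array([
    [1.0, 10.0, 0.1],
    [3.0, 20.0, 0.2],
    [5.0, 30.0, 0.3],
])

# Scaler as a separate pipeline step: every column is rescaled.
scaled_all = RobustScaler().fit_transform(mapped)

# Scaler attached to one column only: the other columns pass through untouched.
scaled_one = mapped.copy()
scaled_one[:, [0]] = RobustScaler().fit_transform(mapped[:, [0]])
```

In the second variant only the month gap changes, which is what appending RobustScaler to that column's transformer list is meant to achieve.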

johnorillo commented 6 years ago

Hi, thanks for replying,

I tried appending RobustScaler as you suggested. At first I got an error saying "Expected 2D array, got 1D array instead". The same thing happens if I append RobustScaler() to one of the built-in transformers, e.g.:

iris_pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["SepalLengthCm", "PetalLengthCm"], [Aggregator(function = "mean"), RobustScaler()]),
    ])),
    ("classifier", KNeighborsClassifier(n_neighbors=15))
])

I was able to fix the error for my custom function, though, by making sure the returned value is reshaped into a single column (e.g. result.reshape(-1,1)). So far everything is working fine. Thanks!
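
A minimal sketch of that reshape fix, assuming a plain scikit-learn transformer. MonthGap is a hypothetical stand-in for DateGap_Custom (whose source is not shown in this thread), the dates are invented, and the sign convention (second date minus first, in calendar months) is an assumption. A class like this only illustrates the shape contract; it should not be expected to convert to PMML as-is.

```python
from datetime import datetime

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler


class MonthGap(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DateGap_Custom: calendar months between two date columns."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        gaps = []
        for first, second in np.asarray(X):
            d1 = datetime.strptime(first, "%Y-%m-%d")
            d2 = datetime.strptime(second, "%Y/%m/%d")
            gaps.append((d2.year - d1.year) * 12 + (d2.month - d1.month))
        # Return shape (n_samples, 1). A flat (n_samples,) array would make the
        # next scaler fail with the "Expected 2D array" error.
        return np.array(gaps, dtype=float).reshape(-1, 1)


# Made-up dates: first column uses %Y-%m-%d, second column uses %Y/%m/%d.
dates = np.array([
    ["2018-01-15", "2018/03/01"],
    ["2017-11-30", "2018/11/30"],
    ["2018-05-01", "2018/05/20"],
])

gap_column = MonthGap().fit_transform(dates)       # shape (3, 1)
scaled = RobustScaler().fit_transform(gap_column)  # accepted because the input is 2D
```

Passing gap_column.ravel() to RobustScaler instead raises a ValueError, since scikit-learn scalers expect a 2D array with one row per sample.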