nubank / fklearn

fklearn: Functional Machine Learning
Apache License 2.0

Pipelines can't be nested as any other learner #199

Open peguerosdc opened 2 years ago

peguerosdc commented 2 years ago

Code sample

This script is complete, it should run "as is"

from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer

# This example will try to create a linear regressor...
def load_sample_data():
    import pandas as pd
    from sklearn import datasets
    x, y = datasets.load_diabetes(as_frame=True, return_X_y=True)
    return pd.concat([x, y], axis=1)

# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
    build_pipeline(
        imputer(columns_to_impute=["age"], impute_strategy='median'),
    ),
    linear_regression_learner(features=["age"], target="target")
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''

Problem description

This is a problem because, per the docs, "pipelines (should) behave exactly as individual learner functions". That is, pipelines should satisfy the Liskov substitution principle (the L in SOLID): anywhere a learner is accepted, a pipeline should work too. Currently, that is not the case.

The main benefit of supporting nested pipelines is more maintainable code: packing complex operations into one reusable step, like:

def custom_learner(feature: str, target: str):
    return build_pipeline(
        imputer(columns_to_impute=[feature], impute_strategy='median'),
        linear_regression_learner(features=[feature], target=[target])
    )

is cleaner and more readable.

Expected behavior

The "Code sample" should produce the same output as if nested_learner was defined as:

nested_learner = build_pipeline(
    imputer(columns_to_impute=["age"], impute_strategy='median'),
    linear_regression_learner(features=["age"], target=["target"])
)

That is, if a pipeline is a kind of learner, it should be possible to substitute it for any other learner.

Possible solutions

A workaround is proposed in #145, but it only works if you already have a DataFrame, which is not the case in this scenario. I would like to hear if someone has investigated this or knows what would need to change to support it; in the meantime, I propose adding a note to the docs to warn others against trying this.
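Another interim workaround is to avoid nesting altogether: keep sub-pipelines as plain lists of learners and flatten them into a single build_pipeline call. A sketch (flatten_steps is a hypothetical helper, not part of fklearn):

```python
from typing import Any, List

def flatten_steps(*steps: Any) -> List[Any]:
    """Recursively splice lists/tuples of learners into one flat step list."""
    flat: List[Any] = []
    for step in steps:
        if isinstance(step, (list, tuple)):
            flat.extend(flatten_steps(*step))
        else:
            flat.append(step)
    return flat

# Stand-in strings in place of learner functions, just to show the flattening:
preprocessing = ["imputer_age", "imputer_bmi"]
model = "linear_regression"
print(flatten_steps(preprocessing, model))
# → ['imputer_age', 'imputer_bmi', 'linear_regression']

# With fklearn this would become (not run here):
# nested_learner = build_pipeline(*flatten_steps(preprocessing, model))
```

This keeps the grouping benefit of custom_learner-style helpers while handing build_pipeline the flat list it expects.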

jmoralez commented 2 years ago

Hi. I think this would also be useful when you want to split data preprocessing from the model. From what I understand, the problem is that the pipeline's predict_fn takes **kwargs, determines which inner function each kwarg belongs to, and forwards it accordingly. So with two nested pipelines it becomes hard to determine which pipeline a given kwarg belongs to.
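For illustration, the error message suggests build_pipeline validates that no inner predict function takes variable-length arguments; a minimal sketch of such a check (my reconstruction using inspect, not fklearn's actual code):

```python
import inspect

def has_var_kwargs(fn) -> bool:
    """Return True if fn accepts **kwargs (the shape the validation rejects)."""
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        for p in inspect.signature(fn).parameters.values()
    )

def inner_predict(df):               # a plain learner's predict_fn
    return df

def pipeline_predict(df, **kwargs):  # a pipeline's predict_fn
    return df

print(has_var_kwargs(inner_predict))     # → False
print(has_var_kwargs(pipeline_predict))  # → True
```

Since a pipeline's own predict_fn necessarily has **kwargs (to route arguments to its steps), a nested pipeline trips exactly this check.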

One possible solution I see to this is taking the approach that sklearn takes when passing parameters to the estimators in a pipeline:

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p. source

So I think it would be possible to have nested pipelines and then call the predict_fn with kwargs using this kind of prefixing, i.e. given the following:

pipe1 = build_pipeline(f(a=1), g(a=2))
pipe2 = build_pipeline(f(a=2), g(a=3))
pipe3 = build_pipeline(pipe1, pipe2)
predict_fn, data, logs = pipe3(df)
# the following would set the argument a to 3 in the f function of the first pipeline
predict_fn(df2, pipeline0__f__a=3)
# the following would set the argument a to 2 in the g function of the second pipeline
predict_fn(df2, pipeline1__g__a=2)
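The prefix routing itself is straightforward to sketch (split_prefixed_kwargs is a hypothetical helper, nothing like it exists in fklearn today):

```python
def split_prefixed_kwargs(kwargs: dict) -> dict:
    """Group 's__p'-style kwargs by their first prefix segment,
    stripping that segment so the remainder can be routed recursively."""
    grouped: dict = {}
    for key, value in kwargs.items():
        prefix, _, rest = key.partition("__")
        grouped.setdefault(prefix, {})[rest] = value
    return grouped

routed = split_prefixed_kwargs({"pipeline0__f__a": 3, "pipeline1__g__a": 2})
print(routed)
# → {'pipeline0': {'f__a': 3}, 'pipeline1': {'g__a': 2}}
```

Each inner pipeline would then receive only its own slice of the kwargs and could apply the same splitting one level down.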

We could of course allow custom names for each pipeline and function, so the argument could look like preprocessing__scaler__column='something'.

WDYT?