Open peguerosdc opened 2 years ago
Hi. I think this would also be useful when you want to split data preprocessing from the model. From what I understand, the problem is that the pipeline's predict_fn takes kwargs, works out which function each kwarg belongs to, and passes it to that function. So if you have two pipelines, it gets hard to determine which pipeline a kwarg belongs to.
One possible solution I see to this is taking the approach that sklearn takes when passing parameters to the estimators in a pipeline:
"Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p." (source)
So I think it would be possible to have nested pipelines and then call predict_fn with kwargs using this kind of prefixing, i.e. if we have the following:
pipe1 = build_pipeline([f(a=1), g(a=2)])
pipe2 = build_pipeline([f(a=2), g(a=3)])
pipe3 = build_pipeline([pipe1, pipe2])
predict_fn, data, logs = pipe3(df)
# the following would set the argument a to 3 in the f function of the first pipeline
predict_fn(df2, pipeline0__f__a=3)
# the following would set the argument a to 2 in the g function of the second pipeline
predict_fn(df2, pipeline1__g__a=2)
We would of course be able to assign custom names to each pipeline and function, so the argument could look like preprocessing__scaler__column='something'
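A minimal sketch of how that prefix routing could work, assuming each step is a named function that accepts keyword arguments. make_predict_fn and split_prefixed_kwargs are hypothetical helpers, not part of fklearn, and plain lists stand in for DataFrames:

```python
from collections import defaultdict


def split_prefixed_kwargs(kwargs, sep="__"):
    """Group kwargs by their first prefix: {"s__p": v} -> {"s": {"p": v}}."""
    routed = defaultdict(dict)
    for key, value in kwargs.items():
        step, found, rest = key.partition(sep)
        if not found or not rest:
            raise ValueError(f"expected '<step>{sep}<param>', got {key!r}")
        routed[step][rest] = value
    return dict(routed)


def make_predict_fn(steps):
    """steps: ordered (name, fn) pairs, where fn(data, **params) returns data.

    A nested pipeline is just another (name, fn) pair whose fn was built by
    make_predict_fn, so it receives the rest of the key and routes it again.
    """
    names = {name for name, _ in steps}

    def predict_fn(data, **kwargs):
        routed = split_prefixed_kwargs(kwargs)
        unknown = set(routed) - names
        if unknown:
            raise ValueError(f"unknown step(s): {sorted(unknown)}")
        for name, fn in steps:
            data = fn(data, **routed.get(name, {}))
        return data

    return predict_fn


# Hypothetical toy steps standing in for f and g above.
def f(data, a=1):
    return [x + a for x in data]


def g(data, a=2):
    return [x * a for x in data]


pipeline0 = make_predict_fn([("f", f), ("g", g)])
pipeline1 = make_predict_fn([("f", f), ("g", g)])
nested = make_predict_fn([("pipeline0", pipeline0), ("pipeline1", pipeline1)])

# Only f in the first pipeline and g in the second are overridden.
result = nested([1, 2], pipeline0__f__a=3, pipeline1__g__a=10)
```

Because each level strips only its own prefix, the same function handles any nesting depth, and an unknown step name fails loudly instead of being silently dropped.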
WDYT?
Code sample
This script is complete, it should run "as is"
Problem description
This is a problem because (taken from the docs) "pipelines (should) behave exactly as individual learner functions". That is, pipelines should be consistent with the L in SOLID (the Liskov substitution principle), but that is not happening.
The main benefit of supporting nested pipelines is that you can produce more maintainable code, since packing complex operations into one big step like:
is cleaner and more readable.
Expected behavior
The "Code sample" should produce the same output as if nested_learner was defined as:
That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.
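As a rough illustration of that substitution property, here is a toy sketch. It is not fklearn's build_pipeline: toy_pipeline, center and scale_by_max are made-up names, and plain lists stand in for DataFrames. The point is that a pipeline returning the same (predict_fn, data, logs) triple as a learner can be dropped in as a step of another pipeline:

```python
def toy_pipeline(*learners):
    """Compose learners into a learner with the same contract:
    learner(data) -> (predict_fn, transformed_data, logs)."""

    def pipeline_learner(data):
        step_fns, logs = [], []
        for learner in learners:
            step_fn, data, log = learner(data)
            step_fns.append(step_fn)
            logs.append(log)

        def predict_fn(new_data):
            # Replay the fitted steps in order on unseen data.
            for fn in step_fns:
                new_data = fn(new_data)
            return new_data

        return predict_fn, data, logs

    return pipeline_learner


def center(data):
    """Toy learner: subtract the training mean."""
    mean = sum(data) / len(data)

    def predict_fn(new_data):
        return [x - mean for x in new_data]

    return predict_fn, predict_fn(data), {"center": {"mean": mean}}


def scale_by_max(data):
    """Toy learner: divide by the largest absolute training value."""
    peak = max(abs(x) for x in data) or 1.0

    def predict_fn(new_data):
        return [x / peak for x in new_data]

    return predict_fn, predict_fn(data), {"scale_by_max": {"peak": peak}}


train = [1.0, 2.0, 3.0]
flat = toy_pipeline(center, scale_by_max)
nested = toy_pipeline(toy_pipeline(center), scale_by_max)  # pipeline used as a step

flat_fn, flat_out, _ = flat(train)
nested_fn, nested_out, _ = nested(train)
```

In this sketch the flat and nested versions give the same transformed data and the same predictions; only the shape of the logs differs, since the nested pipeline contributes a list of its own.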
Possible solutions
A workaround is proposed in #145, but it only works if you already have a DataFrame (which doesn't happen in this scenario). I would like to hear if someone has investigated this or knows what we need to change to support it, but in the meantime I propose adding a note to the docs to prevent someone else from trying to do this.