thomasjpfan closed this pull request 2 years ago
I'd say most users don't have one pipeline for their transform steps and a separate pipeline that adds the final predictor. What would a usual pipeline look like? Should users do something like `pipeline[:-1].set_output(...)`, or can they call `set_output` on the pipeline and expect that to apply only to steps with an available `transform` method?
Otherwise I'm happy with the SLEP.
> can they call `set_output` on the pipeline and expect that to apply only to steps with an available `transform` method?
This is the behavior I implemented and was going for. `pipeline.set_output(transform="pandas")` will only configure steps that can transform.
Then the example in the SLEP could also mirror that to make it clear. But it's a +1 for me anyway :)
In this SLEP, I updated the pipeline example to have a classifier showcasing how `set_output` can be called on the whole pipeline and only the transformers are configured.
+1
So can I clarify that `Pipeline` is exceptional in the sense that it is the only non-transformer that has a `set_output` method (and that it only affects the output of the `Pipeline` if either it is also a transformer, or some pipeline components behave differently with different input)?
(Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?)
> Do we have other non-transformers that have a transformer for a parameter, aside from `TransformedTargetRegressor`?
`GridSearchCV` can define a `transform` method if the underlying estimator is a transformer.
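A small check of this point, offered as a sketch and assuming a recent scikit-learn release: `GridSearchCV` exposes `transform` conditionally, so `hasattr` reflects whether the wrapped estimator can transform. The estimators and parameter grids below are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Wrapping a transformer: the search object exposes `transform`.
search_over_transformer = GridSearchCV(PCA(), {"n_components": [1, 2]})

# Wrapping a classifier: the search object does not expose `transform`.
search_over_classifier = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]})

transformer_search_can_transform = hasattr(search_over_transformer, "transform")
classifier_search_can_transform = hasattr(search_over_classifier, "transform")
```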
Thinking about it more, the special case for `Pipeline` can influence other meta-estimators. For example, a `VotingClassifier` with many pipelines:
voting = VotingClassifier([
("pipe1", pipe1), ("pipe2", pipe2), ("pipe3", pipe3)
])
# If `VotingClassifier` defines a `set_output`, then the whole pipeline can be configured with:
voting.set_output(transform="pandas")
# If not, then every pipeline needs to be set individually:
voting2 = VotingClassifier([
("pipe1", pipe1.set_output(transform="pandas")), ...
])
For a better UX, I think all first-party meta-estimators should define a `set_output` method that configures all their inner estimators. Specifically for a meta-estimator:
[…] `set_output`, call it (the case of `Pipeline`).
[…]`.set_output` on it.
+1.
ping @GaelVaroquaux ;)
I am also +1 on this SLEP. Including my vote, we have 12 in favor and 0 against, which means this enhancement proposal is accepted. Thank you everyone for making this possible!
This PR is for us to discuss and collect votes for SLEP018 - Pandas Output for Transformers. The current implementation is available at https://github.com/scikit-learn/scikit-learn/pull/23734. Note that this vote is for the API and the implementation can be adjusted.
According to our governance model, the vote will be open for a month (till 17th August), and the motion is accepted if 2/3 of the cast votes are in favor.
@scikit-learn/core-devs