scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

VOTE SLEP018 - Pandas Output for Transformers #72

Closed thomasjpfan closed 2 years ago

thomasjpfan commented 2 years ago

This PR is for us to discuss and collect votes for SLEP018 - Pandas Output for Transformers. The current implementation is available at https://github.com/scikit-learn/scikit-learn/pull/23734. Note that this vote is for the API and the implementation can be adjusted.

According to our governance model, the vote will be open for a month (till 17th August), and the motion is accepted if 2/3 of the cast votes are in favor.

@scikit-learn/core-devs

adrinjalali commented 2 years ago

I'd say most users don't have one pipeline for their transform steps and a separate pipeline for adding the final predictor. What would a typical pipeline look like? Should users do something like pipeline[:-1].set_output(...), or can they call set_output on the pipeline and expect that to apply only to steps with an available transform method?

Otherwise I'm happy with the SLEP.

thomasjpfan commented 2 years ago

can they call set_output on the pipeline and expect that to apply only to steps with an available transform method?

This is the behavior I implemented and was going for. pipeline.set_output(transform="pandas") will only configure steps that can transform.

adrinjalali commented 2 years ago

Then the example in the SLEP could also mirror that to make it clear. But it's a +1 for me anyway :)

thomasjpfan commented 2 years ago

In this SLEP, I updated the pipeline example to have a classifier showcasing how set_output can be called on the whole pipeline and only the transformers are configured.

amueller commented 2 years ago

+1

jnothman commented 2 years ago

So can I clarify that Pipeline is exceptional in the sense that it is the only non-transformer that has a set_output method (and that it only affects the output of the Pipeline if either it is also a transformer, or some pipeline components behave differently with different input)?

(Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?)

thomasjpfan commented 2 years ago

Do we have other non-transformers that have a transformer for a parameter, aside from TransformedTargetRegressor?

GridSearchCV can define a transform method if the underlying estimator is a transformer.

Thinking about it more, the special case for Pipeline can influence other meta-estimators. For example, a VotingClassifier with many pipelines:

from sklearn.ensemble import VotingClassifier

# pipe1, pipe2, and pipe3 are previously defined Pipeline objects.
voting = VotingClassifier([
    ("pipe1", pipe1), ("pipe2", pipe2), ("pipe3", pipe3)
])

# If `VotingClassifier` defines a `set_output`, then all of the pipelines can be configured with:
voting.set_output(transform="pandas")

# If not, then every pipeline needs to be set individually:
voting2 = VotingClassifier([
    ("pipe1", pipe1.set_output(transform="pandas")), ...
])

For a better UX, I think all first-party meta-estimators should define a set_output that configures all their inner estimators. Specifically, for a meta-estimator:

  1. If the inner estimator has a set_output, call it (the case of Pipeline).
  2. If the inner estimator is a transformer, then call set_output on it.
  3. Otherwise do nothing.
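
The three rules above can be sketched with toy classes. Everything in this block (ToyScaler, ToyClassifier, ToyPipeline, ToyVoting, configure_inner) is hypothetical and exists only for illustration; it is not scikit-learn's implementation, and it assumes that every transformer exposes set_output, so a single check covers rules 1 and 2.

```python
class ToyScaler:
    """Toy transformer: has both `transform` and `set_output`."""

    def __init__(self):
        self.output = "default"

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output = transform
        return self

    def transform(self, X):
        return X


class ToyClassifier:
    """Toy predictor: neither `transform` nor `set_output`."""

    def predict(self, X):
        return [0 for _ in X]


def configure_inner(estimator, *, transform=None):
    """Apply the proposed rules to one inner estimator."""
    # Rules 1 and 2: in this toy, pipelines and transformers both expose
    # `set_output`, so one check handles both cases.
    if hasattr(estimator, "set_output"):
        estimator.set_output(transform=transform)
    # Rule 3: otherwise do nothing.


class ToyPipeline:
    """Toy pipeline: not a transformer itself, but forwards `set_output`."""

    def __init__(self, steps):
        self.steps = steps

    def set_output(self, *, transform=None):
        for step in self.steps:
            configure_inner(step, transform=transform)
        return self


class ToyVoting:
    """Toy meta-estimator holding (name, estimator) pairs."""

    def __init__(self, estimators):
        self.estimators = estimators

    def set_output(self, *, transform=None):
        for _, estimator in self.estimators:
            configure_inner(estimator, transform=transform)
        return self


scaler1, scaler2 = ToyScaler(), ToyScaler()
pipe1 = ToyPipeline([scaler1, ToyClassifier()])
pipe2 = ToyPipeline([scaler2, ToyClassifier()])
voting = ToyVoting([("pipe1", pipe1), ("pipe2", pipe2)])

# One call on the outer meta-estimator reaches every nested transformer.
voting.set_output(transform="pandas")
print(scaler1.output, scaler2.output)
```

With this shape, a single call on the outermost meta-estimator is enough, and estimators that cannot transform are skipped silently.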

ogrisel commented 2 years ago

+1.

amueller commented 2 years ago

ping @GaelVaroquaux ;)

thomasjpfan commented 2 years ago

I am also +1 on this SLEP. Including my vote, we have 12 in favor and 0 against, which means this enhancement proposal is accepted. Thank you everyone for making this possible!