scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP018 Pandas output for transformers with set_output #68

Closed thomasjpfan closed 2 years ago

thomasjpfan commented 2 years ago

This SLEP proposes a set_output API used to configure the output of transform. The overall idea is to use set_output(transform="pandas_or_sparse"), which outputs a pandas dataframe for dense data and a scipy sparse matrix for sparse data.

Use cases

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

Future Extensions

A Pandas DataFrame is not suitable for providing column names for sparse data because of performance issues, as shown in #16772. A future extension to this SLEP is a "pandas_or_namedsparse" option, which would use a scikit-learn-specific sparse container that subclasses SciPy's sparse matrices. This container holds the sparse data, the feature names, and the index, enabling pipelines with vectorizers without the performance issues:

pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    LogisticRegression(solver="liblinear"),
)
pipe.set_output(transform="pandas_or_namedsparse")

# feature names for logistic regression
pipe[-1].feature_names_in_

CC @amueller @glemaitre @lorentzenchr

ogrisel commented 2 years ago

If we introduce new (internal) containers to add named columns to sparse matrices, we could even have a container that has several column blocks of different types (e.g. pandas dataframe and sparse matrices, possibly with different dtypes), for instance to store the output of a column transformer without any a priori data conversion and this would allow the downstream estimator in the pipeline to materialize this input in an optimal way.

Some column wise estimators such as coordinate-descent linear models and tree models could even accept data with mixed column blocks representation.

thomasjpfan commented 2 years ago

I think it's possible to have a new custom container that holds several blocks. It could also adopt the dataframe interchange protocol, which allows the custom container to be converted to dataframes from other libraries.
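A multi-block container of the kind discussed here could be sketched as follows; `BlockContainer` and its attributes are hypothetical names for illustration, not a scikit-learn API:

```python
from dataclasses import dataclass

import numpy as np
from scipy import sparse


@dataclass
class BlockContainer:
    """Hypothetical container holding heterogeneous column blocks.

    Each block may be a dense ndarray or a SciPy sparse matrix; the
    container tracks one feature name per column across all blocks.
    """

    blocks: list
    feature_names: list

    def n_features(self) -> int:
        # Total width across all blocks, regardless of representation
        return sum(block.shape[1] for block in self.blocks)


dense_block = np.array([[1.0, 2.0], [3.0, 4.0]])
sparse_block = sparse.csr_matrix([[0.0, 5.0], [6.0, 0.0]])
container = BlockContainer(
    blocks=[dense_block, sparse_block],
    feature_names=["a", "b", "c", "d"],
)
print(container.n_features())  # 4
```

A downstream estimator could then choose per block whether to densify, keeping the column-transformer output free of any a priori conversion.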

Several dataframe libraries have already adopted the protocol: https://github.com/rapidsai/cudf/pull/90710, https://github.com/pandas-dev/pandas/pull/46141, https://github.com/vaexio/vaex/pull/1509, https://github.com/modin-project/modin/pull/4269.

There are two underlying issues with the protocol:

  1. Sparse columns as you noted in https://github.com/data-apis/dataframe-api/issues/55
  2. Custom dataframe container backed by a C-ordered ndarray. According to the specification, the column buffers must be contiguous. XREF: https://github.com/data-apis/dataframe-api/issues/39

In both cases, we can still store the blocks as CSR matrices and C-ordered ndarrays, but a copy must be made when __dataframe__ is called. This is exactly what pandas does in its dataframe protocol implementation.
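For reference, the protocol entry points mentioned above look like this in pandas (assuming pandas >= 1.5, where the interchange protocol is implemented):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# __dataframe__ returns an interchange object that other libraries can
# consume; pandas can likewise materialize any protocol-supporting object.
interchange = df.__dataframe__(allow_copy=True)
print(interchange.num_columns())         # 2
print(list(interchange.column_names()))  # ['x', 'y']

# Round-trip through the protocol back into a pandas DataFrame
roundtrip = pd.api.interchange.from_dataframe(df)
print(roundtrip.shape)                   # (3, 2)
```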

This SLEP

As noted in the dev meeting, this SLEP will focus on set_output and pandas output. If we were to create a custom container later, we could use the same set_output API to configure the transformer.

thomasjpfan commented 2 years ago

During the monthly developer meeting we decided to remove the "namedtensor" from this SLEP and stick with transform="pandas" for now. I am debating between two APIs for sparse data:

  1. transform="pandas", which will use pandas' sparse extension arrays for sparse data, which is inefficient. The pro is that the feature names are passed through.
  2. transform="pandas_or_sparse", which will output a SciPy sparse matrix for sparse data, which is more efficient. The con is that the feature names are not passed through.

The final solution is to have transform="pandas_or_namedsparse" where we have a custom sparse container with column names, which was the initial version of this SLEP. But since we want to reduce the scope, I am okay with option 1, 2 or both.
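The trade-off in option 1 can be seen with pandas' own sparse accessor; the column names below are illustrative:

```python
import pandas as pd
from scipy import sparse

X_sparse = sparse.csr_matrix([[0.0, 1.0], [2.0, 0.0]])

# Wrapping SciPy sparse data in pandas' sparse extension arrays keeps
# the feature names, at the cost of slower columnar operations.
df = pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=["f0", "f1"])
print(list(df.columns))   # ['f0', 'f1']
print(df.sparse.density)  # 0.5
```

Option 2 would skip this conversion entirely and hand the CSR matrix through untouched, which is why the names are lost.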

adrinjalali commented 2 years ago

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation, and only add it as an improvement later with a separate discussion thread.

thomasjpfan commented 2 years ago

An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation

I interpreted "not support sparse" to mean "not support sparse with feature names". Transformers will still be returning SciPy sparse data.

From an API point of view, setting set_output="pandas" but returning a SciPy sparse matrix feels a little counterintuitive. This led me down the two paths in https://github.com/scikit-learn/enhancement_proposals/pull/68#issuecomment-1153303691. set_output="pandas_or_sparse" means dense data becomes a DataFrame and SciPy sparse remains the same.

To move this forward, I updated the SLEP to say that set_output="pandas" only configures the output for dense data. In the future, we can extend it to configure sparse data using set_output="pandas_or_namedsparse", which is backward compatible.

The API will look a little strange if a user actually wants the Pandas Sparse DataFrame. In that case, set_output="pandas_or_pandas" would configure both dense and sparse to return a pandas DataFrame.

thomasjpfan commented 2 years ago

Thinking it over, the remaining option is to error when set_output="pandas" and the output is sparse. When it comes to expanding the API, we do not have to deal with the behavior described in https://github.com/scikit-learn/enhancement_proposals/pull/68#issuecomment-1154609993

I updated the SLEP to error if set_output="pandas" and the output is sparse.

lorentzenchr commented 2 years ago

I would like to go through an experimental period

+1. For me, it's fine to implement it this way without having it mentioned in the SLEP. However, if it helps to get agreement and voting faster, we can add it. In general, I find it a good idea to introduce great new features as experimental to have the chance to get them in, get experience with them, have some learnings and improve.

jnothman commented 2 years ago

This is the first time we introduce new public methods to set state on estimators (beyond set_params)

What about SLEP006?

adrinjalali commented 2 years ago

Came here to say what @jnothman said.

thomasjpfan commented 2 years ago

Yes, set_*_requests (SLEP006) is the first accepted proposal that configures estimator state beyond set_params and __init__. Still, I think the comment holds and set_output can be marked as experimental. It could be worth doing the same for set_*_requests, so we can make changes to it without deprecation.

As for me, I am happy with marking set_output as experimental.

adrinjalali commented 2 years ago

Making set_*_requests experimental would really complicate things, since it'd mean the whole metadata routing would be experimental, which means we'd have the experimental phase on top of the deprecation phase.

ogrisel commented 2 years ago

Making set_*_requests experimental would really complicate things, since it'd mean the whole metadata routing would be experimental, which means we'd have the experimental phase on top of the deprecation phase.

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

adrinjalali commented 2 years ago

We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.

If we tell users it's experimental, we should give them an option to use the library w/o the experimental features and w/o getting deprecation warnings, which is not really possible, because even if they don't use the experimental feature, they'll still get the deprecation warning.

If we want to tell them it's experimental, then we'd have to allow them to pass things the old way w/o warning them, and only after the experimental phase is over would the deprecation cycle begin. I really don't wanna go down that route.

lorentzenchr commented 2 years ago

Let's discuss only SLEP018 here. If there are no such intricacies as in the props/metadata case, starting with experimental makes sense to me.

amueller commented 2 years ago

@ogrisel I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it? Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else. But I think in particular for notebooks, people always want dataframes and they'll just put the global config flag at the top of their notebooks with their imports, and we should enable that.

thomasjpfan commented 2 years ago

I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?

Having a global configuration does introduce this inconsistency with third-party estimators. I prefer not to require it from third-party estimators, but it would be a better UX if they chose to use the global configuration.

Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else.

If users complain, then we direct them to the third party's repo? We already do this sometimes when a library does not define a scikit-learn-compatible estimator. I think third-party libraries will be motivated to use the global configuration anyway because of the better UX.

lorentzenchr commented 2 years ago

I think this SLEP should motivate why it considers pandas, and not any other dataframe library (such as pyarrow, polars, …), as the data container.

thomasjpfan commented 2 years ago

I updated the PR: