Closed by thomasjpfan 2 years ago.
If we introduce new (internal) containers that add named columns to sparse matrices, we could even have a container holding several column blocks of different types (e.g. a pandas dataframe and sparse matrices, possibly with different dtypes). Such a container could, for instance, store the output of a column transformer without any a priori data conversion, which would allow the downstream estimator in the pipeline to materialize this input in an optimal way.
Some column-wise estimators, such as coordinate-descent linear models and tree models, could even accept data with mixed column-block representations.
I think it's possible to have a new custom container that contains several blocks. It can also adopt the dataframe interchange protocol which allows the custom container to be converted to dataframes in other libraries.
Dataframe libraries have already adopted the protocol, e.g. cudf (https://github.com/rapidsai/cudf/pull/90710), pandas (https://github.com/pandas-dev/pandas/pull/46141), vaex (https://github.com/vaexio/vaex/pull/1509), and modin (https://github.com/modin-project/modin/pull/4269).
There are two underlying issues with the protocol:
In both cases, we can still store the blocks as CSR and C-ordered ndarrays, but a copy must be made when `__dataframe__` is called. This is exactly what pandas does in its implementation of the dataframe protocol.
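To make the protocol concrete: any object exposing `__dataframe__` can be materialized as a pandas DataFrame via `pandas.api.interchange.from_dataframe` (available since pandas 1.5). The sketch below only demonstrates the round-trip on a plain DataFrame; a hypothetical multi-block container would implement the same `__dataframe__` hook, paying the copy cost mentioned above for its sparse blocks.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# A plain DataFrame already implements the interchange protocol.
df = pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]})

# Consumers interact with the object returned by __dataframe__
# instead of depending on pandas internals.
interchange_obj = df.__dataframe__()
print(type(interchange_obj).__name__)

# Round-trip through the protocol; a copy may be made along the way.
roundtrip = from_dataframe(df)
print(roundtrip.equals(df))  # True
```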
As noted in the dev meeting, this SLEP will focus on `set_output` and pandas output. If we were to create a custom container, then we could use the same `set_output` API to configure the transformer.
During the monthly developer meeting we decided to remove the "namedtensor" from this SLEP and stick with `transform="pandas"` for now. I am debating between two APIs for sparse data:

1. `transform="pandas"`, which will use pandas' sparse extension arrays for sparse data, which is inefficient. The pro is that the feature names are passed through.
2. `transform="pandas_or_sparse"`, which will output a SciPy sparse matrix for sparse data, which is more efficient. The con is that the feature names are not passed through.

The final solution is to have `transform="pandas_or_namedsparse"`, where we have a custom sparse container with column names, which was the initial version of this SLEP. But since we want to reduce the scope, I am okay with option 1, option 2, or both.
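For reference, both representations under discussion can be built directly with today's libraries; this sketch only illustrates the trade-off (feature names vs. native sparse storage) and is not a scikit-learn API:

```python
import pandas as pd
from scipy import sparse

X = sparse.random(5, 3, density=0.4, format="csr", random_state=0)

# Option 1: pandas' sparse extension arrays keep feature names,
# but are slower to build and consume than a native SciPy matrix.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=["f0", "f1", "f2"])
print(df.columns.tolist())  # ['f0', 'f1', 'f2']

# Option 2: the raw SciPy CSR matrix is efficient,
# but carries no column names at all.
print(X.shape, X.getnnz())
```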
An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation, and only add it as an improvement later with a separate discussion thread.
> An alternative we talked about during the meeting was not to support sparse in this SLEP/implementation
I interpreted "not support sparse" to mean "not support sparse with feature names". Transformers will still be returning SciPy sparse data.
From an API point of view, setting `set_output="pandas"` but returning a SciPy sparse matrix feels a little counterintuitive. This led me down the two paths in https://github.com/scikit-learn/enhancement_proposals/pull/68#issuecomment-1153303691. `set_output="pandas_or_sparse"` means dense data becomes a DataFrame while SciPy sparse data remains the same.
To move this forward, I updated the SLEP to say that `set_output="pandas"` only configures the output for dense data. In the future, we can extend it to configure sparse data using `set_output="pandas_or_namedsparse"`, which is backward compatible.

The API will look a little strange if a user actually wants the pandas sparse DataFrame. In that case, `set_output="pandas_or_pandas"` would configure both dense and sparse data to return a pandas DataFrame.
Thinking it over, the remaining option is to error when `set_output="pandas"` and the output is sparse. When it comes to expanding the API, we do not have to deal with the behavior described in https://github.com/scikit-learn/enhancement_proposals/pull/68#issuecomment-1154609993

I updated the SLEP to error if `set_output="pandas"` and the output is sparse.
> I would like to go through an experimental period
+1. For me, it's fine to implement it this way without mentioning it in the SLEP. However, if it helps to reach agreement and vote faster, we can add it. In general, I find it a good idea to introduce big new features as experimental, so we have the chance to get them in, gain experience with them, learn from that, and improve.
This is the first time we introduce new public methods to set state on estimators (beyond `set_params`)
What about SLEP006?
Came here to say what @jnothman said.
Yes, `set_*_requests` is the first accepted SLEP that configures estimator state beyond `set_params` and `__init__`. Although, I think the comment still holds and `set_output` can be marked as experimental. It could be worth doing the same thing for `set_*_requests`, so we can make changes to it without deprecation.
As for me, I am happy with marking `set_output` as experimental.
Making `set_*_requests` experimental would really complicate things, since it'd mean the whole metadata routing would be experimental, which means we'd have the experimental phase on top of the deprecation phase.
> Making `set_*_requests` experimental would really complicate things, since it'd mean the whole metadata routing would be experimental, which means we'd have the experimental phase on top of the deprecation phase.
We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.
> We can just document it as experimental (in the docstring and the changelog) without adding a complex explicit acknowledgement mechanism to enable the feature in the code itself.
If we tell users it's experimental, we should give them the option to use the library without the experimental features and without getting deprecation warnings, which is not really possible: even if they don't use the experimental feature, they'll still get the deprecation warning.

If we want to tell them it's experimental, then we'd have to allow them to pass things the old way without warning them, and only after the experimental phase is over would the deprecation cycle begin. I really don't want to go down that route.
Let's discuss SLEP018 only here. If there are no such intricacies as in the props/metadata case, starting as experimental makes sense to me.
@ogrisel I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it? Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else. But I think in particular for notebooks, people always want dataframes and they'll just put the global config flag at the top of their notebooks with their imports, and we should enable that.
> I'd also love to see a global flag, but then the erroring if it's not supported is a bit more tricky. Like would you require third party estimators to check the global flag and error if they don't support it?
Having a global configuration does introduce this inconsistency with third party estimators. I prefer not to require it from third party estimators, but it would be a better UX if they choose to use the global configuration.
> Maybe that's ok, and if users complain and/or it's unworkable we can figure out something else.
If users complain, then we direct them to the third party's repo? We already do this sometimes when a library does not define a scikit-learn compatible estimator. I think third party libraries will be motivated to use the global configuration anyway because of the better UX.
I think this SLEP should motivate why it considers pandas, and not any other dataframe library (such as pyarrow, polars, …), as the data container.
I updated the PR:

- 1d57415 (#68): Adds details about meta-estimators and fitted and non-fitted inner transformers. (`set_output` could later be extended so users can configure `transform` to return another DataFrame implementation, but this is outside the scope of this SLEP.)
This SLEP proposes a `set_output` API used to configure the output of `transform`. The overall idea is to use `set_output(transform="pandas_or_sparse")`, which outputs a pandas dataframe for dense data and a SciPy sparse matrix for sparse data.

Use cases
I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:
Future Extensions
The pandas DataFrame is not suitable to provide column names for sparse data because it has performance issues, as shown in #16772. A future extension to this SLEP is to have a `"pandas_or_namedsparse"` option. This option will use a scikit-learn specific sparse container that subclasses SciPy's sparse matrices. This sparse container includes the sparse data, feature names, and index. This enables pipelines with vectorizers without performance issues.

CC @amueller @glemaitre @lorentzenchr
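A minimal sketch of what such a named sparse container could look like. The class and attribute names (`NamedCSR`, `columns`) are hypothetical, not scikit-learn API; the idea is simply to subclass SciPy's CSR matrix and attach column names:

```python
import scipy.sparse as sp


class NamedCSR(sp.csr_matrix):
    """Hypothetical CSR subclass carrying column names (illustration only)."""

    def __init__(self, arg1, columns=None, **kwargs):
        super().__init__(arg1, **kwargs)
        # Plain Python attribute; sparse arithmetic on the result
        # would return ordinary matrices and drop the names.
        self.columns = None if columns is None else list(columns)


X = NamedCSR(sp.eye(3, format="csr"), columns=["a", "b", "c"])
print(X.shape, X.columns)  # (3, 3) ['a', 'b', 'c']
```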