scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP015: Feature Names Propagation #48

Closed thomasjpfan closed 1 year ago

thomasjpfan commented 4 years ago

This SLEP details how a feature_names_in_ attribute and a get_feature_names_out method can be used together to obtain feature name propagation. I see there are two main goals for having this feature:

  1. Inspect the feature names passed into the last step of a pipeline. This is addressed as the motivating problem in the SLEP (see the sketch below this list).
  2. Be able to use pandas DataFrames or xarray DataArrays as input for estimators in a pipeline. This allows estimators to do something special based on the name of a feature. This is addressed in the "Enabling Functionality" section with a new parameter in Pipeline.
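
A minimal sketch of goal 1, assuming the feature_names_in_ attribute and get_feature_names_out method proposed here (as they later landed in scikit-learn), with a toy DataFrame for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20, 30, 40, 50], "height": [1.6, 1.7, 1.8, 1.9]})
y = [0, 0, 1, 1]

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Names the pipeline saw during fit, extracted from the DataFrame columns.
print(pipe.feature_names_in_)             # ['age' 'height']

# Names entering the last step: slice off the final estimator and ask the
# remaining transformers for their output names.
print(pipe[:-1].get_feature_names_out())  # ['age' 'height']
```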

Related to https://github.com/scikit-learn/scikit-learn/pull/18444 - get_feature_names_out PR
Related to https://github.com/scikit-learn/scikit-learn/pull/16772 - array_out PR
Related to https://github.com/scikit-learn/scikit-learn/pull/18010 - feature_names_out_ PR

CC @scikit-learn/core-devs

thomasjpfan commented 4 years ago

@adrinjalali I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?

One of the reasons I wrote this SLEP was https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-699960961. It highlighted the performance issues for sparse data which is related to the InputArray idea.

One can consider this SLEP a stepping stone toward passing feature names down the pipeline. Furthermore this SLEP is a prerequisite for the array_out PR:

  1. feature_names_in_ is required to make sure fit and transform have the same feature names.
  2. get_feature_names_out needs to be implemented everywhere in the array_out PR, but privately.

This SLEP extends Andy's idea by including the interaction between feature_names_in_ and get_feature_names_out. Since every estimator has the ability to store feature_names_in_, users would not need to pass the input names to get_feature_names_out, whether in a pipeline or with a single estimator, to get the output names.

Yes, not every step in a pipeline would have the names, but these names can be obtained through slicing (see the sketch below), and if an estimator needs the names in fit, we can add a parameter in Pipeline to tell it to construct the dataframe that it will pass to one of its steps. (This is an idea discussed in this SLEP but not a part of this SLEP.)
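
A minimal sketch of the slicing idea mentioned above, assuming get_feature_names_out is available on the transformers in the pipeline; PolynomialFeatures is used only as an example transformer:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
y = [0.1, 0.2, 0.3, 0.4]

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge()).fit(X, y)

# The names entering step i are the output names of the sub-pipeline pipe[:i].
print(pipe[:1].get_feature_names_out())  # names entering PolynomialFeatures: ['a' 'b']
print(pipe[:2].get_feature_names_out())  # names entering Ridge: ['1' 'a' 'b' 'a^2' 'a b' 'b^2']
```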

TLDR: I think this SLEP is simpler and resolves some of the pain points we have with feature names.

thomasjpfan commented 4 years ago

Updated the SLEP to state that Pipeline will have feature_names_in_.

thomasjpfan commented 4 years ago

This SLEP proposes adding the feature_names_in_ attribute to all estimators that will extract the feature names of X during fit.

I was thinking of the "outer"-most layer of the API. For non-meta estimators, feature_names_in_ would be defined if the feature names can be extracted from X.

For metaestimators, such as Pipeline, pipe.feature_names_in_ would be defined, but we would not guarantee that inner estimators have feature_names_in_ defined.

jnothman commented 4 years ago

I don't really get how you define "outer" or would ensure that from an implementation perspective, unless either feature_names_in_ was being set not by an estimator in its fit method, but by the caller of that estimator's fit method; or, if feature_names_in_ was only set when X was in a format that had names attached. Neither of these limitations is discussed in the SLEP afaict.

thomasjpfan commented 4 years ago

Neither of these limitations is discussed in the SLEP afaict.

Thank you for your thoughts, I'll update the SLEP accordingly.

I don't really get how you define "outer"

I was using "outer" and "inner" in context of a metaestimator. The metaestimator would be the "outer" estimator, while all estimators inside the meta estimators are "inner" estimators.

if feature_names_in_ was only set when X was in a format that had names attached.

This is the form I was considering.

unless either feature_names_in_ was being set not by an estimator in its fit method, but by the caller of that estimator's fit method

A metaestimator could have all sorts of estimators that it uses internally. I think having the metaestimator be responsible for setting feature_names_in_ for all its estimators would be too much of a requirement for metaestimator developers.

This SLEP is trying to propose the bare minimum: if X has names and is passed to fit, then the estimator stores them as feature_names_in_.

Metaestimators are a case where I would want them to delegate this responsibility to their inner estimators. This way the metaestimator can construct metaestimator.feature_names_in_ for itself. In the case of Pipeline, it would be the feature_names_in_ of the first step.
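
A sketch of the delegation described here, assuming the behaviour that scikit-learn eventually implemented (Pipeline.feature_names_in_ is taken from the first step, and later steps receive plain arrays):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
y = [0, 1, 0, 1]

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# The pipeline delegates: its feature_names_in_ is that of its first step.
assert list(pipe.feature_names_in_) == list(pipe[0].feature_names_in_) == ["a", "b"]

# Later steps receive plain ndarrays from the scaler, so they never see names.
assert not hasattr(pipe[-1], "feature_names_in_")
```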

thomasjpfan commented 4 years ago

specify the data type of feature_names_in_ and the return type of get_feature_names_out: are they necessarily lists? any Sequence? are arrays permitted/required?

I was considering any Sequence: ffd0954 (#48).

@jnothman I have updated the SLEP to address your concerns:

  1. Regarding meta-estimators: 7537b15 (#48)
  2. Abstract now explicitly states that feature_names_in_ are set in fit.
  3. An example of how slicing can fail with a pipeline with one step was added.

jnothman commented 4 years ago

So now my understanding (from conversation, not reviewing the SLEP) is that feature_names_in_ should be set directly by every estimator that cannot delegate it.

Let's clarify:

thomasjpfan commented 4 years ago

When input is a list of strings, feature_names_in_ = None.

If we want to be consistent with n_feature_names this would be the case. I think I would prefer it to not be defined, which means there is no validation in non-fit methods.

When input is a pd.DataFrame, feature_names_in_ = X.columns. Or is it X.columns.copy()?

It would be safer to copy, but I would prefer not to.

When input is a 2d array or sparse matrix, feature_names_in_ is what?

Not defined and will not validate.
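
A sketch of the behaviour being converged on in this exchange, matching what scikit-learn ended up doing: feature_names_in_ is only set when X carries column names, while n_features_in_ is always set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_np = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_df = pd.DataFrame(X_np, columns=["a", "b"])

est = StandardScaler().fit(X_np)
assert not hasattr(est, "feature_names_in_")      # ndarray input: no names to extract

est = StandardScaler().fit(X_df)
assert list(est.feature_names_in_) == ["a", "b"]  # DataFrame input: names stored
assert est.n_features_in_ == 2                    # always defined after fit
```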

Do we provide ducktyped support for anything with .columns?

In the absence of a cross-library API for frame-like implementations, what happens when we don't support extracting column names from some data type in one version, but then add the capability in the next release? How does that affect other behaviours?

I think the safest thing to do is to restrict this SLEP to pandas until we get a frame-like protocol. When this frame-like protocol is defined, then we can say we support frame-like objects.

How does a third party library know which data types are supported for storing column names? Do we give them a helper function to store/generate names in all supported cases?

In https://github.com/scikit-learn/scikit-learn/pull/18010, I am pushing for using _validate_data to handle this. Internally, a third party estimator can also use _check_feature_names.
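
A sketch of what a third-party estimator could do without relying on the private _validate_data helper; MyTransformer is hypothetical and restricts name extraction to pandas, as suggested above:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MyTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical third-party transformer that stores feature names itself."""

    def fit(self, X, y=None):
        # Only store the attribute when X actually carries column names;
        # restrict to pandas for now rather than duck-typing on .columns.
        if isinstance(X, pd.DataFrame):
            self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        return np.asarray(X)
```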

I think we should adjust the SLEP to make feature_names_in_ optional. We can increase its adoption by demonstrating how useful it is with pipelines and other meta-estimators.

jnothman commented 4 years ago

But _validate_data isn't public. Since meta-estimators need to handle the case that the attribute is absent or None in any case, making this optional seems reasonable.
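
A sketch of the optional-attribute handling suggested here; MyMeta is a hypothetical meta-estimator, not part of scikit-learn:

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class MyMeta(MetaEstimatorMixin, BaseEstimator):
    """Hypothetical meta-estimator that copies names from its inner estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        # The attribute is optional: getattr covers both "absent" and "None",
        # so unnamed input (lists, ndarrays, sparse matrices) is handled gracefully.
        names = getattr(self.estimator_, "feature_names_in_", None)
        if names is not None:
            self.feature_names_in_ = names
        return self
```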

lorentzenchr commented 2 years ago

Is this PR superseded by acceptance of SLEP007 in #59?

adrinjalali commented 2 years ago

Is this PR superseded by acceptance of SLEP007 in #59?

No, that one is about how the feature names are generated, this one is about how they're propagated in a pipeline.

lorentzenchr commented 2 years ago

Does this supersede #18, meaning it's an alternative?

adrinjalali commented 2 years ago

I'd say this is yet another effort to do the same thing as #18 wanted to do, but much more up to date, after having gone down a few paths and not getting anywhere. I would need to read both again to say if this one is an alternative or just supersedes the other one.

adrinjalali commented 2 years ago

Also, I'd be happy to merge both of them and continue discussing on a separate issue.

thomasjpfan commented 2 years ago

A bunch of SLEP015 is already included in SLEP007, which has already been accepted. What SLEP015 adds is an API for actually outputting pandas DataFrames.

I think https://github.com/scikit-learn/enhancement_proposals/pull/18 (SLEP008) has more or less been superseded by SLEP007.
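
For reference, a minimal sketch of the pandas-output API that eventually landed instead via SLEP018 (the set_output API, assuming scikit-learn >= 1.2):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Ask the transformer to return a DataFrame instead of an ndarray.
scaler = StandardScaler().set_output(transform="pandas")
X_out = scaler.fit_transform(X)

print(type(X_out))          # <class 'pandas.core.frame.DataFrame'>
print(list(X_out.columns))  # ['a', 'b']
```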

adrinjalali commented 2 years ago

Back when we were writing SLEPs 7 and 8, I wrote 7 only to talk about how we create the feature names, and 8 was more about the options we have to propagate them. Since then, I worked on a sklearn dataframe kinda object which we decided not to do, then worked on xarray, then Thomas worked on pandas, and the API also kinda evolved over time. I think we could at least focus on this SLEP for now, and then figure out what to do with other containers. We've gone back and forth on which container to use, or whether to use one at all versus propagating feature names alongside the data rather than inside a container.

amueller commented 1 year ago

This has been superseded by SLEP 18 #72 right?

thomasjpfan commented 1 year ago

I updated this PR so that this SLEP is now rejected.

adrinjalali commented 1 year ago

Thanks @thomasjpfan . Then I think we can merge.