Closed thomasjpfan closed 1 year ago
@adrinjalali I'm a bit confused as what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?
One of the reasons I wrote this SLEP was https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-699960961. It highlighted the performance issues for sparse data which is related to the InputArray
idea.
One can consider this SLEP a stepping stone toward passing feature names down the pipeline. Furthermore this SLEP is a prerequisite for the array_out
PR:
feature_names_in_
is required to make sure fit
and transform
has the same feature names.get_feature_names_out
needs to be implemented everywhere in the array_out
PR, but privately.This SLEP extends Andy's idea by including the interaction between feature_names_in_
and get_feature_names_out
. Since every estimator has the ability to store feature_names_in_
users would not need to pass in the input names in get_feature_names_out
in a pipeline or with a single estimator to get the output names.
Yes every step in a pipeline would not have the names, but these names can be obtained through slicing and if the estimator needs the name in fit
, we can add a parameter in pipeline to tell it to construct the dataframe which it will pass to one of its steps. (This is an idea proposed in this SLEP but not a part of this SLEP)
TLDR: I think this SLEP is simpler and resolves some of the pain points we have with feature names.
Updated SLEP is state that the Pipeline
will have feature_names_in_
.
This SLEP proposes adding the feature_namesin attribute to all estimators that will extract the feature names of X during fit.
I was thinking of the "outer" most layer of the API. For non-meta estimators, feature_names_in_
would be defined if the feature names can be extract from X
.
For metaestimators, such as pipeline, they would have pipe.feature_names_in_
defined but we would not guarantee that inner estimators would have feature_names_in_
defined.
I don't really get how you define "outer" or would ensure that from an implementation perspective, unless either feature_namesin was being set not by an estimator in its fit method, but by the caller of that estimator's fit method; or, if feature_namesin was only set when X was in a format that had names attached. Neither of these limitations are discussed in the SLEP afaict.
Neither of these limitations are discussed in the SLEP afaict.
Thank you for your thoughts, I'll update the SLEP accordingly.
I don't really get how you define "outer"
I was using "outer" and "inner" in context of a metaestimator. The metaestimator would be the "outer" estimator, while all estimators inside the meta estimators are "inner" estimators.
if feature_namesin was only set when X was in a format that had names attached.
This is the form I was considering.
unless either feature_namesin was being set not by an estimator in its fit method, but by the caller of that estimator's fit method
An metaestimator could have all sorts of estimators that it uses internally. I think having the metaestimator be responsible for setting feature_names_in_
for all its estimators would be too much of a requirement for metaestimator developers.
This SLEP is trying to proposal the bare minimum: If X
has names and is passed to fit
, then the estimator stores them as feature_names_in_
.
Metaestimators is a case where I would want it to delegate this responsibly to its inner estimators. This way the metaestimator can construct metaestimator.feature_names_in_
for itself. In the case of Pipeline
, it would be the feature_names_in_
of the first step.
specify the data type of feature_namesin and the return type of get_feature_names_out: are they necessarily lists? any Sequence? are arrays permitted/required?
I was considering any Sequence
: ffd0954
(#48).
@jnothman I have updated the SLEP to addresss your concerns:
7537b15
(#48)feature_names_in_
are set in fit
.So now my understanding (from conversation, not reviewing the SLEP) is that feature_names_in_
should be set directly by every estimator that cannot delegate it.
Let's clarify:
feature_names_in_ = None
.feature_names_in_ = X.columns
. Or is it X.columns.copy()
?.columns
?feature_names_in_
is what?When input is a list of strings, feature_namesin = None.
If we want to be consistent with n_feature_names
this would be the case. I think I would prefer it to not be defined, which means there is no validation in non-fit
methods.
When input is a pd.DataFrame, feature_namesin = X.columns. Or is it X.columns.copy()?
It would be safer to copy, but I would prefer not to.
When input is a 2d array or sparse matrix, feature_namesin is what?
Not defined and will not validate.
Do we provide ducktyped support for anything with .columns?
In the absence of a cross-library API for frame-like implementations, what happens when we don't support extracting column names from some data type in one version, but then add the capability in the next release. How does that affect other behaviours?
I think the safest thing to do is to restrict this SLEP to pandas until we get a frame-like protocol. When this frame-like protocol is defined, then we can say we support frame-like objects.
How does a third party library know which data types are supported for storing column names? Do we give them a helper function to store/generate names in all supported cases?
In https://github.com/scikit-learn/scikit-learn/pull/18010, I am pushing for using _validate_data
to handle this. Internally, a third party estimator can also use _check_feature_names
.
I think we should adjust the SLEP to make feature_names_in_
optional. We can increase its adoption by demonstrating how useful it is with pipelines and other meta-estimators.
But _validate_data isn't pubic. Since meta estimators need to handle the case that the attribute is absent or None in any case, making this optional seems reasonable.
Is this PR superseded by acceptance of SLEP007 in #59?
Is this PR superseded by acceptance of SLEP007 in #59?
No, that one is about how the feature names are generated, this one is about how they're propagated in a pipeline.
Does this supersede #18, meaning it's an alternative?
I'd say this is yet another effort to do the same thing as #18 wanted to do, but a lot more updated, after having gone down a few paths and not getting anywhere. I would need to read both again to say if this one is an alternative or just supersedes the other one.
Also, I'd be happy to merge both of them and continue discussing on a separate issue.
A bunch of SLEP015 is already include in SLEP007 which has already been accepted. What SLEP015 adds is an API for actually outputting pandas DataFrames.
I think https://github.com/scikit-learn/enhancement_proposals/pull/18 (SLEP008) has more or less been superseded by SLEP007.
back when we were writing sleps 7 and 8, I wrote 7 only to just talk about how we create the feature names, and 8 was more about the options we have to propagate them. Since then, I worked on a sklearn dataframe kinda object which we decided not to do, then worked on xarray, then Thomas worked on pandas, and the API also kinda evolved over time. I think we could at least focus on this SLEP for now, and then figure out what to do with other containers. We've gone back and forth with which container to use, or to use one at all, instead of propagating feature names alongside the data instead of with the data in a container.
This has been superseded by SLEP 18 #72 right?
I updated this PR so that this SLEP is now rejected.
Thanks @thomasjpfan . Then I think we can merge.
This SLEP details how a
feature_names_in_
attribute and aget_feature_names_out
method can be used together to obtain feature name propagation. I see there are two main goals for having this feature:Pipeline
.Related to https://github.com/scikit-learn/scikit-learn/pull/18444 -
get_feature_names_out
PR Related to https://github.com/scikit-learn/scikit-learn/pull/16772 -array_out
PR Related to https://github.com/scikit-learn/scikit-learn/pull/18010 -feature_names_out_
PRCC @scikit-learn/core-devs