scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP007 - feature names, their generation and the API #17

Closed: adrinjalali closed this 4 years ago

adrinjalali commented 5 years ago

Since it seems the discussion in https://github.com/scikit-learn/scikit-learn/pull/13307 requires a SLEP to decide on the style of feature names, and we talked about lowering the bar for submitting a SLEP, I took the liberty of opening one for us to discuss the styles and hopefully soon come to a consensus/conclusion.

amueller commented 5 years ago

My biggest concern right now is what the feature names should actually be, and I think my doubts are mostly around creating new feature names within pipelines and column transformer.

amueller commented 5 years ago

I don't like the pca__pca0 name because it's redundant and seems like it would be a common occurrence. But maybe I'm too worried about it?

Maybe explaining better the mechanism for feature names would make it clearer? I had posted some discussion in https://github.com/scikit-learn/scikit-learn/pull/12627

There I talked about three cases:

a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection.
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation to the columns, like PCA.

I guess a slight confusion I have right now is because there is another case, the ColumnTransformer, which also creates new feature names by first prepending the transformer name and then passing through the column names - that's different from pipelines, which don't prepend the estimator name.

My question right now is: do we always prepend the estimator name in ColumnTransformer, even if it generates things like pca__pca0 or standardscaler__sepallength?
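(For concreteness, here is roughly what that prefixing produces. This sketch uses the get_feature_names_out API that scikit-learn eventually shipped in 1.0, so the exact spelling is illustrative rather than what existed at the time of this discussion.)

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"sepallength": [5.1, 4.9, 4.7],
                  "sepalwidth": [3.5, 3.0, 3.2]})
ct = make_column_transformer(
    (StandardScaler(), ["sepallength"]),    # names its output after the column
    (PCA(n_components=1), ["sepalwidth"]),  # generates new names: pca0, ...
)
ct.fit(X)
ct.get_feature_names_out()
# array(['standardscaler__sepallength', 'pca__pca0'], dtype=object)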

adrinjalali commented 5 years ago

regarding the redundancy of the feature names, we could: 1) add the prefixes only when the generated names would otherwise not be unique, or 2) let the user pass a callable that generates the final names.

amueller commented 5 years ago

Not sure I really get the 2nd. The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

GaelVaroquaux commented 5 years ago

I've been playing with dask and pandas on very large datasets with many inconsistencies. The experience has really reminded me that code that does not have very consistent behavior leads to nasty bugs.

adrinjalali commented 5 years ago

The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?

more like:

# collect all the names, and only add prefixes if they would collide
feature_names = []
for transformer in transformers:
    feature_names.extend(transformer.get_names())  # flatten into one list

if len(set(feature_names)) == len(feature_names) or not enforce_unique_names:
    return feature_names
else:
    return append_prefixes_to_feature_names(feature_names)

Not sure I really get the 2nd.

something along the lines of:

feature_names = {}
for transformer in transformers:
    feature_names[transformer] = transformer.get_names()
if fname_transformer is not None:  # user-provided callable
    return fname_transformer(feature_names)

amueller commented 5 years ago

@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?

amueller commented 5 years ago

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant? Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

I don't like 1) because it means that a feature name of an existing feature can change when I add additional features. That's quite counter-intuitive. Though I guess you could argue the same is true for make_column_transformer names?
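(To spell out the two callables mentioned above: a minimal sketch, where fname_transformer is the hypothetical parameter under discussion and a __ separator is added to the first variant to match the ColumnTransformer convention; none of this is existing scikit-learn API.)

def prefix(t_name, f_name):
    return t_name + "__" + f_name  # replicate the ColumnTransformer prefixing

def passthrough(t_name, f_name):
    return f_name  # keep the inner transformer's name as-is

pairs = [("standardscaler", "sepallength"), ("pca", "pca0")]
[prefix(t, f) for t, f in pairs]       # ['standardscaler__sepallength', 'pca__pca0']
[passthrough(t, f) for t, f in pairs]  # ['sepallength', 'pca0']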

GaelVaroquaux commented 5 years ago

@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?

Yes.

GaelVaroquaux commented 5 years ago

No, you can not, as the comment suggests: this proposal is about creating feature names, not doing "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer can not use column names.

Point taken. Thank you.

amueller commented 5 years ago

I'm a bit annoyed at the duplication of the standard scaler, but I feel it would be hard to make this work better without blowing up the scope of the proposal. It does highlight a shortcoming of the current design, but that shortcoming is unrelated to the feature name proposal; it's "only" related to dealing with pandas data frames.

adrinjalali commented 5 years ago

Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?

Those are the most reasonable options I can think of.

There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names? It would be consistent throughout the package, so @GaelVaroquaux 's concern would be addressed I guess, and I suppose it would also fix your concerns with the verbose feature names, @amueller? Of course by default they're pretty verbose, but by passing an appropriate value for the parameter to some of the transformers/pipelines, the generated feature names would make sense?

@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?

Yes, exactly, and we can even get it from self if we add a Mixin? Shouldn't be too much change.

amueller commented 5 years ago

Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names?

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

adrinjalali commented 5 years ago

parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).

A parameter used by feature_names_out(). I was thinking of having it in each estimator; it'd be easier for the user to tune the parameters. Otherwise, if we have it only for Pipeline and ColumnTransformer, then we'd need to provide a way for the user to specify maybe an {estimator: policy} dict, which doesn't seem easier from the user side to me.

What we can do is to expose a function set_ofnames_policy() in the Mixin, so that it can easily be discovered by the user (see the sketch below). Then we can start adding the param to estimators' __init__, beginning with the ones which would need it most frequently. But it doesn't have to be in all of them to begin with.
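(A rough sketch of what such a mixin could look like; set_ofnames_policy is the name floated above, while the policy values and the helper method are made up for illustration. Nothing here is actual scikit-learn API.)

class OutputFeatureNamesMixin:
    """Hypothetical mixin carrying an output-feature-name policy."""

    _ofnames_policy = "prefix"  # assumed default: verbose, prefixed names

    def set_ofnames_policy(self, policy):
        # "prefix" would give e.g. "pca__pca0", "plain" just "pca0"
        if policy not in ("prefix", "plain"):
            raise ValueError(f"unknown policy: {policy!r}")
        self._ofnames_policy = policy
        return self  # allow chaining, like set_params

    def _generate_names(self, step_name, feature_names):
        if self._ofnames_policy == "prefix":
            return [f"{step_name}__{f}" for f in feature_names]
        return list(feature_names)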

amueller commented 5 years ago

ok so you mean adding a parameter to the constructor.

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

adrinjalali commented 5 years ago

I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.

If we accept the extra parameter to the constructor, then we don't need the dict. The dict was the alternative: if we reject the proposal to have the constructor argument and want the policy argument available only on a few objects such as ColumnTransformer, Pipeline, etc., then we need the dict for more granular control over the policies of each underlying object. To be clear, this is not what I'm proposing, and I personally dislike it.

thomasjpfan commented 5 years ago

A random thought: with this approach the following will have the same feature names

SelectKBest before PCA

Three of the original features were passed into the PCA.

# imports added so the snippet is self-contained
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

pipe = make_pipeline(StandardScaler(), SelectKBest(k=3), PCA(n_components=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_  # input_features_ is the attribute proposed in this SLEP
# array(['pca0', 'pca1'], dtype='<U4')

SelectKBest after PCA

All the original features are passed into PCA, and two of the PCA features are selected.

pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')

amueller commented 5 years ago

I don't think the fact that they have the same feature names is an issue. I think the issue is that information is lost in the first case.

adrinjalali commented 5 years ago

A random thought: with this approach the following will have the same feature names

I think that's fine, the combination of feature names with the new pipeline visualizer should be enough for people to follow what happens inside.

amueller commented 5 years ago

@adrinjalali but in the first case there's no way for you to know what the selection did.

adrinjalali commented 4 years ago

@adrinjalali but in the first case there's no way for you to know what the selection did.

pipe[1].feature_names_out_ will be the original features selected by that transformer. Is that not enough?

amueller commented 4 years ago

@adrinjalali That depends a lot on what we consider enough, I think. It means that if the user plots the feature names in the "standard" way, by looking at what goes into LogisticRegression, they'll miss part of the story. But to provide the full information we would need much more elaborate/verbose feature names (like pca1(x_1, x_2)). If there are many features this will quickly get out of hand, though...

amueller commented 4 years ago

@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)? That would allow tracking "everything" but might become very verbose.

thomasjpfan commented 4 years ago

I have a PR for ColumnTransformer https://github.com/scikit-learn/scikit-learn/pull/11639, it was needed to make inverse_transform work.

adrinjalali commented 4 years ago

I rewrote this SLEP to focus on how the names are generated. I have kept them simple, i.e. the user will not see a very verbose feature name at the end, since I don't think it's feasible to have those as soon as there's a PCA in the pipeline. At least the datasets I have dealt with always have too many features for that to work.

amueller commented 4 years ago

Can you maybe reread https://github.com/scikit-learn/enhancement_proposals/pull/17#issuecomment-480413603

The SLEP should answer that question and/or address that concern. If we decide to have pca__pca0 and standardscaler__sepallength as feature names that's ok, but it's something that will be produced by the current proposal and that's not mentioned at all.

adrinjalali commented 4 years ago

The SLEP should answer that question and/or address that concern. If we decide to have pca__pca0 and standardscaler__sepallength as feature names that's ok, but it's something that will be produced by the current proposal and that's not mentioned at all.

Now the SLEP says:

This is the default behavior, and it can be tuned by constructor parameters if
the meta estimator allows it. For instance, a ``prefix=False`` may indicate
that a ``ColumnTransformer`` should not prefix the generated feature names with
the name of the step.

WDYT?

I also did read through your original PR's description and borrowed quite a bit from there now.
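(As an aside, what scikit-learn eventually shipped is close to this: ColumnTransformer grew a verbose_feature_names_out constructor parameter. Reusing the ct/X setup from the earlier sketch, roughly:)

ct = make_column_transformer(
    (StandardScaler(), ["sepallength"]),
    (PCA(n_components=1), ["sepalwidth"]),
    verbose_feature_names_out=False,  # don't prefix with the step name
)
ct.fit(X)
ct.get_feature_names_out()
# array(['sepallength', 'pca0'], dtype=object)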

jnothman commented 4 years ago

@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)?

At one point I proposed a (sparse) matrix provided by each transformer to indicate which inputs corresponded to each output... Another approach would allow us to interrogate precisely how each feature was constructed, but that may be a lot of work for little gain.
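(A minimal sketch of that indicator-matrix idea; the shape convention, rows for outputs and columns for inputs, is made up for illustration and was never merged into scikit-learn.)

import numpy as np
from scipy import sparse

# a True entry means that input column feeds that output column,
# e.g. a selector keeping columns 0 and 2 of 4 inputs:
corr = sparse.csr_matrix(np.array([[1, 0, 0, 0],
                                   [0, 0, 1, 0]], dtype=bool))
input_names = np.array(["x0", "x1", "x2", "x3"])
for i in range(corr.shape[0]):
    print(f"output {i} <- {input_names[corr[i].toarray().ravel()]}")
# output 0 <- ['x0']
# output 1 <- ['x2']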

adrinjalali commented 4 years ago

@amueller sorry, hadn't pushed the changes before your latest review.

amueller commented 4 years ago

Actually, I think there are 3 cases for non-meta transformers:

1) Each output feature corresponds to a single input feature. Then just use that input feature.
2) Each output feature is a simple function of a small number of input features. Then the output feature is some representation of the relationship.
3) Each output feature depends on all (or most of) the input features: create new names.

There is some slight tension between 1) and 2), I think. For example, if you squared every entry you'd keep the same feature names, but the entry corresponding to a squared feature in PolynomialFeatures is called x**2. I think 2) and 3) are clear-cut and we are struggling with 1) a bit. Adding a parameter to all of them would be a possible solution but I'm not sure if it's elegant.
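(For reference, a small example of case 2) using the names PolynomialFeatures actually generates; this uses the get_feature_names_out spelling from current scikit-learn, where the squared feature comes out as a^2 rather than the a**2 written above.)

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2, include_bias=False)
pf.fit(np.arange(6).reshape(3, 2))
pf.get_feature_names_out(["a", "b"])
# array(['a', 'b', 'a^2', 'a b', 'b^2'], dtype=object)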

adrinjalali commented 4 years ago

The main remaining question for me is whether we want to add prefix_feature_names to other transformers as well. It seems verbose but I don't see another way to get log(x0) if that's something that you want.

why does it have to be in all transformers? I thought it makes sense to start with the ones we're not sure what the behavior should be, and whether users want different behaviors or not.

1) Each output feature corresponds to a single input feature. Then just use that input feature.
2) Each output feature is a simple function of a small number of input features. Then the output feature is some representation of the relationship.
3) Each output feature depends on all (or most of) the input features: create new names.

But this misses the log(x0) case for a log transformer, for instance; and I think it'd be nice to have the option to have those verbose feature names.
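(For what it's worth, scikit-learn 1.1 later addressed exactly this via a feature_names_out callable on FunctionTransformer; a minimal sketch, with the log(...) spelling chosen here just for illustration.)

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_tf = FunctionTransformer(
    np.log1p,
    feature_names_out=lambda transformer, names: [f"log({n})" for n in names],
)
log_tf.fit(np.ones((3, 2)))
log_tf.get_feature_names_out(["x0", "x1"])
# array(['log(x0)', 'log(x1)'], dtype=object)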

adrinjalali commented 4 years ago

The discussion has become pretty long, and I'm getting lost. Could we please merge this PR, and have the other required changes in different PRs, maybe by you @amueller ? ;)

NicolasHug commented 4 years ago

Please update the title to something less generic than "Feature Names", since other SLEPs are dealing with feature names. The PR title is a great candidate.

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate, I'm certain to have made this comment yesterday but I don't see it anymore)

adrinjalali commented 4 years ago

Also, please add an abstract section at the top, like the previous SLEPs or the template.

(Sorry if this is a duplicate, I'm certain to have made this comment yesterday but I don't see it anymore)

I'd rather have that in a separate PR, by you maybe? ;) I can't think of any good text for that.

glemaitre commented 4 years ago

OK, I will catch up with all the discussion and then merge, such that we can open subsequent PRs on specific points of the SLEP.

NicolasHug commented 4 years ago

Why? We can address the remaining points here, if any. The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many

glemaitre commented 4 years ago

The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many.

I have to go through 133 messages and unstack them. For me, this is exactly the point raised by @adrinjalali in https://github.com/scikit-learn/enhancement_proposals/pull/17#discussion_r377676740, which makes it difficult to catch up with a SLEP that I haven't looked at in a while.

NicolasHug commented 4 years ago

In the case of this PR there's only one discussion that hasn't been resolved in the diff.

I have to go through 133 messages and unstack them

I agree this is a problem. Personally, I try to provide timely feedback when I'm reviewing, and also try to address comments fast when I'm the author. It doesn't always work out though. Ideally, we would prioritize PRs better and allocate time, e.g. during the monthly meetings.

I'm being grumpy, but I think opening new PRs will mostly scatter discussions, making it even more difficult to catch up, and it may add more work for everyone. But OK, let's try it for the SLEPs. I don't think we should be doing that for the main repo, though.

adrinjalali commented 4 years ago

Seems we're mostly happy with the current state of this SLEP. I'll merge to have it on the website, then wait a bit for more feedback, and hopefully soon call a vote on it :)

NicolasHug commented 4 years ago

This PR was merged with only one approval (mine), and with an unaddressed comment from @amueller https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r373808675. How and when will this comment be addressed? I don't believe it's less worthy than other comments.

adrinjalali commented 4 years ago
  1. Forgetting about that is one of the reasons I'd like to merge SLEPs much faster, to keep those discussions contained in their own issue/PR.
  2. This is now merged, not accepted; we can continue the discussion in separate issues. I'll open an issue with that question.

NicolasHug commented 4 years ago

Thanks.

Just a tip, and it might seem obvious to some, but I know that it's not always the case (from having discussed it with other devs): looking at the diff (https://github.com/scikit-learn/enhancement_proposals/pull/17/files) is a great way to see which comments are still unaddressed, compared to the main conversation page. There might still be comments in the conversation page, but at least those in the diff are hard to miss.