Closed adrinjalali closed 4 years ago
My biggest concern right now is what the feature names should actually be, and I think my doubts are mostly around creating new feature names within pipelines and column transformer.
I don't like the pca__pca0 name because it's redundant and seems like it would be a common occurrence. But maybe I'm too worried about it?
Maybe explaining better the mechanism for feature names would make it clearer? I had posted some discussion in https://github.com/scikit-learn/scikit-learn/pull/12627
There I talked about three cases:
a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection.
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation to the columns, like PCA.
I guess a slight confusion I have right now is because there is another case, the ColumnTransformer, which also creates new feature names by first prepending the transformer name and then passing through the column names - that's different from pipelines, which don't prepend the estimator name.
My question right now is: do we always prepend the estimator name in ColumnTransformer, even if it generates things like pca__pca0 or standardscaler__sepallength?
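To make the redundancy concrete, here is a toy sketch of ColumnTransformer-style prefixing in plain Python (the prefix_names helper and the step layout are assumptions for illustration, not the actual scikit-learn API):

```python
# Hypothetical sketch of ColumnTransformer-style name generation:
# each (step_name, output_feature_names) pair gets the step name
# prepended with a "__" separator.
def prefix_names(steps):
    """steps: list of (transformer_name, output_feature_names)."""
    out = []
    for name, feats in steps:
        out.extend(f"{name}__{f}" for f in feats)
    return out

# A PCA step that invents its own names already carries "pca" in them,
# so prefixing doubles the information; a passthrough name like
# "sepallength" gets a prefix that adds little.
names = prefix_names([
    ("pca", ["pca0", "pca1"]),
    ("standardscaler", ["sepallength"]),
])
print(names)  # ['pca__pca0', 'pca__pca1', 'standardscaler__sepallength']
```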
regarding the redundancy of the feature names, we could
Not sure I really get the 2nd. The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?
I've been playing with dask and pandas on very large datasets with many inconsistencies. The experience has really reminded me that code that does not have very consistent behavior leads to nasty bugs.
The 1st is basically an if in the column transformer that checks if a column is used multiple times. And then what?
more like:
feature_names = []
for transformer in transformers:
    feature_names.extend(transformer.get_names())
if len(set(feature_names)) == len(feature_names) or not enforce_unique_names:
    return feature_names
else:
    return append_prefixes_to_feature_names(feature_names)
Not sure I really get the 2nd.
something along the lines of:
feature_names = {}
for transformer in transformers:
    feature_names[transformer] = transformer.get_names()
if fname_transformer is not None:
    return fname_transformer(feature_names)
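A runnable toy version of this second proposal, where the user supplies a callable that turns the collected names into the final output names (get_names and fname_transformer come from the discussion above; everything else here is an assumption for illustration):

```python
def collect_names(transformers, fname_transformer=None):
    """transformers: list of (step_name, get_names_callable) pairs."""
    # Gather each transformer's raw output names, keyed by step name.
    feature_names = {name: get_names() for name, get_names in transformers}
    if fname_transformer is not None:
        # User-supplied policy decides the final names.
        return fname_transformer(feature_names)
    # Default: flatten in step order, no prefixing.
    return [f for names in feature_names.values() for f in names]

steps = [("pca", lambda: ["pca0", "pca1"]),
         ("scaler", lambda: ["sepallength"])]

# A prefixing policy supplied by the user:
prefixed = collect_names(
    steps,
    fname_transformer=lambda d: [f"{t}__{f}" for t, fs in d.items() for f in fs],
)
print(prefixed)  # ['pca__pca0', 'pca__pca1', 'scaler__sepallength']
```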
@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?
@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?
Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?
I don't like 1) because it means that a feature name of an existing feature can change when I add additional features. That's quite counter-intuitive.
Though I guess you could argue the same is true for make_column_transformer names?
@GaelVaroquaux I'm not sure I understand what your comment about inconsistent behavior is about. Was that re @adrinjalali's suggestion of sometimes not appending the name?
Yes.
No, you can not, as the comment suggests: this proposal is about creating feature names, not doing "pandas in, pandas out". StandardScaler will therefore create a numpy array as output, so ColumnTransformer can not use column names.
Point taken. Thank you.
I'm a bit annoyed at the duplication of the standard scaler, but I feel it will be hard to make this work better without blowing up the scope of the proposal, and it highlights a shortcoming of the current design; the shortcoming is unrelated to the feature name proposal, though, it's "only" related to dealing with pandas data frames.
Are there other functions that would be sensible here apart from lambda t_name, f_name: t_name + f_name and lambda t_name, f_name: f_name?
Those are the most reasonable options I can think of.
There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?
Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names? It would be consistent throughout the package, so @GaelVaroquaux's concern would be addressed I guess, and I suppose it would also fix your concerns with the verbose feature names @amueller? Of course by default they're pretty verbose, but by passing an appropriate value to the parameter of some of the transformers/pipelines, the generated feature names would make sense?
@adrinjalali The second one doesn't really make sense the way you wrote it, I think. If you want to replicate the prefix behavior, fname_transformer also needs to get the transformer name at least. I guess that's what you meant?
Yes, exactly; we can even get it from self if we add a Mixin? Shouldn't be too much change.
Should we then have a parameter like output_feature_name_policy or something, which accepts a few values and based on that we generate the output feature names?
parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).
parameter to what? each estimator? just the ColumnTransformer? global configuration (errrr probably not if column names will be used in actual behavior).
A parameter used by feature_names_out(). I was thinking of having it in each estimator; it'd be easier for the user to tune the parameters. Otherwise, if we have it only for Pipeline and ColumnTransformer, then we'd need to provide a way for the user to specify maybe an {estimator: policy} dict, which doesn't seem easier from the user side to me.
What we can do is to expose a function set_ofnames_policy() in the Mixin, so that it can easily be discovered by the user. Then we can start by adding the param to estimators' __init__, starting with the ones which would need it most frequently. But it doesn't have to be in all of them to begin with.
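A rough sketch of what such a discoverable policy setter could look like (set_ofnames_policy is the name proposed above; the mixin body, the policy values, and the helper names are all hypothetical, not existing scikit-learn API):

```python
class FeatureNamesPolicyMixin:
    """Hypothetical mixin holding an output-feature-name policy."""
    _ofnames_policy = "prefix"  # assumed default behavior

    def set_ofnames_policy(self, policy):
        # e.g. "prefix" -> 'pca__pca0', "plain" -> 'pca0'
        if policy not in ("prefix", "plain"):
            raise ValueError(f"unknown policy: {policy!r}")
        self._ofnames_policy = policy
        return self  # allow chaining, in the spirit of set_params

    def _make_name(self, step_name, feature_name):
        if self._ofnames_policy == "prefix":
            return f"{step_name}__{feature_name}"
        return feature_name

class DummyTransformer(FeatureNamesPolicyMixin):
    pass

t = DummyTransformer().set_ofnames_policy("plain")
print(t._make_name("pca", "pca0"))                  # 'pca0'
print(DummyTransformer()._make_name("pca", "pca0"))  # 'pca__pca0'
```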
ok so you mean adding a parameter to the constructor.
I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.
I don't understand your argument about having to pass a dict. It's a constructor argument for a single estimator, right? So I know exactly which estimator I'm applying it to.
If we accept the extra parameter to the constructor, then we don't need the dict. The dict was the alternative: if we reject the proposal to have the constructor argument, and want to have the policy argument only available to a few objects such as ColumnTransformer, Pipeline, etc, then we need the dict for a more granular control over the policies of each underlying object. To be clear, this is not what I'm proposing and I personally dislike it.
A random thought: with this approach the following will have the same feature names
Three of the original features were passed into the PCA.
pipe = make_pipeline(StandardScaler(), SelectKBest(k=3), PCA(n_components=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')
All the original features are passed into PCA, and two of the pca features are selected.
pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2),
                     LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].input_features_
# array(['pca0', 'pca1'], dtype='<U4')
I don't think the fact that they have the same feature names is an issue. I think the issue is that information is lost in the first case.
A random thought: with this approach the following will have the same feature names
I think that's fine, the combination of feature names with the new pipeline visualizer should be enough for people to follow what happens inside.
@adrinjalali but in the first case there's no way for you to know what the selection did.
@adrinjalali but in the first case there's no way for you to know what the selection did.
pipe[1].feature_names_out_ will be the original features selected by that transformer. Is that not enough?
@adrinjalali That depends a lot on what we consider enough, I think. It means that if the user plots the feature names in the "standard" way by looking at what goes into LogisticRegression, they'll miss part of the story.
But to provide the full information we would need to use much more elaborate/verbose feature names (like pca1(x_1, x_2)). If there are many features this will quickly get out of hand, though...
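To illustrate how quickly such verbose names get out of hand, here is a toy name builder (pure Python; the function, its truncation rule, and the name format are assumptions for illustration, not a proposed API):

```python
def verbose_name(new_name, input_names, max_inputs=3):
    """Build names like 'pca1(x_1, x_2)', truncating long input lists."""
    shown = input_names[:max_inputs]
    suffix = ", ..." if len(input_names) > max_inputs else ""
    return f"{new_name}({', '.join(shown)}{suffix})"

print(verbose_name("pca1", ["x_1", "x_2"]))
# 'pca1(x_1, x_2)' -- readable with two inputs
print(verbose_name("pca0", [f"x_{i}" for i in range(100)]))
# 'pca0(x_0, x_1, x_2, ...)' -- with 100 inputs the name tells you little
```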
@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)? That would allow tracking "everything" but might become very verbose.
I have a PR for ColumnTransformer, https://github.com/scikit-learn/scikit-learn/pull/11639; it was needed to make inverse_transform work.
I rewrote this SLEP to focus on how the names are generated. I have kept them simple, i.e. the user will not see a very verbose feature name at the end, since I don't think it's feasible to have them as soon as there's a PCA in the pipeline. At least the datasets I have dealt with always have too many features for that to work.
Can you maybe reread https://github.com/scikit-learn/enhancement_proposals/pull/17#issuecomment-480413603 ?
The SLEP should answer that question and/or address that concern. If we decide to have pca__pca0 and standardscaler__sepallength as feature names that's OK, but it's something that will be produced by the current proposal and that's not mentioned at all.
The SLEP should answer that question and/or address that concern. If we decide to have pca__pca0 and standardscaler__sepallength as feature names that's OK, but it's something that will be produced by the current proposal and that's not mentioned at all.
Now the SLEP says:
This is the default behavior, and it can be tuned by constructor parameters if
the meta estimator allows it. For instance, a ``prefix=False`` may indicate
that a ``ColumnTransformer`` should not prefix the generated feature names with
the name of the step.
WDYT?
I also did read through your original PR's description and borrowed quite a bit from there now.
@thomasjpfan you had a PR where you provided full input-output correspondences, right (or was it a comment somewhere)?
At one point I proposed a (sparse) matrix provided by each transformer to indicate which inputs corresponded to each output... Another approach would allow us to interrogate precisely how each feature was constructed, but that may be a lot of work for little gain.
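The input/output correspondence idea could be sketched as a boolean indicator matrix per transformer (dense nested lists here for illustration; a real implementation would presumably use a scipy sparse matrix, and the helper below is hypothetical):

```python
# Rows = output features, columns = input features.
# A SelectKBest keeping inputs 0 and 2 out of 4:
select_corr = [
    [1, 0, 0, 0],  # output 0 <- input 0
    [0, 0, 1, 0],  # output 1 <- input 2
]
# A 2-component PCA over 4 inputs: every output depends on every input.
pca_corr = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

def inputs_used(corr, output_idx, input_names):
    """Recover which named inputs fed a given output feature."""
    return [n for n, used in zip(input_names, corr[output_idx]) if used]

names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
print(inputs_used(select_corr, 1, names))  # ['petal_length']
print(inputs_used(pca_corr, 0, names))     # all four input names
```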
@amueller sorry, hadn't pushed the changes before your latest review.
Actually, I think there are three cases for non-meta transformers: 1) Each output feature corresponds to a single input feature. Then just use that input feature. 2) Each output feature is a simple function of a small number of input features. Then the output feature is some representation of the relationship. 3) Each output feature depends on all (or most of) the input features: create new names.
There is some slight tension between 1) and 2), I think. For example, if you squared every entry you'd keep the same feature names, but the entry corresponding to a squared feature in PolynomialFeatures is called x**2. I think 2) and 3) are clear-cut and we are struggling with 1) a bit. Adding a parameter to all of them would be a possible solution, but I'm not sure if it's elegant.
The main remaining question for me is whether we want to add prefix_feature_names to other transformers as well. It seems verbose but I don't see another way to get log(x0) if that's something that you want.
why does it have to be in all transformers? I thought it makes sense to start with the ones we're not sure what the behavior should be, and whether users want different behaviors or not.
1) Each output feature corresponds to a single input feature. Then just use that input feature. 2) Each output feature is a simple function of a small number of input features. Then the output feature is some representation of the relationship. 3) Each output feature depends on all (or most of) the input features: create new names.
But this misses the log(x0) case for a log transformer, for instance; and I think it'd be nice to have the option to have those verbose feature names.
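For case 2, a name-carrying log transformer could wrap each input name the way the discussion suggests. A minimal toy sketch (the class, its fit/transform shape, and the get_feature_names signature are all assumptions, not the actual FunctionTransformer API):

```python
import math

class LogTransformer:
    """Toy elementwise log transform that carries verbose feature names."""
    def fit(self, X):
        self.n_features_ = len(X[0])
        return self

    def transform(self, X):
        return [[math.log(v) for v in row] for row in X]

    def get_feature_names(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_)]
        # Case 2: each output is a simple function of one input,
        # so the name encodes the relationship.
        return [f"log({name})" for name in input_features]

t = LogTransformer().fit([[1.0, 2.0]])
print(t.get_feature_names())                   # ['log(x0)', 'log(x1)']
print(t.get_feature_names(["age", "income"]))  # ['log(age)', 'log(income)']
```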
The discussion has become pretty long, and I'm getting lost. Could we please merge this PR, and have the other required changes in different PRs, maybe by you @amueller ? ;)
Please update the title to something less generic than "Feature Names" since other SLEPs are dealing with feature names. The PR title is a great candidate.
Also Please add an abstract section at the top like the previous SLEPs or the template
(Sorry if this is a duplicate, I'm certain to have made this comment yesterday but I don't see it anymore)
Also Please add an abstract section at the top like the previous SLEPs or the template
(Sorry if this is a duplicate, I'm certain to have made this comment yesterday but I don't see it anymore)
I'd rather have that in a separate PR, by you maybe? ;) I can't think of any good text for that.
OK, I will catch up with all the discussion and then merge such that we can open subsequent PR on some specific points of the SLEP.
Why? We can address the remaining points here, if any. The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many
The SLEP is reasonably small (and in good shape now) and I'd rather be reviewing one PR than many.
I have to go across 133 messages and unstack. For me, this is exactly the point raised by @adrinjalali there https://github.com/scikit-learn/enhancement_proposals/pull/17#discussion_r377676740 which makes it difficult to catch up with a SLEP that I have not looked at in a while.
In the case of this PR there's only one discussion that hasn't been resolved in the diff.
I have to go across 133 messages and unstack
I agree this is a problem. Personally, I try to provide timely feedback when I'm reviewing, and also try to address comments fast when I'm the author. It doesn't always work out though. Ideally, we would prioritize PRs better and allocate time, e.g. during the monthly meetings.
I'm being grumpy but I think opening new PRs will mostly scatter discussions, making it even more difficult to catch up. And it may add more work for everyone. But OK, let's try for the SLEPs. I don't think we should be doing that for main repo though.
Seems we're mostly happy with the current state of this SLEP. I'll merge to have it on the website, then wait a bit for more feedback, and hopefully soon call a vote on it :)
This PR was merged with only one approval (mine), and with an unaddressed comment from @amueller https://github.com/scikit-learn/enhancement_proposals/pull/17/files#r373808675. How and when will this comment be addressed? I don't believe it's less worthy than other comments.
Thanks.
Just a tip, it might seem obvious to some but I know that it's not always the case (for having discussed with other devs): looking at the diff (https://github.com/scikit-learn/enhancement_proposals/pull/17/files) is a great way to see which comments are still unaddressed, compared to the main conversation page. There might still be comments in the conversation page but at least those in the diff are hard to miss.
Since it seems the discussion in https://github.com/scikit-learn/scikit-learn/pull/13307 requires a SLEP to decide on the style of feature names, and we talked about lowering the bar for submitting a SLEP, I took the liberty to open one for us to discuss the styles and hopefully soon come to a consensus/conclusion.