Ping @GaelVaroquaux @amueller @kmike
added this to my todo priority queue
A total skim tells me that it doesn't say how the feature contributes. I feel like for polynomial features, for example, we could do better. I guess that would be part of `describe_features`, so I'd like to include that in the SEP.
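For concreteness, a rough sketch of the kind of human-readable description `describe_features` might produce for `PolynomialFeatures`. The method itself is hypothetical; the sketch derives the strings from the `powers_` attribute that `PolynomialFeatures` already exposes, and the input names are made up:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(degree=2).fit(X)

# powers_[j, i] is the exponent of input feature i in output feature j;
# a describe_features-style method could render these as readable strings.
input_names = ["age", "income"]
for powers in poly.powers_:
    terms = [name if p == 1 else f"{name}^{p}"
             for name, p in zip(input_names, powers) if p > 0]
    print(" * ".join(terms) or "1")
```

which prints `1`, `age`, `income`, `age^2`, `age * income`, `income^2` for the degree-2 expansion.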
Can we create some use-cases? I think my main use-case is labeling the coefficients of a classifier at the end of a pipeline (or its feature importances). `get_feature_dependence` does not solve that problem.
You gave implementation examples, but no use-case examples.
What would it look like to compress a dataset from a pipeline? Say we have `make_pipeline(SomeFeatureSelection(), LinearSVC(penalty="l1"))`.
Do only transformers have `get_feature_dependence`?
So, main comment: write code that uses this. I think one use case is getting human-readable string names; the other is knowing which features actually influence the output. If you have more, feel free to add.
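To make the coefficient-labeling use case concrete, here is a sketch done by hand for a single selection step; `SelectKBest` stands in for `SomeFeatureSelection` (note `LinearSVC` needs `dual=False` with the `"l1"` penalty), and this per-step bookkeeping is exactly what the proposal would need to generalise across arbitrary pipelines:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
names = np.array(["sepal length", "sepal width",
                  "petal length", "petal width"])

pipe = make_pipeline(SelectKBest(k=2), LinearSVC(penalty="l1", dual=False))
pipe.fit(X, y)

# The selector knows which inputs survived; use its mask to label the
# classifier's coefficients with the original feature names.
mask = pipe.named_steps["selectkbest"].get_support()
for cls, row in zip(pipe.classes_, pipe.named_steps["linearsvc"].coef_):
    print(cls, dict(zip(names[mask], row)))
```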
> A total skim tells me that it doesn't say how the feature contributes.
No, it only says *that* the feature contributes. Saying *how* it contributes is obviously a lot more complicated when the transformation is non-linear. Where the input is just an array of features, you can maybe assess contribution by throwing random data at it, but getting an explicit mapping between input and output features for each transformer seems a straightforward, consistent way to inspect this roughly.
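One way to picture that "explicit mapping": if a hypothetical `get_feature_dependence` returned a boolean matrix with entry (i, j) true when output feature j depends on input feature i, then composing steps of a pipeline is just a matrix product over the boolean semiring. A toy sketch, with the matrices written out by hand rather than produced by any existing API:

```python
import numpy as np

# A selector keeping features 0 and 2 of 3 inputs: 3 inputs x 2 outputs.
select = np.array([[1, 0],
                   [0, 0],
                   [0, 1]], dtype=bool)

# Degree-2 polynomial expansion of the 2 survivors a, b:
# outputs are (1, a, b, a^2, a*b, b^2), i.e. 2 inputs x 6 outputs.
poly = np.array([[0, 1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1, 1]], dtype=bool)

# Compose: which of the 3 original inputs each of the 6 final outputs uses.
pipeline_dep = (select.astype(int) @ poly.astype(int)) > 0
print(pipeline_dep.astype(int))
```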
Given my realisation that a `SelectFeaturesByName` meta-transformer is only going to work with feature names being passed alongside the data (rather than through a separate `transform_feature_names` function), I am less certain that being able to get feature descriptions only for selected features is necessary. Nonetheless, I will endeavour to add some usage examples.
> Given my realisation that a `SelectFeaturesByName` meta-transformer is only going to work with feature names being passed alongside the data (rather than through a separate `transform_feature_names` function)
I'm not sure I follow, but that might be because I didn't read my last 2000 github notifications.
Haha :) the relevant comment is https://github.com/scikit-learn/scikit-learn/issues/6425#issuecomment-276575652, but don't rush
I think I will draft an example performing model compression with `make_pipeline(CountVectorizer(), ..., LogisticRegression(penalty='l1'))`, where the middle steps could include any of `SelectKBest`, `SparsePCA`, `PolynomialFeatures`. The first step could equally be a `DictVectorizer` or a union of `CountVectorizer`s, meaning that we can eliminate entire feature extraction processes by this method.
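A hand-rolled version of what that compression example might look like, to show the bookkeeping involved. The toy corpus, labels, and `C` value are placeholders, `get_feature_names_out` is the current spelling of the vectorizer's name accessor, and the point is that all of this manual tracing is what the proposal would automate:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cheap pills now", "meeting at noon", "cheap meds now", "lunch meeting"]
y = [1, 0, 1, 0]

pipe = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=4),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
pipe.fit(docs, y)

# Trace non-zero coefficients back through the selector to the vocabulary;
# terms that fall out could be dropped from feature extraction entirely.
vocab = pipe.named_steps["countvectorizer"].get_feature_names_out()
support = pipe.named_steps["selectkbest"].get_support()
nonzero = pipe.named_steps["logisticregression"].coef_.ravel() != 0
print(vocab[support][nonzero])
```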
I need to think through your use-case but I think we should be able to support this. [And I might know next week if I get a 2yr grant to work on pandas integration and feature names ;)]
Also, it's 3134 notifications and it makes me sad :-/
Need a hug?
I'm good but thanks :)
This is, I suppose, a WIP. But I'd like hints for what else needs to be done :)