Ping @GaelVaroquaux @amueller @kmike
added this to my todo priority queue
A total skim tells me that it doesn't say how the feature contributes. I feel like for polynomial features, for example, we could do better. I guess that would be part of `describe_features`, so I'd like to include that in the SEP.
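For concreteness, a rough sketch of the kind of human-readable description `describe_features` might produce for `PolynomialFeatures`. The method itself is hypothetical; the sketch derives the strings from the `powers_` attribute that `PolynomialFeatures` already exposes, and the input names are made up:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(degree=2).fit(X)

# powers_[j, i] is the exponent of input feature i in output feature j;
# a describe_features-style method could render these as readable strings.
input_names = ["age", "income"]
for powers in poly.powers_:
    terms = [name if p == 1 else f"{name}^{p}"
             for name, p in zip(input_names, powers) if p > 0]
    print(" * ".join(terms) or "1")
```

which prints `1`, `age`, `income`, `age^2`, `age * income`, `income^2` for the degree-2 expansion.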
Can we create some use-cases? I think my main use-case is labeling the coefficients of a classifier at the end of a pipeline (or its feature importances). `get_feature_dependence` does not solve that problem.
You gave implementation examples, but no use-case examples.
What would it look like to compress a dataset from a pipeline? Say we have `make_pipeline(SomeFeatureSelection(), LinearSVC(penalty="l1"))`.
Do only transformers have `get_feature_dependence`?
So, main comment: write code that uses this. I think one use case is getting human-readable string names; the other is knowing which features actually influence the output. If you have more, feel free to add.
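To make the coefficient-labeling use case concrete, here is a sketch done by hand for a single selection step; `SelectKBest` stands in for `SomeFeatureSelection` (note `LinearSVC` needs `dual=False` with the `"l1"` penalty), and this per-step bookkeeping is exactly what the proposal would need to generalise across arbitrary pipelines:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
names = np.array(["sepal length", "sepal width",
                  "petal length", "petal width"])

pipe = make_pipeline(SelectKBest(k=2), LinearSVC(penalty="l1", dual=False))
pipe.fit(X, y)

# The selector knows which inputs survived; use its mask to label the
# classifier's coefficients with the original feature names.
mask = pipe.named_steps["selectkbest"].get_support()
for cls, row in zip(pipe.classes_, pipe.named_steps["linearsvc"].coef_):
    print(cls, dict(zip(names[mask], row)))
```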
> A total skim tells me that it doesn't say how the feature contributes.
No, it only says *that* the feature contributes. Saying *how* it contributes is obviously a lot more complicated when the transformation is non-linear. Where the input is just an array of features, you can maybe assess contribution by throwing random data at it, but getting an explicit mapping between input and output features for each transformer seems a straightforward, consistent way to inspect this roughly.
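One way to picture that "explicit mapping": if a hypothetical `get_feature_dependence` returned a boolean matrix with entry (i, j) true when output feature j depends on input feature i, then composing steps of a pipeline is just a matrix product over the boolean semiring. A toy sketch, with the matrices written out by hand rather than produced by any existing API:

```python
import numpy as np

# A selector keeping features 0 and 2 of 3 inputs: 3 inputs x 2 outputs.
select = np.array([[1, 0],
                   [0, 0],
                   [0, 1]], dtype=bool)

# Degree-2 polynomial expansion of the 2 survivors a, b:
# outputs are (1, a, b, a^2, a*b, b^2), i.e. 2 inputs x 6 outputs.
poly = np.array([[0, 1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1, 1]], dtype=bool)

# Compose: which of the 3 original inputs each of the 6 final outputs uses.
pipeline_dep = (select.astype(int) @ poly.astype(int)) > 0
print(pipeline_dep.astype(int))
```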
Given my realisation that a `SelectFeaturesByName` meta-transformer is only going to work with feature names being passed alongside the data (rather than through a separate `transform_feature_names` function), I am less certain that being able to get feature descriptions only for selected features is necessary. Nonetheless, I will endeavour to add some usage examples.
> Given my realisation that a `SelectFeaturesByName` meta-transformer is only going to work with feature names being passed alongside the data (rather than through a separate `transform_feature_names` function)
I'm not sure I follow, but that might be because I didn't read my last 2000 github notifications.
Haha :) the relevant comment is https://github.com/scikit-learn/scikit-learn/issues/6425#issuecomment-276575652, but don't rush
I think I will draft an example performing model compression with `make_pipeline(CountVectorizer(), ..., LogisticRegression(penalty='l1'))`, where the middle steps could include any of `SelectKBest`, `SparsePCA`, `PolynomialFeatures`. The first step could equally be a `DictVectorizer` or a union of `CountVectorizer`s, meaning that we can eliminate entire feature extraction processes by this method.
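A hand-rolled version of what that compression example might look like, to show the bookkeeping involved. The toy corpus, labels, and `C` value are placeholders, `get_feature_names_out` is the current spelling of the vectorizer's name accessor, and the point is that all of this manual tracing is what the proposal would automate:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["cheap pills now", "meeting at noon", "cheap meds now", "lunch meeting"]
y = [1, 0, 1, 0]

pipe = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=4),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
pipe.fit(docs, y)

# Trace non-zero coefficients back through the selector to the vocabulary;
# terms that fall out could be dropped from feature extraction entirely.
vocab = pipe.named_steps["countvectorizer"].get_feature_names_out()
support = pipe.named_steps["selectkbest"].get_support()
nonzero = pipe.named_steps["logisticregression"].coef_.ravel() != 0
print(vocab[support][nonzero])
```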
I need to think through your use-case but I think we should be able to support this. [And I might know next week if I get a 2yr grant to work on pandas integration and feature names ;)]
Also, it's 3134 notifications and it makes me sad :-/
Need a hug?
I'm good but thanks :)
This is, I suppose, a WIP. But I'd like hints for what else needs to be done :)