scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Reject SLEP013 #36

Closed: adrinjalali closed this 1 year ago

adrinjalali commented 4 years ago

This PR proposes to "Accept" SLEP013.

jnothman commented 4 years ago

+1

One consideration not present here is whether there are cases where an existing attribute should be deprecated due to obsolescence... but I suppose we will handle them on a one-by-one basis.

Under motivation, I'd like to see specific reference to: inspection of feature structures and downstream models resulting from a union (FeatureUnion and ColumnTransformer); and name generation in get_feature_names.
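For concreteness, a sketch of the union case: today, the only general way to learn each sub-transformer's output width is to actually transform data, while the proposed n_features_out_ would expose it directly after fit. The ColumnTransformer and columns below are illustrative.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),
    ("cat", OneHotEncoder(), ["sex"]),
]).fit(X)

# Without the proposed attribute, inspecting each block's output width
# requires transforming data:
for name, trans, cols in ct.transformers_:
    print(name, trans.transform(X[cols]).shape[1])
# With it, one could read trans.n_features_out_ right after fit instead.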

amueller commented 4 years ago

+1 on accepting the SLEP. I agree with @jnothman but I'm unsure about the process of changing the SLEP during vote.

qinhanmin2014 commented 4 years ago

+1

GaelVaroquaux commented 4 years ago

I am fine with the principle of the SLEP.

However, the motivation is light: "Knowing the number of features that a transformer outputs is useful for inspection purposes". A tiny example would help (it doesn't have to be a code example; just a few words of plain English to give the reader the intent).

Thanks!

adrinjalali commented 4 years ago

I find it appropriate to apply final touches while the acceptance PR is under discussion. I added an example here.

One consideration not present here is whether there are cases where an existing attribute should be deprecated due to obsolescence... but I suppose we will handle them on a one-by-one basis.

I don't think we have to include that in the SLEP. We can simply deprecate them case by case if deemed appropriate, as you say.

GaelVaroquaux commented 4 years ago

Thanks for the example, Adrin. However, this is not what I was looking for (sorry, I probably formulated my comment poorly). It was more an example of the purpose of the attribute for an end-user: what the attribute will be used for. Once again, it can be plain English rather than code.

I think that, for each element of the public API, I'd like to have such a statement of purpose. It serves to guide the design: it makes sure that we are clearly defining the usage of the API elements, and that everything is there for a good reason.

Thanks!

adrinjalali commented 4 years ago

@GaelVaroquaux would you please maybe give me a suggestion? I guess I'm struggling to come up with a statement of purpose :P

GaelVaroquaux commented 4 years ago

The reason I am asking is that I have no suggestion: I don't know what I would use this feature for.

thomasjpfan commented 4 years ago

@GaelVaroquaux How does the following sound:

The number of features that a transformer outputs is useful for inspection after fitting. For example, it enables users to see the number of output features without calling transform. Together with n_features_in_, users building pipelines would be able to verify their pipeline without needing to pass data through it.

Edit: I meant n_features_in_
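A sketch of the kind of check this would enable; the n_features_out_ access is hypothetical since the attribute is only proposed here, while n_features_in_ already exists on fitted estimators:

def check_pipeline_widths(pipe):
    """Check that adjacent fitted steps agree on feature counts,
    without passing any data through the pipeline.

    Hypothetical: assumes every fitted step exposes the proposed
    n_features_out_ in addition to the existing n_features_in_.
    """
    steps = [est for _, est in pipe.steps]
    for left, right in zip(steps, steps[1:]):
        if left.n_features_out_ != right.n_features_in_:
            raise ValueError(
                f"{type(left).__name__} outputs {left.n_features_out_} "
                f"features, but {type(right).__name__} expects "
                f"{right.n_features_in_}."
            )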

GaelVaroquaux commented 4 years ago

@thomasjpfan : but are we actually going to build pipelines that can verify their consistency? It seems to me that it requires much more engineering than n_features_out.

I am asking these questions because I naively do not know why we are adding "n_features_out". I'd like to know what problem it solves.

adrinjalali commented 4 years ago

Do you think it'd be okay to have n_features_in_, feature_names_in_, feature_names_out_, but not n_features_out_?

GaelVaroquaux commented 4 years ago

Do you think it'd be okay to have n_features_in_, feature_names_in_, feature_names_out_, but not n_features_out_?

Yes. I think that minimalistic APIs are best, for multiple reasons.

One reason is that it facilitates the implementation of a correct estimator object (many libraries do things that look roughly like scikit-learn estimators, but are not, which means that they break code that expects the contract to be respected).

Another reason is that it helps with understanding an object, for instance via tab completion.

amueller commented 4 years ago

The information in feature_names_out_ contains n_features_out_ via len, so clearly there's nothing completely new that wouldn't be possible without it. But that requires always having feature_names_out_, which could be problematic for sparse datasets. Adding n_features_out_ is a very simple addition that works in nearly all cases (I think FunctionTransformer is the only exception?) and makes sklearn transformers more transparent. I would argue that the tab completion argument is an argument in favor of n_features_out_: there is a clear and discoverable way to know what shape of array an estimator will produce.

The same argument of using len also applies to n_features_in_, btw. I think it's conceptually a bit odd to have to compute the length of the names to get the number of features produced, though.

Re ease of implementation: a fallback feature_names_out_ can be computed from n_features_out_, so I assume we might have that as a property. That would mean that people who implement n_features_out_ automatically get feature_names_out_, and so it will be easier for them to adhere to the API. Otherwise they are forced to create feature names themselves, which is a bit more tricky than setting the number of output features.

You could achieve the same with a helper of course, so that the third-party implementations include

self.feature_names_out_ = _generate_feature_names(n_features_out)

instead of

self.n_features_out_ = n_features_out

which would work together with

class TransformerMixin:
    @property
    def feature_names_out_(self):
        # Derive default names from the output count set during fit.
        return _generate_feature_names(self.n_features_out_)

But I don't see the helper function approach to be simpler.
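For concreteness, the helper that both variants rely on could be as simple as the following sketch (_generate_feature_names is an illustrative name, not an existing scikit-learn function):

def _generate_feature_names(n_features_out):
    # Generic positional names in the spirit of scikit-learn's
    # defaults elsewhere: "x0", "x1", ...
    return [f"x{i}" for i in range(n_features_out)]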

GaelVaroquaux commented 4 years ago

Fair enough. But I still do not understand what it will be used for.

This is a legitimate question: what is our purpose here? What will people use n_features_out for?

I think that this question should always drive API design.

amueller commented 4 years ago

I agree that's a legitimate question.

My answer would be "know the dimensionality of the output of a transformation without having to call transform". I guess you are thinking in terms of programmatic use, while I was thinking in terms of interactive use.

Maybe @jnothman wants to give his opinion, I think the original idea came from him ;)

jnothman commented 4 years ago

As I said, it's valuable in a FeatureUnion or ColumnTransformer, at least as long as we don't have another way of determining where each downstream feature comes from. It could therefore help in model compression, etc. (although a Boolean matrix relating inputs to outputs would be more informative for that). It also helps in generating generic feature names.

amueller commented 4 years ago

Hm, if we think about pandas-in, pandas-out, maybe this attribute is not as important any more, and maybe focusing on that discussion is a better way to move forward?

jnothman commented 4 years ago

Re

It could therefore help in model compression etc. (although a Boolean matrix relating inputs to outputs would be more informative for that).

I forgot that I actually drafted this as a SLEP aaaaages ago. https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep003/proposal.html

jnothman commented 4 years ago

What do others think of SLEP003 instead of n_features_out_? It also records the number of output features, but along with it the mapping between input and output features. This can be helpful when features are dropped (https://github.com/scikit-learn/scikit-learn/issues/16426), makes it easier for the user to trace features through FeatureUnion/ColumnTransformer (without having an idea of feature "name", only one of feature identity), and generalises -- with a unified format -- feature dependency information contained in the semantics of diverse existing attributes including SparsePCA.n_components_, OneHotEncoder.categories_ and SelectorMixin.get_support(). It can enable model compression as well as tracing for inferential tasks.
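To make the Boolean-matrix idea concrete, here is a sketch that builds such an input-to-output mapping for a feature selector; the matrix format is illustrative, and SLEP003 specifies the actual proposal:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
sel = SelectKBest(f_classif, k=2).fit(X, y)

# get_support() already encodes which inputs survive selection; a
# general (n_features_in, n_features_out) Boolean matrix would unify
# this kind of information across transformers:
n_out = sel.get_support().sum()
mapping = np.zeros((X.shape[1], n_out), dtype=bool)
mapping[np.flatnonzero(sel.get_support()), np.arange(n_out)] = True
print(mapping)  # rows: input features, columns: output features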

amueller commented 4 years ago

Wow, I also forgot that was a SLEP. I think it's cool; I thought it might be a bit heavy-handed, but this is something that has definitely been asked for in the past. Mostly, SLEP003 is finer-grained information than feature names, but not entirely: you can't encode transformations of single features, right? So X^2 and X would look the same.

jnothman commented 4 years ago

Mostly SLEP003 is finer-grained information than feature names, but not entirely: you can't encode transformations of single features, right? So X^2 and X would look the same.

It's more machine-readable and unambiguous than feature names, perhaps... but yes, it's not describing the full function composition to derive each feature (which would call for a full symbolic calculus like sympy, no?).

amueller commented 4 years ago

Yes and we probably don't want that. It would also not allow for distinguishing the categories in OneHotEncoder btw. So I agree it would be good to have, but it's a use-case that's somewhat orthogonal to my "figure out what coefficients mean" requirement for feature names.

amueller commented 1 year ago

Should this SLEP be rejected as well? I don't think it provides much value in its current form.