scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP010 n_features_in_ attribute #22

Closed NicolasHug closed 5 years ago

NicolasHug commented 5 years ago

See PR https://github.com/scikit-learn/scikit-learn/pull/13603

NicolasHug commented 5 years ago

We should probably also explain what happens to a pipeline if there's at least one estimator which does not provide the API,

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).
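A minimal sketch of that delegation, using a simplified stand-in class (illustrative only, not the actual scikit-learn Pipeline code):

class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    @property
    def n_features_in_(self):
        # The pipeline sees exactly the features its first step sees;
        # later steps may change dimensionality and are ignored here.
        return self.steps[0][1].n_features_in_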

adrinjalali commented 5 years ago

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

True. I was thinking of feature names API. That concern doesn't apply here.

amueller commented 5 years ago

Yeah this API will be quite simple tbh. I wonder if we should discuss n_features_out_ in this SLEP as well. Unfortunately I missed the meeting this morning. Who argued for the SLEP? @rth @ogrisel @jnothman thoughts on whether n_features_out_ should be included here?

NicolasHug commented 5 years ago

Maybe we should keep SLEPs simple when we can? (i.e. have a separate one for n_features_out_ if needed)

amueller commented 5 years ago

Happy to keep it simple. But if the discussion is only around backward compatibility of check_estimator then the SLEP will be a copy & paste.

NicolasHug commented 5 years ago

@scikit-learn/core-devs can we merge and vote please?

adrinjalali commented 5 years ago

I'm happy to have this merged and have a separate but very similar one for n_features_out_.

thomasjpfan commented 5 years ago

If we go down the path of warning, it should be shown to library developers and not users.

Edit: Oops, we can have check_estimator warn...

NicolasHug commented 5 years ago

I'm OK with raising a warning instead of an exception in the common checks.

I updated the SLEP accordingly.

Should we merge?

NicolasHug commented 5 years ago

@amueller @jnothman Are you happy with the new changes (https://github.com/scikit-learn/enhancement_proposals/pull/22#issuecomment-539024325) ? @adrinjalali too?

Let's merge?

NicolasHug commented 5 years ago

Thanks Joel for your feedback.

Andy and I discussed this, and I updated the SLEP to propose a single method _validate_data(X, y=None, reset=True, ...)

It seems that the only remaining discussion is https://github.com/scikit-learn/enhancement_proposals/pull/22#discussion_r335921604
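A rough sketch of the proposed semantics, with reset=True used from fit and reset=False from predict/transform (illustrative only; the real helper in the implementation PR does more, e.g. dtype and y handling):

import numpy as np

class EstimatorSketch:
    def _validate_data(self, X, y=None, reset=True):
        X = np.asarray(X)
        if reset:
            # fit time: record the number of features seen
            self.n_features_in_ = X.shape[1]
        elif X.shape[1] != self.n_features_in_:
            # predict/transform time: enforce consistency with fit
            raise ValueError(
                f"X has {X.shape[1]} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )
        return X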

NicolasHug commented 5 years ago

If we're all OK with making the method private and not recommending that libraries use it, let's merge, please. I'd like the implementation PR to be merged before we release.

jnothman commented 5 years ago

I don't think there is any great need to rush this to release.

If we are keeping these helper methods private then they should be regarded as an implementation detail here: the essence of the SLEP becomes "estimators should provide n_features_in_, and this will be required in check_estimator two versions after introduction. No public helpers will be provided to assist third party libraries with this."

If you think this can pass the vote and lead to quick merge of the implementation, then go ahead and merge this and bring it to vote.

Personally I had understood that a major goal of this work was to facilitate data consistency validation between fit and predict/transform, which extends to the column reordering issues. Perhaps that API would then require another SLEP, in which case this SLEP wouldn't provide any leverage towards fixing the column ordering bug for scikit-learn-compatible estimators.
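To make that "essence" concrete, the entire burden on a compliant estimator could be as small as this (illustrative sketch, not an official template):

class ThirdPartyEstimator:
    def fit(self, X, y=None):
        # The one attribute the SLEP requires: the number of
        # features seen during fit.
        self.n_features_in_ = X.shape[1]
        # ... actual fitting logic ...
        return self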

GaelVaroquaux commented 5 years ago

We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.

That's an idea that I like.
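A sketch of what such a transitional check might look like (hypothetical check name and message; the real checks live in sklearn/utils/estimator_checks.py):

import warnings
import numpy as np

def check_n_features_in(name, estimator):
    X = np.random.rand(20, 3)
    y = np.random.randint(0, 2, size=20)
    estimator.fit(X, y)
    if not hasattr(estimator, "n_features_in_"):
        # Warn for a couple of releases, then turn this into an error.
        warnings.warn(
            f"{name} does not set n_features_in_ in fit; this will be "
            "required in a future release.",
            FutureWarning,
        )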

NicolasHug commented 5 years ago

I don't think there is any great need to rush this to release.

It is indeed too late now to get anything useful merged before the release. But I'd still like to get this done soon. All the effort put into scikit-learn/scikit-learn#13603 is pretty much lost now because we've been taking so long to come up with a consensus.

Personally I had understood that a major goal of this work was to facilitate data consistency validation

Yes, this work will be useful for consistency checks, and also for feature names propagation. I still don't think that means we should make the method public. But if you disagree, please let it be known.

thomasjpfan commented 5 years ago

This SLEP has been cut down to say "Please define n_features_in_ to pass check_estimator in the future".

For most of this SLEP's timeline, I have understood the goal of this SLEP to be improving data consistency validation. We want to do this because it allows for data validation checks in our meta-estimators (mostly Pipeline and ColumnTransformer). To actually perform these checks, the estimator needs to remember information about the data used in fit. If we want to perform these checks on third-party estimators, we need those estimators to also remember information about the data. Here are the options discussed in this PR:

  1. This SLEP in its current state, which asks estimators to define n_features_in_. (The private methods are not recommended for use.)
  2. Create public methods that will help do some of the work. (I do not think this would be good, since these public methods are only for developers, and it would not be good to expose them to users.)
  3. Create public functions that will help do some of the work. The public functions will need to be passed self and will add attributes to the estimator.

I think there is a path for the public function to work:

# 0.23
def validate_data(est, check_n_features_in=True, **kwargs):
    ...
    est.n_features_in_ = ...

# 0.24
# When we move to FutureWarning for deprecations, we can use the
# developer-focused DeprecationWarning to signal this. (Also,
# check_estimator will warn.)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False, **kwargs):
    ...

# 0.25
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False,
                  check_other_things=False, **kwargs):
    ...

# 0.26 (check_feature_columns becomes True by default)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=True,
                  check_other_things=False, **kwargs):
    ...
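Under that plan, a third-party estimator would opt in through the one entry point, and new checks would arrive behind flags (hypothetical usage of the validate_data sketched above):

class ThirdPartyTransformer:
    def fit(self, X, y=None):
        # Enable whichever checks the installed scikit-learn supports.
        X = validate_data(self, X, check_n_features_in=True)
        # ... actual fitting logic ...
        return self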

Closing

I saw this SLEP as "laying out the foundation to provide data validation for third-party libraries". This means getting third-party libraries to call validate_data, which right now will do something really simple. As we add more checks to validate_data, third-party libraries will only need to add another flag to validate_data to enable the new feature.

At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use these features.

NicolasHug commented 5 years ago

@thomasjpfan I'm afraid your understanding of the SLEP is outdated. We have dropped the check_something parameters in favor of reset=True/False, which would replace all of these.

Also, I don't think the method vs function discussion is relevant anymore.*

The current point of friction is whether the method should be private or not. I'd like to hear Joel's input on that.

*EDIT (OK, I understand now that if we want to make it public it should rather be a function)

NicolasHug commented 5 years ago

At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use these features.

I agree on that.

@amueller, could you please write out your thoughts on why n_features_in_ will also help the feature_names SLEP?

amueller commented 5 years ago

I agree that we should try to move forward with this as it's blocking a lot of things, but we should also not rush it.

I think n_features_in_ and n_features_out_ are useful for model inspection on their own. They will help for feature_names because they allow us to create feature names. For example, any of the scalers can easily create feature names if it knows n_features_in_; anything that produces components can create feature names if it knows n_features_out_.
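For instance, two hypothetical helpers along those lines (n_features_out_ is assumed from the companion proposal; none of this is existing scikit-learn API):

def scaler_feature_names(est, input_features=None):
    # A scaler is one-to-one, so default names follow directly
    # from n_features_in_.
    if input_features is None:
        return [f"x{i}" for i in range(est.n_features_in_)]
    return list(input_features)

def component_feature_names(est, prefix="component"):
    # Anything that produces components can name its outputs
    # from n_features_out_ alone.
    return [f"{prefix}{i}" for i in range(est.n_features_out_)]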

There are conceptually two things that I want: a) add two required attributes, n_features_in_ and n_features_out_, to all transformers (where applicable).

b) refactor the validation API to allow column name consistency checking and feature name propagation.

I don't think we should make any helpers public at this stage, so b) would be an internal refactoring, which wouldn't require a SLEP. If we want to create a public function in utils, that would also probably not require a SLEP (though I also don't really see the benefit of that for now).

Honestly I was a bit skeptical of the need for a SLEP here overall, but if anything then we need a SLEP for a), the addition of required attributes. In that case, we should probably write a SLEP for both of them that just says "we'll require these attributes, they are helpful for model inspection".

Even if we end up deciding we want a public function to validate n_features_in_ I'm not sure why that would require a SLEP. We've certainly changed check_estimator frequently in the past without week-long discussions.

amueller commented 5 years ago

Btw, I'm also happy to accept the SLEP in its current state, which is basically saying "we require n_features_in_". I can write another one for n_features_out_ later... I hope that'll be quicker...

NicolasHug commented 5 years ago

Thanks for the feedback.

I updated the SLEP to only make it about the n_features_in_ attribute (2f37147). The _validate_data method is only a private helper that does nothing but set and validate the attribute.

I'll leave the n_features_out_ attribute to another SLEP, for simplicity.

@amueller @adrinjalali @jnothman @thomasjpfan

amueller commented 5 years ago

good to merge and vote from my side

adrinjalali commented 5 years ago

Nice, let's vote then \o/

NicolasHug commented 5 years ago

Thanks for the merge and the reviews!

@jnothman should I wait until after the release to call for a vote, or are you happy if we do it now?

GaelVaroquaux commented 4 years ago

I'm happy with the overall proposal.

A couple of points (that could be added to the SLEP, or to the PR implementing it):

- I would rather not set anything in __init__, because it is a change of behavior (things can be set in fit).
- How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

jnothman commented 4 years ago

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'd be keen not to put any additional burden on estimator developers... the question then is whether we require meta-estimators to be flexible to the case that n_features_in_ is / is not present in the base estimator?

NicolasHug commented 4 years ago

I would rather not set anything in __init__, because it is a change of behavior (things can be set in fit)

We don't set anything in __init__. There's only the case of the SparseCoder, where n_features_in_ is a property which is available right after instantiation, but the __init__ code is unchanged. We could artificially call check_is_fitted in the property, if that's what you mean.
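A simplified sketch of that situation (class name and shape convention assumed for illustration): the dictionary is passed to __init__, so the number of input features is known before fit without changing __init__ itself:

class SparseCoderLike:
    def __init__(self, dictionary):
        # dictionary: array of shape (n_components, n_features)
        self.dictionary = dictionary

    @property
    def n_features_in_(self):
        # Available right after instantiation, no fit required.
        return self.dictionary.shape[1]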

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute?

The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.
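A sketch of how that tag gating might work inside a common check (assuming the usual _get_tags() mechanism on estimators; the exact mechanics are in the linked PR):

import numpy as np
from sklearn.base import clone

def check_n_features_in(name, estimator_orig):
    tags = estimator_orig._get_tags()
    if tags.get("no_validation", False):
        return  # estimator legitimately skips input validation
    est = clone(estimator_orig)
    X = np.random.rand(20, 4)
    y = np.random.randint(0, 2, size=20)
    est.fit(X, y)
    assert est.n_features_in_ == X.shape[1]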

Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'm in favor of having different levels of rigidity for the tests. I think we're already discussing that w.r.t. the error message checks.

GaelVaroquaux commented 4 years ago

We don't set anything in __init__. There's only the case of the SparseCoder, where n_features_in_ is a property which is available right after instantiation, but the __init__ code is unchanged.

All good thank you!

The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.

This is great!! Thanks!

Thorough work!! Awesome. Thank you very much.

amueller commented 4 years ago

Definitely +1 on having a non-strict mode for tests, see https://github.com/scikit-learn/scikit-learn/issues/13969

amueller commented 4 years ago

the question then is whether we require meta-estimators to be flexible to the case that n_features_in_ is / is not present in the base estimator?

Excellent question. I think we should error for any behavior that would require it, but not enforce its presence. Right now the only behavior I can think of is accessing n_features_in_ on the meta-estimator. It might at some point also be feature_names_in_ or something like that.
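A sketch of that behavior (illustrative, not actual meta-estimator code): the meta-estimator exposes n_features_in_ only on access, so a base estimator without the attribute still works until someone actually asks for it:

from sklearn.base import clone

class MetaEstimatorSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # AttributeError propagates naturally if the fitted base
        # estimator does not define the attribute.
        return self.estimator_.n_features_in_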

jnothman commented 4 years ago

But that also dictates that n_features_in_ on a meta-estimator is implemented either as a dynamic property, or in fit but resiliently...

amueller commented 4 years ago

What do you mean by dynamic property? Doesn't @if_delegate_has_attribute together with delegating work?

jnothman commented 4 years ago

Yes that works