Closed NicolasHug closed 5 years ago
We should probably also explain what happens to a pipeline if there's at least one estimator which does not provide the API.
I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).
True. I was thinking of feature names API. That concern doesn't apply here.
Yeah this API will be quite simple tbh.
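For illustration, a minimal sketch of that delegation; MyPipeline is a made-up class, not scikit-learn's actual Pipeline implementation:

class MyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    @property
    def n_features_in_(self):
        # The number of input features is whatever the first step saw
        # during fit; the other steps are purely ignored.
        return self.steps[0][1].n_features_in_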
I wonder if we should discuss n_features_out_ in this SLEP as well. Unfortunately I missed the meeting this morning. Who argued for the SLEP? @rth @ogrisel @jnothman thoughts on whether n_features_out_ should be included here?
Maybe we should keep SLEPs simple when we can? (i.e. have a separate one for n_features_out_ if needed)
Happy to keep it simple. But if the discussion is only around backward compatibility of check_estimator then the SLEP will be a copy & paste.
@scikit-learn/core-devs can we merge and vote please?
I'm happy to have this merged and have a separate but very similar one for n_features_out_.
If we go down the path of warning, it should be shown to library developers and not users.
Edit: Oops, we can have check_estimator warn...
I'm OK with raising a warning instead of an exception in the common checks.
I updated the SLEP accordingly.
Should we merge?
@amueller @jnothman Are you happy with the new changes (https://github.com/scikit-learn/enhancement_proposals/pull/22#issuecomment-539024325) ? @adrinjalali too?
Let's merge?
Thanks Joel for your feedback.
Andy and I discussed this, and I updated the SLEP to now propose a single method _validate_data(X, y=None, reset=True, ...)
It seems that the only remaining discussion is https://github.com/scikit-learn/enhancement_proposals/pull/22#discussion_r335921604
If we're all ok to make the method private and not recommend libraries to use it, let's merge please. I'd like the implementation PR to be merged before we release.
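For concreteness, here is a rough sketch of the idea behind the reset parameter; the helper name and body are illustrative only, not the proposed implementation (the real _validate_data would also validate X and y via check_array / check_X_y):

import numpy as np

def check_n_features(estimator, X, reset):
    n_features = np.asarray(X).shape[1]
    if reset:
        # Called from fit: remember how many features were seen.
        estimator.n_features_in_ = n_features
    elif n_features != estimator.n_features_in_:
        # Called from predict/transform: enforce consistency with fit.
        raise ValueError(
            "X has %d features, but this estimator was fitted with %d "
            "features." % (n_features, estimator.n_features_in_)
        )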
I don't think there is any great need to rush this to release.
If we are keeping these helper methods private then they should be regarded as an implementation detail here: the essence of the SLEP becomes "estimators should provide n_features_in_, and this will be required in check_estimator two versions after introduction. No public helpers will be provided to assist third party libraries with this."
If you think this can pass the vote and lead to quick merge of the implementation, then go ahead and merge this and bring it to vote.
Personally I had understood that a major goal of this work was to facilitate data consistency validation between fit and predict/transform, which could be extended to the column reordering issues. Perhaps that API would then require another SLEP, in which case this SLEP wouldn't provide any leverage towards fixing the column ordering bug for scikit-learn-compatible estimators.
We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.
That's an idea that I like.
I don't think there is any great need to rush this to release.
It is indeed too late now to get anything useful merged before the release. But I'd still like to get this done soon. All the effort put into scikit-learn/scikit-learn#13603 is pretty much lost now because we've been taking so long to come up with a consensus.
Personally I had understood that a major goal of this work was to facilitate data consistency validation
Yes, this work will be useful for consistency checks, and also for feature names propagation. I still don't think that means we should make the method public. But if you disagree, please let it be known.
This SLEP has been cut down to say "Please define n_features_in_ to pass check_estimator in the future".
For most of this SLEP's timeline, I have understood the goal of this SLEP to be improving data consistency validation. We want to do this because it allows for data validation checks in our meta-estimators (mostly Pipeline and ColumnTransformer). To actually perform these checks, the estimator needs to remember information about the data used in fit. If we want to perform these checks on third-party estimators, we need these estimators to also remember information about the data. Here are the options discussed in this PR:
1. Private methods that set n_features_in_. (The private methods are not recommended to be used.)
2. A public function that takes self and will add attributes to the estimator.

I think there is a path for the public function to work:
# 0.23
def validate_data(est, check_n_features_in=True, ...):
    ...
    est.n_features_in_ = ...

# 0.24
# When we move to FutureWarning for deprecations, we can use the
# developer-focused DeprecationWarning to signal this. (Also
# check_estimator will warn.)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False, ...):
    ...

# 0.25
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False,
                  check_other_things=False, ...):
    ...

# 0.26 (check_feature_columns now defaults to True)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=True,
                  check_other_things=False, ...):
    ...
I saw this SLEP as "laying out the foundation to provide data validation for third party libraries". This means getting third party libraries to call validate_data, which right now will do something really simple. As we add more checks to validate_data, third-party libraries will only need to add another flag to validate_data to enable the new feature.
At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_?" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use this feature.
@thomasjpfan I'm afraid your understanding of the SLEP is outdated. We have dropped the check_something parameters in favor of reset=True/False, which would replace all of these.
Also, I don't think the method vs function discussion is relevant anymore.*
The current point of friction is whether the method should be private or not. I'd like to hear Joel's input on that.
*EDIT (OK, I understand now that if we want to make it public it should rather be a function)
At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_?" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use this feature.
I agree on that.
@amueller, could you please write out your thoughts on why n_features_in_ will also help the feature_names SLEP?
I agree that we should try to move forward with this as it's blocking a lot of things, but we should also not rush it.
I think n_features_in_ and n_features_out_ are useful for model inspection on their own.
They will help for feature_names because they allow us to create feature names. For example, any of the scalers can easily create feature names if they know n_features_in_; anything in components can create feature names if it knows n_features_out_.
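To make that concrete, a hypothetical sketch (these helper names are made up, not a proposed API):

def scaler_feature_names(scaler):
    # A scaler has as many outputs as inputs, so n_features_in_ is enough.
    return ["x%d" % i for i in range(scaler.n_features_in_)]

def component_feature_names(transformer, prefix="component"):
    # A PCA-like transformer would instead need n_features_out_ to name
    # its outputs.
    return ["%s%d" % (prefix, i) for i in range(transformer.n_features_out_)]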
There are conceptually two things that I want:
a) adding two required attributes, n_features_in_ and n_features_out_, to all transformers (where applicable).
b) refactoring the validation API to allow column name consistency checking and feature name propagation.
I don't think we should make any helpers public at this stage, so b) would be an internal refactoring, which wouldn't require a SLEP. If we want to create a public function in utils, that would also probably not require a SLEP (though I also don't really see the benefit of that for now).
Honestly I was a bit skeptical of the need for a SLEP here overall, but if anything then we need a SLEP for a), the addition of required attributes. In that case, we should probably write a SLEP for both of them that just says "we'll require these attributes; they are helpful for model inspection".
Even if we end up deciding we want a public function to validate n_features_in_, I'm not sure why that would require a SLEP. We've certainly changed check_estimator frequently in the past without week-long discussions.
Btw, I'm also happy to accept the SLEP in the current stage, which is basically saying "we require n_features_in_". I can write another one for n_features_out_ later... I hope that'll be quicker...
Thanks for the feedback.
I updated the SLEP to only make it about the n_features_in_ attribute (2f37147). The _validate_data method is only a private helper that does nothing but set and validate the attribute.
I'll leave the n_features_out_ attribute to another SLEP, for simplicity.
@amueller @adrinjalali @jnothman @thomasjpfan
good to merge and vote from my side
Nice, let's vote then \o/
Thanks for the merge and the reviews!
@jnothman should I wait until after the release to call for a vote, or are you happy if we do it now?
I'm happy with the overall proposal.
A couple points (that could be added to the SLEP, or to the PR implementing it):
How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?
I'd be keen to not put any additional burden on estimator developers... the question then is whether we require meta-estimators to be flexible to the case that n_features_in_ is / is not present in the base estimator?
I would rather not set anything in init, because it is a change of behavior (things can be set in fit)
We don't set anything in init. There's only the case of the SparseCoder where n_features_in_ is a property which is available right after instantiation, but the init code is unchanged. We could artificially call check_is_fitted in the property, if that's what you mean.
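Something along these lines (a made-up class for illustration, not the actual SparseCoder code):

class SparseCoderLike:
    def __init__(self, dictionary):
        # The dictionary is given at construction time, so the number of
        # input features is known before fit is ever called.
        self.dictionary = dictionary  # shape (n_components, n_features)

    @property
    def n_features_in_(self):
        return self.dictionary.shape[1]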
How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute?
The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.
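Roughly speaking, the gating could look like this (a sketch only, assuming the private _get_tags API; see the linked diff for the actual check):

def check_n_features_in_attribute(name, estimator):
    # Estimators declaring the no_validation tag are exempt.
    if estimator._get_tags().get("no_validation", False):
        return
    X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    y = [0, 1]
    estimator.fit(X, y)
    assert estimator.n_features_in_ == 3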
Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?
I'm in favor of having different levels of rigidity for the tests. I think we're already discussing that w.r.t. the error message checks.
We don't set anything in init. There's only the case of the SparseCoder where n_features_in_ is a property which is available right after instantiation, but the init code is unchanged.
All good thank you!
The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.
This is great!! Thanks!
Thorough work!! Awesome. Thank you very much.
Definitely +1 on having a non-strict mode for tests, see https://github.com/scikit-learn/scikit-learn/issues/13969
the question then is whether we require meta-estimators to be flexible to the case that n_features_in_ is / is not present in the base estimator?
Excellent question. I think we should error for any behavior that would require it but not enforce its presence. Right now the only behavior I can think of is accessing n_features_in_ on the meta-estimator. It might at some point also be feature_names_in_ or something like that.
But that also dictates that n_features_in_ on a meta-estimator is implemented either as a dynamic property, or in fit but resiliently...
What do you mean by dynamic property? Doesn't @if_delegate_has_attribute together with delegating work?
Yes that works
See PR https://github.com/scikit-learn/scikit-learn/pull/13603