I think @glemaitre, @thomasjpfan, and I are a +1 on this, so it'd be nice to hear from others.
Here's a list of steps I think we need to take for this work to be merged with main. Note that https://github.com/scikit-learn/scikit-learn/pull/22083 is to be merged into the sample-props branch and not main, therefore it doesn't have to be perfect before being merged.
- Merge https://github.com/scikit-learn/scikit-learn/pull/22083 into sample-props. This PR only touches BaseEstimator and hence consumers. It does NOT touch meta-estimators, scorers or cv splitters.
- Follow-up PRs into sample-props to add routing to meta-estimators, scorers and CV splitters (note that this involves an ongoing discussion on whether we'd like to mutate a scorer or not). This means implementing get_metadata_routing in easy cases, and a whole lot more in cases where we'd like to keep backward compatibility in parsing input args such as estimator__param in Pipeline.
- Once sample-props is in shape, do a few tests with third party estimators to see if they'd work out of the box. Note that consumer estimators should work out of the box as long as they inherit from BaseEstimator and their fit accepts metadata as explicit arguments rather than **kwargs (a sketch follows this list). Third party meta-estimators can vendor _metadata_requests.py and work with scikit-learn meta-estimators w/o depending on the library.
- Merge sample-props into main.
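As a rough illustration of such a consumer (the class and everything in it is made up for illustration; only the pattern of explicit metadata arguments matters):

from sklearn.base import BaseEstimator, ClassifierMixin

class ThirdPartyClassifier(ClassifierMixin, BaseEstimator):
    # Because sample_weight is an explicit fit argument rather than being
    # hidden in **kwargs, the request machinery inherited from BaseEstimator
    # can discover it; nothing routing-specific has to be written here.
    def fit(self, X, y, sample_weight=None):
        ...
        return self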
Our plan is to hopefully have this feature in 1.1, which we should be releasing in late April/early May. Therefore it'd be nice to have the vote and the review of the first PR done ASAP.
+1
I don't see a better way. A great thanks to @adrinjalali for all the work! I'm excited.
Some notes:
X = pandas.DataFrame({"col_weights": [1, 2, 1], ...})
log_reg = LogisticRegression().fit_requests(sample_weight=True)
cross_validate(
    log_reg, X, y,
    props={"sample_weight": "col_weights"},
)
@lorentzenchr passing column names is usually done via a constructor parameter, but one could pass that as metadata. For your suggestion to work, though, we'd need .transform to output something other than a numpy.ndarray, which @thomasjpfan has been working on, but we don't seem to have reached a consensus there.
My only remark at this point is that it should be explicit in the SLEP that props is metadata in the public API. We have get_metadata_routing but props={...}. I am wondering if we should not have only one new word in the public API (props or metadata).
my 2c
I am wondering if we should not have only one new word in the public API (props or metadata).
Could we handle this pragmatically, i.e. vote in favor of this PR and then have a different discussion/vote on a name change (props or metadata)?
That parameter is called props
basically for historical reasons, and since it's a new parameter and not implemented yet, we can just change it to metadata
which I agree is a much better name for it. I've updated the SLEP accordingly.
Through extensive discussions, we've previously resolved to use metadata
. It's not my favourite term because it's overloaded with meanings. I'm more than happy to go ahead with it, as it seems to match what others here think metadata
might mean.
Is there a good way to comment on specific parts of the SLEP? Or is it too late for that now anyway?
This is all about row-wise (per-sample) meta-data, not arbitrary meta-data, right? Or are we talking about arbitrary meta-data?
Note that if a consumer or a router starts accepting and consuming a certain metadata, the developer API enables developers to raise a warning and avoid silent behavior changes in users’ code. See the draft implementation for more details.
This seems to me like it should be explained a bit more in the slep?
The default values for all the sample_weight
routing should be documented here, I think?
Is the current idea that it's on by default in fit
if the estimator supports it and off if the estimator doesn't?
I assume it's off for any other method. Is there any other default routing? Do meta-estimators get everything by default?
And I assume grouped CV gets groups by default?
Answers to @amueller's questions:
Is there a good way to comment on specific parts of the SLEP? Or is it too late for that now anyway?
You could leave comments the way you did here.
This is all about row-wise (per-sample) meta-data, not arbitrary meta-data, right? Or are we talking about arbitrary meta-data?
No. That would be very backward incompatible. We accept arbitrary metadata in Pipeline
now, and this SLEP doesn't change that. However, this SLEP makes it easy to introduce metadata about metadata to say what is row-metadata and what's not.
This seems to me like it should be explained a bit more in the slep?
We could add an example. But I'm not sure if this is an important part of the SLEP?
The default values for all the
sample_weight
routing should be documented here, I think?
We do say that the only thing requested by default is groups
in Group*CV
, otherwise nothing is requested by default.
Is the current idea that it's on by default in
fit
if the estimator supports it and off if the estimator doesn't? I assume it's off for any other method. Is there any other default routing? Do meta-estimators get everything by default?
Since we want the code to be explicit, and have no silent logic change when models start supporting metadata, nothing (except groups
in Group*CV
) is requested by default. I'm not sure what you mean by supported. By default, there will be no routing done, and if the user passes a metadata which is supported but not requested, we raise.
And I assume grouped CV gets groups by default?
Yes.
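To make that concrete, a sketch using the draft API proposed in the SLEP (fit_requests and the props argument are the draft names at this point in the discussion and won't run against a released scikit-learn; X, y and w are placeholder data):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X = np.random.rand(20, 3)
y = np.random.randint(0, 2, size=20)
w = np.random.rand(20)

# sample_weight is supported by LogisticRegression.fit but not requested,
# so under the proposal this would raise instead of silently ignoring w:
# cross_validate(LogisticRegression(), X, y, props={"sample_weight": w})

# The user has to request it explicitly for the metadata to be routed:
log_reg = LogisticRegression().fit_requests(sample_weight=True)
cross_validate(log_reg, X, y, props={"sample_weight": w})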
Since we want the code to be explicit, and have no silent logic change when models start supporting metadata, nothing (except groups in Group*CV) is requested by default.
Ok thanks that answers my question. I don't see the "nothing is requested by default except groups" in the SLEP but maybe I'm overlooking it?
No. That would be very backward incompatible. We accept arbitrary metadata in Pipeline now, and this SLEP doesn't change that. However, this SLEP makes it easy to introduce metadata about metadata to say what is row-metadata and what's not.
We accept arbitrary fit_params
in Pipeline
, which is arguably a bad choice, and now it looks like we're extending that bad choice to predict
, transform
, score
, splitters and scorers?
How would it be backward incompatible to not support it in a new feature? And what are the current scenarios that we are supporting?
I guess implementation wise it's not that big of a difference but it would allow us to know ahead of time what we need to slice in cross-validation, and it would make for a more straight-forward mental model.
If we want to explicitly allow arbitrary metadata and leave the door open for per-column meta-data we should say that explicitly, and motivate that explicitly, I think.
To avoid the error, LogisticRegression must specify its metadata request by calling fit_requests
I assume a meta-estimator/pipeline requests a particular meta-data iff any of the sub-estimators request that meta-data, and so to not get an error, some component of a meta-estimator or pipeline needs to use each piece of meta-data?
Setting apart the implementation, I think my main concern is whether we want to restrict to sample-aligned meta-data, because I think using array shape to decide when to slice something is evil.
Also, I don't like adding so many methods to the top level object namespace, in particular methods without a common prefix that share a prefix with the most commonly used methods, that are relevant only for a very small portion of users. Though there could be some fixes for that that don't require changes to the essence of the SLEP (two I can think of are adding a metadata
accessor similar to the dt
accessor in Pandas, or just renaming the methods to all start with set_
.)
Ok thanks that answers my question. I don't see the "nothing is requested by default except groups" in the SLEP but maybe I'm overlooking it?
This is only true in the core library because of the distinction between Group and non-Group splitters where in the former groups
is required and in the latter, groups
is ignored.
This is only true in the core library because of the distinction between Group and non-Group splitters where in the former groups is required and in the latter, groups is ignored.
Ok, I was more talking about saying "in the core library nothing [but groups] is requested by default" explicitly in the doc.
We accept arbitrary fit_params in Pipeline, which is arguably a bad choice, and now it looks like we're extending that bad choice to predict, transform, score, splitters and scorers? How would it be backward incompatible to not support it in a new feature? And what are the current scenarios that we are supporting?
I guess implementation wise it's not that big of a difference but it would allow us to know ahead of time what we need to slice in cross-validation, and it would make for a more straight-forward mental model.
In most places where we have a meta-estimator accepting and forwarding **fit_params, we don't do any validation on them and just forward the fit_params to the child estimator. In Pipeline the user can pass whatever they want to the step they want, and in GridSearchCV we check the length and do the split if it's the same length as X (which I agree is evil). This SLEP is only about how to forward metadata to sub-objects' methods, and I'd really rather not talk about having different logic for fit and transform. Whatever fit was supporting is supported in this SLEP, and the same goes for other methods. Note that this is a main reason why it's called metadata and not sample params. We've had this discussion extensively and we converged on the current solution, and I'd rather not change that now, at least for this vote.
However, I do agree we should be able to be explicit on what metadata is sample metadata, what's column metadata, and what's data(set) metadata, and that's not hard to add to the existing API. For instance, you could think of:
est = MyEstimator().fit_requests(sensitive_attribute=True, sensitive_attribute_type="row")
# or
est = MyEstimator().fit_requests(sensitive_attribute=MetadataInfo(request=True, type="row"))
or a variation of the above. It's not too hard to extend the proposed API to include this information, and I'm happy to have that implemented once the proposed API by this SLEP is implemented. Adding that wouldn't even need a SLEP IMO. But adding it here makes the SLEP and the implementation unnecessarily large, and they're already large enough that we're having a hard time getting them accepted or merged.
If we want to explicitly allow arbitrary metadata and leave the door open for per-column meta-data we should say that explicitly, and motivate that explicitly, I think.
We already do; this SLEP is not about that. It's about extending what we already do in fit to other methods, and for consistency we should do the same everywhere. There are plans to allow the user to be explicit about what can be sliced by rows and what cannot.
I assume a meta-estimator/pipeline requests a particular meta-data iff any of the sub-estimators request that meta-data, and so to not get an error, some component of a meta-estimator or pipeline needs to use each piece of meta-data?
Yes, clarified in the SLEP.
Also, I don't like adding so many methods to the top level object namespace, in particular methods without a common prefix that share a prefix with the most commonly used methods, that are relevant only for a very small portion of users. Though there could be some fixes for that that don't require changes to the essence of the SLEP (two I can think of are adding a metadata accessor similar to the dt accessor in Pandas, or just renaming the methods to all start with set_.)
I guess there's gonna be something in whatever solution we pick that somebody doesn't like 😁 . We've gone through a few iterations on the method(s) exposed to the user to set the metadata requests:
- A single method (.set_metadata_requests). It seems nice at first glance: it's a single method with a very nice name. But the signature would be ugly and complicated.
- One method per metadata, e.g. .request_sample_weight(fit=True), but there were objections to that since we would expose many methods if the estimator has many fit params, like in lightGBM.
- The current proposal: one method per consuming method, i.e. {method}_requests.
Again, we're all making compromises here, no solution is making everybody happy. If we add an accessor or a prefix to the request methods, it makes the current arguably verbose code even more verbose:
est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .fit_requests(sample_weight=True)
        # but not groups in predict
        .predict_requests(groups=False),
    ).fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)
would become
est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .metadata.fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .metadata.transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .metadata.fit_requests(sample_weight=True)
        # but not groups in predict
        .metadata.predict_requests(groups=False),
    ).metadata.fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)
Also, I don't agree with you that having fewer top level methods should be a motivation for setting the API here. I would agree if we had tons of methods, but we don't; our API doesn't have many top level methods. If anything, having the request methods top level and prefixed with the name of the corresponding method makes it easy for users to be reminded by their IDE's auto-complete that there is metadata which can be passed to the method. And as for the majority of the users, they wouldn't need to change any code or write any new code anyway.
One thing which may make you happy is that the {method}_requests
method would only exist if there's at least one metadata requested by that method. In the core library, that's only true for fit
on our estimators, and therefore none of the other methods would be exposed to the user. So in practice, if we go with the accessor, there would be only one method under it in the entire library, at least as the status quo goes. If we release this, and then start adding support for metadata to other methods in the core library, we can always deprecate top level methods and put them under an accessor.
Ok, I was more talking about saying "in the core library nothing [but groups] is requested by default" explicitly in the doc.
Done.
Re @adrinjalali's suggestion of passing sensitive_attribute_type in fit_requests: this is the wrong design, but the actual design we need is much simpler and should not require the user to alter their request. By construction, the consumer will know the type it requires and can mark it as sample-aligned or static. Indeed, we can implement this immediately given the richness of your proposed requests, and just say that by default every metadata type is "sample-aligned". Cross-val can then not split data unless it's marked as "sample-aligned". (We might also consider "feature-aligned" but this is trickier because the set and order of features can be changed without metaestimators / resamplers.)
In short, the SLEP will handle metadata that does not require splitting in cross-val beautifully, as long as the cross-val estimators become aware of how these requests will be marked.
I will admit I find fit_requests
uncomfortable to read, despite accepting that proposal. I find the use of a present tense (not infinitive/imperative) verb awkward, and suspect that @amueller's proposal of set_fit_request
will be more intuitive to users, and matches the naming of set_params
which similarly mutates estimator attributes and returns self
. I'm not yet persuaded by namespace accessors (est.request.fit(sample_weight=True)
?) in this context.
By construction, the consumer will know the type it requires and can mark it as sample-aligned or static. Indeed, we can implement this immediately given the richness of your proposed requests, and just say that by default every metadata type is "sample-aligned". Cross-val can then not split data unless it's marked as "sample-aligned".
I love this design, and it is easy to implement. But I would rather not have it in the initial PR, and add it later (before merging into main). As for the SLEP, I've added a note to reflect that.
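For concreteness, the kind of consumer-side declaration being described might look roughly like this (every name here is invented for the sketch; the SLEP deliberately leaves the exact mechanism open):

from sklearn.base import BaseEstimator, ClassifierMixin

class FairClassifier(ClassifierMixin, BaseEstimator):
    # Declared once by the estimator author, not by the user: which metadata
    # is sample-aligned (safe for cross-validation to slice along with X and
    # y) and which is static (passed through unchanged).
    _metadata_types = {
        "sample_weight": "sample-aligned",
        "sensitive_attribute": "sample-aligned",
        "priors": "static",
    }

    def fit(self, X, y, sample_weight=None, sensitive_attribute=None, priors=None):
        ...
        return self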
I will admit I find fit_requests uncomfortable to read, despite accepting that proposal. I find the use of a present tense (not infinitive/imperative) verb awkward, and suspect that @amueller's proposal of set_fit_request will be more intuitive to users, and matches the naming of set_params which similarly mutates estimator attributes and returns self.
I don't think this is a controversial change (although I personally really prefer fit_requests
to set_fit_request
). If more people are happy with set_{method}_request
, I'm happy with it. Changed the SLEP to reflect the new name.
We've had this discussion extensively and we converged on the current solution, and I'd rather not change that now, at least for this vote.
Sorry I know I'm late to the game, can you point me towards this agreement? I'm not convinced by "we do the wrong thing for fit right now, we should do the same wrong thing for everything else", because despite what you say, that's really hard to change later and would be an incompatible change.
I would agree if we had tons of methods, but we don't; our API doesn't have many top level methods. If anything, having the request methods top level and prefixed with the name of the corresponding method makes it easy for users to be reminded by their IDE's auto-complete that there is metadata which can be passed to the method. And as for the majority of the users, they wouldn't need to change any code or write any new code anyway.
I would argue exactly the other way around. These are methods that 95% of our users won't need. But every time I autocomplete fit in Jupyter I will now be reminded that there's a method fit_request. For me that's just annoying and an additional keystroke, but I think it will be confusing for users who don't have the 5 years of context on this. And by design, there will be an autocomplete suggestion on every one of the main methods that is completely irrelevant and confusing to basically everybody but the most advanced users.
I'm not teaching many workshops right now, but I can so easily imagine someone who has just learned what fit does asking me "what does this fit_request do?"
@amueller the last two commits to the SLEP change the text regarding both your latest points, does that mean you're now a +1? 😁
FWIW I think having a method per meta-data accepting method is the least ugly design ;)
Lol does the renaming invalidate previous votes? Ok I'm happy now. I'd really love there to be validation of meta-data but we can discuss this after this slep is accepted again maybe? I definitely agree with the rest of the overall design.
FWIW I think having a method per meta-data accepting method is the least ugly design ;)
That's what I thought too, and that's why I had it that way at first. But now that I've worked with the other way around for a while, I find it quite neat actually.
Lol does the renaming invalidate previous votes? Ok I'm happy now.
I expect people to raise their voices if they disagree with the changes. I expect the changes not to be too controversial, and the ones introduced in the process of this vote have arguably improved the SLEP.
I'd really love there to be validation of meta-data but we can discuss this after this slep is accepted again maybe?
What sort of validation? We do some, but I'm not sure what you mean.
That's what I thought too, and that's why I had it that way at first. But now that I've worked with the other way around for a while, I find it quite neat actually.
Wait I thought I agreed with the current design? That was at least my intention. Right now it's one method to set the meta-data per accepting method, right?
What sort of validation? We do some, but I'm not sure what you mean.
Validating that it's per-row.
Wait I thought I agreed with the current design? That was at least my intention. Right now it's one method to set the meta-data per accepting method, right?
Yes, I had parsed your sentence wrong. We're all on the same page now :D
Validating that it's per-row.
I don't think that needs a SLEP. I very much agree we should do it, and this SLEP allows us to implement the validation once metadata routing is implemented. I think it's a relatively minor issue since we probably don't want to do the validation everywhere, probably only in *GridSearchCV. But that's a separate discussion anyway.
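For concreteness, the kind of per-row check being deferred here might look roughly like this (a standalone sketch; none of these names come from the SLEP or the draft PR):

import numpy as np

def check_sample_aligned(name, value, n_samples):
    # A router such as GridSearchCV could run this on metadata marked as
    # sample-aligned before slicing it together with X and y.
    value = np.asarray(value)
    if value.shape[0] != n_samples:
        raise ValueError(
            f"{name} has {value.shape[0]} entries, expected {n_samples}; "
            "sample-aligned metadata must have one entry per sample."
        )
    return value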
@jeremiedbb that's how it was at the beginning, but there's no good way to have a good signature there. Also, your example doesn't happen since predict and predict_proba don't support sample_weight. And as Andy mentioned, most people don't have any metadata in their pipeline anyway.
there's no good way to have a good signature there
@jeremiedbb, does set_request(fit__sample_weight=True)
appear markedly better to you? I think set_fit_request(sample_weight=True)
would be easier for users to understand when reading and recall when writing.
does set_request(fit__sample_weight=True) appear markedly better to you
I prefer the current state, not a fan of the dunder :) I imagined something like
set_request({
    "fit": {"sample_weight": True},
    "score": {"sample_weight": True}
})
But it's not ideal either, and I can see that it might not necessarily be more understandable or recallable for the user.
And anyway, as pointed out, metadata will almost always be requested by 1 or 2 methods only so it's kind of a detail.
set_request({ "fit": {"sample_weight": True}, "score": {"sample_weight": True} })
That's pretty literally what we had in an early iteration, and we ended up moving away from it since it's not easy for users to remember how to write it down. Especially when you have an alias, it becomes:
set_request({
    "fit": {"sample_weight": "first_weights"},
    "score": {"sample_weight": "second_weights"}
})
And it's confusing if the dictionary is alias -> original_name
or original_name -> alias
. I personally had a hard time writing code with that signature. There's also no IDE auto-complete or hints to make it easy for users to write that, and there can be typos introducing silent bugs. For all of those reasons, we moved away from that signature and it was made private.
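For comparison, the same aliasing with the per-method signature (using the set_{method}_request names adopted earlier in this thread; props and the data below are placeholders following the draft API, so this won't run against a released scikit-learn):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X = np.random.rand(30, 4)
y = np.random.randint(0, 2, size=30)
w_fit = np.random.rand(30)
w_score = np.random.rand(30)

# Request sample_weight under one alias for fit and another for score,
# then pass the aliased keys as metadata:
est = (
    LogisticRegression()
    .set_fit_request(sample_weight="first_weights")
    .set_score_request(sample_weight="second_weights")
)
cross_validate(est, X, y, props={"first_weights": w_fit, "second_weights": w_score})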
This is very exciting. Counting the votes, we have 8 (voted here plus myself) in favor, and none against. A nice pass, and thanks to everybody involved. Still, reviewing https://github.com/scikit-learn/scikit-learn/pull/22083 is on the table ;)
Congrats @adrinjalali. Thank you for pushing this through!
This PR is for us to discuss and collect votes for SLEP006 - Metadata Routing (was sample props).
A rendered version of the SLEP is available here, and detailed past discussions can be found under these PRs and these issues.
The current proposed implementation is drafted under https://github.com/scikit-learn/scikit-learn/pull/22083, where you can find a rendered version of the user guide here and a rendered version of the developer API (only applicable to third party developers and people who write custom estimators) here. These are found under metadata_routing.rst and plot_metadata_routing.py under the aforementioned PR respectively. Note that this PR does NOT touch scorers, splitters, or any of the estimators or meta-estimators in scikit-learn. It implements the machinery in BaseEstimator for consumers only. The PR is also not targeted to main, and instead it's to be merged into the sample-props branch on the main repo, with follow-up PRs to complete the implementation before merging into main.
Please leave your votes here, and comment here, on the mailing list, or open new PRs/issues in this repo or the main repo for specifics if you think further discussion is required.
Note that this vote is not to merge the implementation PR, and is rather to accept the SLEP, and the SLEP does NOT discuss implementation details and the API for us and third party developers; but we're more than happy to discuss the implementation and the API during this voting round.
We also plan to send out a survey asking third party developers for their opinion of our proposed developer API, parallel to this call for vote.
This vote will close on February 21, 2022.