scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

VOTE: SLEP006 - Metadata Routing #65

Closed adrinjalali closed 2 years ago

adrinjalali commented 2 years ago

This PR is for us to discuss and collect votes for SLEP006 - Metadata Routing (was sample props).

A rendered version of the SLEP is available here, and detailed past discussions can be found under these PRs and these issues.

The current proposed implementation is drafted under https://github.com/scikit-learn/scikit-learn/pull/22083, where you can find a rendered version of the user guide here and a rendered version of the developer API (only applicable to third-party developers and people who write custom estimators) here. These are found under metadata_routing.rst and plot_metadata_routing.py in the aforementioned PR, respectively. Note that this PR does NOT touch scorers, splitters, or any of the estimators or meta-estimators in scikit-learn. It implements the machinery in BaseEstimator for consumers only. The PR also does not target main; instead it is to be merged into the sample-props branch on the main repo, with follow-up PRs to complete the implementation before merging into main.
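For orientation, here is a minimal sketch of the consumer-side entry point that machinery adds (get_metadata_routing is the name used in the draft, as mentioned later in this thread; the exact return type follows the implementation PR and is only illustrative here):

from sklearn.linear_model import LogisticRegression

est = LogisticRegression()
# Routers (meta-estimators, *SearchCV, cross-validation helpers) call this
# to discover which metadata each method of the consumer requests:
routing = est.get_metadata_routing()
print(routing)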

Please leave your votes here, and comment here, on the mailing list, or open new PRs/issues in this repo or the main repo for specifics if you think further discussion is required.

Note that this vote is not about merging the implementation PR; it is about accepting the SLEP. The SLEP does NOT discuss implementation details or the API for us and third-party developers, but we're more than happy to discuss the implementation and the API during this voting round.

We also plan to send out a survey asking third-party developers for their opinion on our proposed developer API, in parallel with this call for votes.

This vote will close on February 21, 2022.

adrinjalali commented 2 years ago

I think @glemaitre, @thomasjpfan, and I are a +1 on this, so it'd be nice to hear from others.

adrinjalali commented 2 years ago

Here's a list of steps I think we need to take for this work to be merged into main. Note that https://github.com/scikit-learn/scikit-learn/pull/22083 is to be merged into the sample-props branch and not main, so it doesn't have to be perfect before being merged.

Our plan is to hopefully have this feature in 1.1, which we should be releasing in late April/early May. Therefore it'd be nice to have the vote and the review of the first PR done ASAP.

lorentzenchr commented 2 years ago

+1

I don't see a better way. A great thanks to @adrinjalali for all the work! I'm excited.

Some notes:

adrinjalali commented 2 years ago

@lorentzenchr passing column names is usually a constructor parameter, but one could pass that as metadata. However, for your suggestion to work, we need .transform to output something other than a numpy.ndarray, which @thomasjpfan was working on, but we don't seem to have reached a consensus there.

agramfort commented 2 years ago

My only remark at this point is that it should be explicit in the SLEP that props is metadata in the public API. We have get_metadata_routing but props={...}. I am wondering if we should not have only one new word in the public API (props or metadata).

my 2c

lorentzenchr commented 2 years ago

I am wondering if we should not have only one new word in the public API (props or metadata).

Could we handle this pragmatically, i.e. vote in favor of this PR and then have a different discussion/vote on a name change (props or metadata)?

adrinjalali commented 2 years ago

That parameter is called props basically for historical reasons, and since it's a new parameter and not implemented yet, we can just change it to metadata, which I agree is a much better name for it. I've updated the SLEP accordingly.

jnothman commented 2 years ago

Through extensive discussions, we've previously resolved to use metadata. It's not my favourite term because it's overloaded with meanings. I'm more than happy to go ahead with it, as it seems to match what others here think metadata might mean.

amueller commented 2 years ago

Is there a good way to comment on specific parts of the SLEP? Or is it too late for that now anyway?

This is all about row-wise (per-sample) meta-data, not arbitrary meta-data, right? Or are we talking about arbitrary meta-data?

amueller commented 2 years ago

Note that if a consumer or a router starts accepting and consuming a certain metadata, the developer API enables developers to raise a warning and avoid silent behavior changes in users’ code. See the draft implementation for more details.

This seems to me like it should be explained a bit more in the slep? The default values for all the sample_weight routing should be documented here, I think? Is the current idea that it's on by default in fit if the estimator supports it and off if the estimator doesn't? I assume it's off for any other method. Is there any other default routing? Do meta-estimators get everything by default? And I assume grouped CV gets groups by default?

adrinjalali commented 2 years ago

Answers to @amueller 's questions:

Is there a good way to comment on specific parts of the SLEP? Or is it too late for that now anyway?

You could leave comments the way you did here.

This is all about row-wise (per-sample) meta-data, not arbitrary meta-data, right? Or are we talking about arbitrary meta-data?

No. That would be very backward incompatible. We accept arbitrary metadata in Pipeline now, and this SLEP doesn't change that. However, this SLEP makes it easy to introduce metadata about metadata to say what is row-metadata and what's not.

This seems to me like it should be explained a bit more in the slep?

We could add an example. But I'm not sure if this is an important part of the SLEP?

The default values for all the sample_weight routing should be documented here, I think?

We do say that the only thing requested by default is groups in Group*CV, otherwise nothing is requested by default.

Is the current idea that it's on by default in fit if the estimator supports it and off if the estimator doesn't? I assume it's off for any other method. Is there any other default routing? Do meta-estimators get everything by default?

Since we want the code to be explicit, and to have no silent logic change when models start supporting metadata, nothing (except groups in Group*CV) is requested by default. I'm not sure what you mean by "supports it". By default, no routing is done, and if the user passes metadata which is supported but not requested, we raise an error.
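As an illustration of that behaviour, a minimal sketch assuming the draft fit_requests API from the implementation PR (the data is placeholder, and the raising behaviour described follows the draft, not any released version):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
weights = np.array([1.0, 2.0, 1.0, 2.0])

# Nothing is requested by default: supported-but-unrequested metadata
# passed to a router raises instead of being routed or silently dropped.
cross_validate(LogisticRegression(), X, y, cv=2, fit_params={"sample_weight": weights})  # raises

# An explicit request enables the routing:
est = LogisticRegression().fit_requests(sample_weight=True)
cross_validate(est, X, y, cv=2, fit_params={"sample_weight": weights})  # weights reach fit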

And I assume grouped CV gets groups by default?

Yes.

amueller commented 2 years ago

Since we want the code to be explicit, and have no silent logic change when models start supporting metadata, nothing (except groups in Group*CV) is requested by default.

Ok thanks that answers my question. I don't see the "nothing is requested by default except groups" in the SLEP but maybe I'm overlooking it?

amueller commented 2 years ago

No. That would be very backward incompatible. We accept arbitrary metadata in Pipeline now, and this SLEP doesn't change that. However, this SLEP makes it easy to introduce metadata about metadata to say what is row-metadata and what's not.

We accept arbitrary fit_params in Pipeline, which is arguably a bad choice, and now it looks like we're extending that bad choice to predict, transform, score, splitters and scorers?

How would it be backward incompatible to not support it in a new feature? And what are the current scenarios that we are supporting?

I guess implementation-wise it's not that big of a difference, but it would allow us to know ahead of time what we need to slice in cross-validation, and it would make for a more straightforward mental model.

If we want to explicitly allow arbitrary metadata and leave the door open for per-column meta-data we should say that explicitly, and motivate that explicitly, I think.

amueller commented 2 years ago

To avoid the error, LogisticRegression must specify its metadata request by calling fit_requests

I assume a meta-estimator/pipeline requests a particular meta-data iff any of the sub-estimators request that meta-data, and so to not get an error, some component of a meta-estimator or pipeline needs to use each piece of meta-data?

amueller commented 2 years ago

Setting aside the implementation, I think my main concern is whether we want to restrict to sample-aligned meta-data, because I think using array shape to decide when to slice something is evil.

Also, I don't like adding so many methods to the top-level object namespace, in particular methods that lack a common prefix, share a prefix with the most commonly used methods, and are relevant only to a very small portion of users. Though there could be some fixes for that which don't require changes to the essence of the SLEP (two I can think of are adding a metadata accessor similar to the dt accessor in Pandas, or just renaming the methods to all start with set_).

jnothman commented 2 years ago

Ok thanks that answers my question. I don't see the "nothing is requested by default except groups" in the SLEP but maybe I'm overlooking it?

This is only true in the core library because of the distinction between Group and non-Group splitters where in the former groups is required and in the latter, groups is ignored.
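Concretely, with the standard splitters (tiny placeholder data; the default-request behaviour is per the SLEP):

import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 0, 1])
groups = np.array([0, 0, 1, 1])

# Group splitters require groups and request it by default;
# non-group splitters accept the argument but ignore it.
list(GroupKFold(n_splits=2).split(X, y, groups=groups))  # splits follow groups
list(KFold(n_splits=2).split(X, y, groups=groups))       # groups has no effect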

amueller commented 2 years ago

This is only true in the core library because of the distinction between Group and non-Group splitters where in the former groups is required and in the latter, groups is ignored.

Ok, I was more talking about saying "in the core library nothing [but groups] is requested by default" explicitly in the doc.

adrinjalali commented 2 years ago

We accept arbitrary fit_params in Pipeline, which is arguably a bad choice, and now it looks like we're extending that bad choice to predict, transform, score, splitters and scorers?

How would it be backward incompatible to not support it in a new feature? And what are the current scenarios that we are supporting?

I guess implementation-wise it's not that big of a difference, but it would allow us to know ahead of time what we need to slice in cross-validation, and it would make for a more straightforward mental model.

In most places where we have a meta-estimator accepting and forwarding **fit_params, we don't do any validation on them and just forward the fit_params to the child estimator. In Pipeline the user can pass whatever they want to the step they want, and in GridSearchCV we check the length and do the split if it's the same length as X (which I agree is evil). This SLEP is only about how to forward metadata to sub-objects' methods, and I'd really rather not talk about having different logic for fit and transform. Whatever fit was supporting is supported in this SLEP, and the same goes for the other methods. Note that this is a main reason why it's called metadata and not sample props. We've had this discussion extensively and converged on the current solution, and I'd rather not change that now, at least for this vote.

However, I do agree we should be able to be explicit on what metadata is sample metadata, what's column metadata, and what's data(set) metadata, and that's not hard to add to the existing API. For instance, you could think of:

est = MyEstimator().fit_requests(sensitive_attribute=True, sensitive_attribute_type="row")
# or
est = MyEstimator().fit_requests(sensitive_attribute=MetadataInfo(request=True, type="row"))

or a variation of the above. It's not too hard to extend the proposed API to include this information, and I'm happy to have that implemented once the API proposed by this SLEP is implemented. Adding that wouldn't even need a SLEP IMO. But adding it here makes the SLEP and the implementation unnecessarily large, and they're already large enough that we're having a hard time getting them accepted or merged.

If we want to explicitly allow arbitrary metadata and leave the door open for per-column meta-data we should say that explicitly, and motivate that explicitly, I think.

We already do; this SLEP is not about that. It's about extending what we do in fit to the other methods, and for consistency we should do the same everywhere. There are also plans to allow users to be explicit about what can be sliced by rows and what cannot.

I assume a meta-estimator/pipeline requests a particular meta-data iff any of the sub-estimators request that meta-data, and so to not get an error, some component of a meta-estimator or pipeline needs to use each piece of meta-data?

Yes, clarified in the SLEP.
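For illustration, a hypothetical sketch of that aggregation rule (requested_metadata is a made-up attribute, not SLEP API):

def router_requests(sub_estimators):
    # A meta-estimator requests a key iff at least one sub-estimator
    # requests it; a passed key that no component requests is an error.
    requested = set()
    for est in sub_estimators:
        requested |= set(est.requested_metadata)  # hypothetical attribute
    return requested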

Also, I don't like adding so many methods to the top-level object namespace, in particular methods that lack a common prefix, share a prefix with the most commonly used methods, and are relevant only to a very small portion of users. Though there could be some fixes for that which don't require changes to the essence of the SLEP (two I can think of are adding a metadata accessor similar to the dt accessor in Pandas, or just renaming the methods to all start with set_).

I guess there's gonna be something in whatever solution we pick that somebody doesn't like 😁 . We've gone through a few iterations on the method(s) exposed to the user to set the metadata requests.

If we add an accessor or a prefix to the request methods, it makes the current arguably verbose code even more verbose:

est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .fit_requests(sample_weight=True)
        # but not groups in predict
        .predict_requests(groups=False),
    ).fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)

would become

est = SimplePipeline(
    transformer=ExampleTransformer()
    # we want transformer's fit to receive sample_weight
    .metadata.fit_requests(sample_weight=True)
    # we want transformer's transform to receive groups
    .metadata.transform_requests(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .metadata.fit_requests(sample_weight=True)
        # but not groups in predict
        .metadata.predict_requests(groups=False),
    ).metadata.fit_requests(
        # and we want the meta-estimator to receive sample_weight as well
        sample_weight=True
    ),
)

Also, I don't agree with you that having fewer top-level methods should be a motivation for setting the API here. I would agree if we had tons of methods, but we don't; our API doesn't have many top-level methods. If anything, having it top level and prefixed with the name of the corresponding method makes it easy for users to be reminded by their IDE's auto-complete that metadata can be passed to the method. And as for the majority of the users, they wouldn't need to change any code or write any new code anyway.

One thing which may make you happy is that the {method}_requests method would only exist if there's at least one metadata parameter accepted by that method. In the core library, that's only true for fit on our estimators, and therefore none of the other methods would be exposed to the user. So in practice, if we go with the accessor, there would be only one method under it in the entire library, at least as things stand. If we release this and then start adding support for metadata to other methods in the core library, we can always deprecate the top-level methods and put them under an accessor.

Ok, I was more talking about saying "in the core library nothing [but groups] is requested by default" explicitly in the doc.

Done.

jnothman commented 2 years ago

Re @adrinjalali's suggestion of passing sensitive_attribute_type in fit_requests: this is the wrong design, but the actual design we need is much simpler and should not require the user to alter their request. By construction, the consumer will know the type it requires and can mark it as sample-aligned or static. Indeed, we can implement this immediately given the richness of your proposed requests, and just say that by default every metadata type is "sample-aligned". Cross-val can then not split data unless it's marked as "sample-aligned". (We might also consider "feature-aligned", but this is trickier because the set and order of features can be changed without metaestimators / resamplers.)

In short, the SLEP will handle metadata that does not require splitting in cross-val beautifully, as long as the cross-val estimators become aware of how these requests will be marked.
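A hypothetical sketch of that slicing rule (none of these names are in the SLEP; the alignment registry stands in for whatever consumers would declare):

import numpy as np

# Consumers mark each metadata key; "sample-aligned" is the assumed default.
ALIGNMENT = {"sample_weight": "sample-aligned", "calibration_curve": "static"}

def slice_metadata(metadata, indices):
    # Cross-validation slices only sample-aligned metadata; static metadata
    # is passed through untouched instead of being split by array shape.
    sliced = {}
    for key, value in metadata.items():
        if ALIGNMENT.get(key, "sample-aligned") == "sample-aligned":
            sliced[key] = np.asarray(value)[indices]
        else:
            sliced[key] = value
    return sliced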

I will admit I find fit_requests uncomfortable to read, despite accepting that proposal. I find the use of a present tense (not infinitive/imperative) verb awkward, and suspect that @amueller's proposal of set_fit_request will be more intuitive to users, and matches the naming of set_params which similarly mutates estimator attributes and returns self. I'm not yet persuaded by namespace accessors (est.request.fit(sample_weight=True)?) in this context.

adrinjalali commented 2 years ago

By construction, the consumer will know the type it requires and can mark it as sample-aligned or static. Indeed, we can implement this immediately given the richness of your proposed requests, and just say that by default every metadata type is "sample-aligned". Cross-val can then not split data unless it's marked as "sample-aligned".

I love this design, and it is easy to implement. But I would rather not have it in the initial PR and add it later (before merging into main). As for the SLEP, I've added a note to reflect that.

adrinjalali commented 2 years ago

I will admit I find fit_requests uncomfortable to read, despite accepting that proposal. I find the use of a present tense (not infinitive/imperative) verb awkward, and suspect that @amueller's proposal of set_fit_request will be more intuitive to users, and matches the naming of set_params which similarly mutates estimator attributes and returns self.

I don't think this is a controversial change (although I personally really prefer fit_requests to set_fit_request). If more people are happy with set_{method}_request, I'm happy with it. Changed the SLEP to reflect the new name.

amueller commented 2 years ago

We've had this discussion extensively and converged on the current solution, and I'd rather not change that now, at least for this vote.

Sorry I know I'm late to the game, can you point me towards this agreement? I'm not convinced by "we do the wrong thing for fit right now, we should do the same wrong thing for everything else", because despite what you say, that's really hard to change later and would be an incompatible change.

I would agree if we had tons of methods, but we don't; our API doesn't have many top-level methods. If anything, having it top level and prefixed with the name of the corresponding method makes it easy for users to be reminded by their IDE's auto-complete that metadata can be passed to the method. And as for the majority of the users, they wouldn't need to change any code or write any new code anyway.

I would argue exactly the other way around. These are methods that 95% of our users won't need. But every time I autocomplete fit in Jupyter, I will now be reminded that there's a method fit_request. For me that's just annoying and an additional keystroke, but I think it will be confusing for users who don't have the 5 years of context on this. And by design, there will be an autocomplete suggestion on every one of the main methods that is completely irrelevant and confusing to basically everybody but the most advanced users.

I'm not teaching many workshops right now, but I can so imagine someone who has just learned what fit does asking me "what does this fit_request do".

adrinjalali commented 2 years ago

@amueller the last two commits to the SLEP change the text regarding both your latest points, does that mean you're now a +1? 😁

amueller commented 2 years ago

FWIW I think having a method per meta-data accepting method is the least ugly design ;)

amueller commented 2 years ago

Lol does the renaming invalidate previous votes? Ok I'm happy now. I'd really love there to be validation of meta-data but we can discuss this after this slep is accepted again maybe? I definitely agree with the rest of the overall design.

adrinjalali commented 2 years ago

FWIW I think having a method per meta-data accepting method is the least ugly design ;)

That's what I thought too, and that's why I had it that way at first. But now that I've worked with the other way around for a while, I find it quite neat actually.

Lol does the renaming invalidate previous votes? Ok I'm happy now.

I expect people to raise their voices if they disagree with the changes. I expect the changes not to be too controversial, and the ones introduced in the process of this vote have arguably improved the SLEP.

adrinjalali commented 2 years ago

I'd really love there to be validation of meta-data but we can discuss this after this slep is accepted again maybe?

What sort of validation? We do some, but I'm not sure what you mean.

amueller commented 2 years ago

That's what I thought too, and that's why I had it that way at first. But now that I've worked with the other way around for a while, I find it quite neat actually.

Wait I thought I agreed with the current design? That was at least my intention. Right now it's one method to set the meta-data per accepting method, right?

What sort of validation? We do some, but I'm not sure what you mean.

Validating that it's per-row.

adrinjalali commented 2 years ago

Wait I thought I agreed with the current design? That was at least my intention. Right now it's one method to set the meta-data per accepting method, right?

Yes, I had parsed your sentence wrong. We're all on the same page now :D

adrinjalali commented 2 years ago

Validating that it's per-row.

I don't think that needs a SLEP. I very much agree we should do it, and this SLEP allows us to implement the validation once metadata routing is implemented. I think it's a relatively minor issue, since we probably don't want to do the validation everywhere, probably only in *GridSearchCV. But that's a separate discussion anyway.

adrinjalali commented 2 years ago

@jeremiedbb that's how it was at the beginning, but there's no good way to have a good signature there. Also, your example doesn't happen, since predict and predict_proba don't support sample_weight. And as Andy mentioned, most people don't have any metadata in their pipeline anyway.

jnothman commented 2 years ago

there's no good way to have a good signature there

@jeremiedbb, does set_request(fit__sample_weight=True) appear markedly better to you? I think set_fit_request(sample_weight=True) would be easier for users to understand when reading and recall when writing.

jeremiedbb commented 2 years ago

does set_request(fit__sample_weight=True) appear markedly better to you

I prefer the current state, not a fan of the dunder :) I imagined something like

set_request({
    "fit": {"sample_weight": True},
    "score": {"sample_weight": True}
})

But it's not ideal either, and I can see that it might not necessarily be more understandable or recallable for the user.

And anyway, as pointed out, metadata will almost always be requested by 1 or 2 methods only so it's kind of a detail.

adrinjalali commented 2 years ago

set_request({
    "fit": {"sample_weight": True},
    "score": {"sample_weight": True}
})

That's pretty literally what we had in an early iteration, and we ended up moving away from it since it's not easy for users to remember how to write it down. Especially when you have an alias, it becomes:

set_request({
    "fit": {"sample_weight": "first_weights"},
    "score": {"sample_weight": "second_weights"}
})

And it's confusing whether the dictionary maps alias -> original_name or original_name -> alias. I personally had a hard time writing code with that signature. There's also no IDE auto-complete or hints to make it easy for users to write, and typos can introduce silent bugs. For all of those reasons, we moved away from that signature and it was made private.
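For contrast, a sketch of the per-method keyword form the SLEP converged on (MyEstimator is a placeholder, as above; set_fit_request/set_score_request are the renamed methods from earlier in this thread). The keyword is always the parameter name the consumer understands, and a string value is the alias under which the user passes the metadata to the router:

est = (
    MyEstimator()
    # fit's sample_weight is taken from the metadata key "first_weights";
    # score's sample_weight from "second_weights".
    .set_fit_request(sample_weight="first_weights")
    .set_score_request(sample_weight="second_weights")
)
# A router is then called with first_weights=w1 and second_weights=w2,
# and each method receives the right array as its sample_weight.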

adrinjalali commented 2 years ago

This is very exciting. Counting the votes, we have 8 in favor (those who voted here plus myself) and none against. A nice pass; thanks to everybody involved. Still, reviewing https://github.com/scikit-learn/scikit-learn/pull/22083 is on the table ;)

jnothman commented 2 years ago

Congrats @adrinjalali. Thank you for pushing this through!