scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Accept SLEP006 #52

jnothman closed this 3 years ago

jnothman commented 3 years ago

This PR proposes to "Accept" SLEP006.

adrinjalali commented 3 years ago

+1

amueller commented 3 years ago

This is a stupid question, but where exactly is the proposal for the user code? Should I look at the PR? The "Test Cases" section of the SLEP?

jnothman commented 3 years ago

Yes, under Test Cases. I will open a PR to put example usage up front

jnothman commented 3 years ago

Example is now up front, @amueller

lorentzenchr commented 3 years ago

I'm +1 for the general solution where each consumer requests its metadata. Given the issues raised in https://github.com/scikit-learn/scikit-learn/pull/16079, I think we should be more flexible in the actual implementation and not be bound by the letter of SLEP006; examples are naming (metadata vs. auxiliary data or something else), setting default values for requested metadata, and so on.

thomasjpfan commented 3 years ago

I am also +1 in general, but I would vote -1 because I think some updates are needed. My biggest concern is what third-party estimators with many fit parameters, such as LGBMClassifier.fit, have to do to adopt this SLEP. Yes, we can provide a request_all_params in BaseEstimator to help with the generic case, but the SLEP seems to advocate for adding request_* for every single param in fit. That would add 14 request_* methods to LGBMClassifier. We could also provide some nice automatic tools for this, but in the case of LGBMClassifier, it would bring the number of methods from 5 to 19.

adrinjalali commented 3 years ago

One of the reasons we haven't liked request_all_params is that it introduces silent bugs/changes when the underlying estimator changes its parameters. This means that if an estimator doesn't support parameter X in version a.b.c but supports it in a.b.c+1, and X is passed to the pipeline and used in some other steps of the pipeline, the behavior will differ depending on the version of the library the user has installed. This is something we have advocated against so far.

thomasjpfan commented 3 years ago

What about request_fit_params(sample_weight=True, early_stopping_rounds=True, ...)?

adrinjalali commented 3 years ago

And how do you suggest we create this method? With some magical metaclass? That's something we tried not to do, since it'd be quite hard for many folks to understand, and implementing a separate method for each class would also be kinda dirty. FWIW, I wouldn't be opposed to going down the metaclass route.

thomasjpfan commented 3 years ago

To be concrete, I am thinking about the following use case:

grid = GridSearchCV(
    LGBMClassifier().request_sample_weights(fit=True).
                     request_eval_set(fit=True).
                     request_early_stopping_rounds(fit=True).
                     request_callbacks(fit=True), ...)

grid.fit(X, y, sample_weight= ....,
         eval_set=....,
         early_stopping_rounds=...,
         callbacks=....,
)

And whether the following is better:

grid = GridSearchCV(
    LGBMClassifier().request_fit_params(sample_weight=True,
                                        eval_set=True,
                                        early_stopping_rounds=True,
                                        callbacks=True))

grid.fit(X, y, sample_weight= ....,
         eval_set=....,
         early_stopping_rounds=...,
         callbacks=....,
)

By design, the SLEP would force users to list out the fit parameters twice. I was suggesting request_all_params to try to make this example nicer.

FWIW, I wouldn't be opposed to going down the metaclass route.

With PEP 487 and Python 3.6, we have __init_subclass__, which I think would work as a mixin.
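
For illustration, a minimal sketch of how __init_subclass__ could auto-generate a request_fit_params method from an estimator's fit signature; the method name, the stored attribute, and the semantics here are assumptions drawn from this thread, not the SLEP's actual API:

import inspect


class RequestFitParamsMixin:
    # Hypothetical mixin: every subclass gets a request_fit_params method
    # whose accepted keywords are derived from the subclass's fit signature.
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        fit_params = [
            name
            for name in inspect.signature(cls.fit).parameters
            if name not in ("self", "X", "y")
        ]

        def request_fit_params(self, **requests):
            unknown = set(requests) - set(fit_params)
            if unknown:
                raise TypeError(f"Unknown fit parameters: {sorted(unknown)}")
            # A real implementation would feed this into the routing machinery.
            self._metadata_request = {"fit": dict(requests)}
            return self

        cls.request_fit_params = request_fit_params


class DemoClassifier(RequestFitParamsMixin):
    def fit(self, X, y, sample_weight=None, eval_set=None, callbacks=None):
        return self


est = DemoClassifier().request_fit_params(sample_weight=True, eval_set=True)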

jnothman commented 3 years ago

@adrinjalali that's a good point re compatibility.

I'm okay with flipping it around and making it request_fit_params, request_predict_params, etc. Indeed, we could implement only request_fit_params initially and ignore the other methods. Yes, this may require metaclass magic, or at least a helper make_request_params_method('fit', ['sample_weight']). Or whatever.
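
Such a helper (the name make_request_params_method is taken from the comment above; the body is an assumed sketch, not an agreed design) could look roughly like:

def make_request_params_method(method, param_names):
    # Build a request_* method for the given method name and its metadata params.
    def request_params(self, **requests):
        unknown = set(requests) - set(param_names)
        if unknown:
            raise TypeError(f"{method} does not accept metadata {sorted(unknown)}")
        stored = getattr(self, "_metadata_request", {})
        stored.setdefault(method, {}).update(requests)
        self._metadata_request = stored
        return self

    return request_params


class DemoEstimator:
    def fit(self, X, y, sample_weight=None):
        return self

    request_fit_params = make_request_params_method("fit", ["sample_weight"])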

Confusing to now call them params. Is that the right name though?

I'm okay to take this back to the drawing board for another round of editing and voting, but we do need to find a way to make that turnaround quick and consensus-led, so that the next vote is more straightforward!

thomasjpfan commented 3 years ago

From the meeting: I agree that request_all_params is not an option, due to compatibility concerns and not being explicit.

Confusing to now call them params. Is that the right name though?

request_fit_kwargs? This would separate it from get_params, where the params are tied to __init__.

I'm okay to take this back to the drawing board for another round of editing and voting, but we do need to find a way to make that turnaround quick and consensus-led, so that the next vote is more straightforward!

I will have time next week to update the SLEP with the recent discussions. I will also have some time to work on the implementation of the proposed ideas. My aim is to get the SLEP ready before the next meeting.

thomasjpfan commented 3 years ago

I have an API question about this SLEP. Looking at the following example:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

rng = np.random.RandomState(42)
X, y = make_classification(random_state=rng)
sample_weight = rng.random(len(X))

log_reg = LogisticRegression().request_sample_weight(fit=True)

# Sample weight is not passed in. Does this raise an error?
log_reg.fit(X, y)

@adrinjalali @jnothman What was the intended behavior of this interaction?

thomasjpfan commented 3 years ago

Another question: cross_validate uses an explicit metadata argument, while GridSearchCV.fit uses **kwargs. How should a third-party estimator developer decide which one to use?

thomasjpfan commented 3 years ago

There is also an asymmetry in the API. Estimators and group CV splitters use request_*, while scorers use __init__ (via make_scorer) to configure the metadata request. Should we give scorer objects a request_* method too? Something like scorer.request_sample_weight(call=True).

Edit: It is nicer to put the metadata request in make_scorer, so this is most likely a bad idea.

Having request_metadata for make_scorer makes scorers a bit strange. Currently, our scorer is "just a callable". With this SLEP the scorer is a callable with a get_metadata_request method. The only public API to construct one is make_scorer. There is no nice way to give a metadata request to an existing scorer without creating another scorer. This means a user would need to remember to pass in the other scorer-specific arguments, such as greater_is_better=False.

adrinjalali commented 3 years ago

What was the intended behavior of this interaction?

Yes, in this case it would error, since the user has explicitly requested sample_weight. This is why I was adding a "default" or "optional" option to the {True, False} set.

adrinjalali commented 3 years ago

Another question: cross_validate uses an explicit metadata argument, while GridSearchCV.fit uses **kwargs. How should a third-party estimator developer decide which one to use?

kwargs are things passed to the underlying fit, whereas metadata is more explicit about what it is. I have also seen fit_params in some of our methods, which I think I deprecated in favor of metadata.

Having request_metadata for make_scorer makes scorers a bit strange. Currently, our scorer is "just a callable". With this SLEP the scorer is a callable with a get_metadata_request method. The only public API to construct one is make_scorer. There is no nice way to give a metadata request to an existing scorer without creating another scorer.

I think we've thought of scorers as immutable objects so far. That means if the user wants to change one, they need to create a new one anyway.

This means a user would need to remember to pass in the other scorer-specific arguments, such as greater_is_better=False.

Could you maybe clarify?

thomasjpfan commented 3 years ago

Could you maybe clarify?

My scoring concern is related to https://github.com/scikit-learn/scikit-learn/pull/16079#issuecomment-796054847

Let's say we have:

gs = GridSearchCV(LogisticRegression(), scoring="neg_log_loss").fit(X, y)

If we want to turn on routing, the user needs to also know about greater_is_better

from sklearn.metrics import make_scorer
from sklearn.metrics import log_loss

weighted_neg_log_loss = make_scorer(log_loss, greater_is_better=False,
                                    request_metadata=["sample_weight"])
gs = GridSearchCV(LogisticRegression(), scoring=weighted_neg_log_loss).fit(X, y)

It gets more complicated the more kwargs the original scorer needs. The extreme case is:

roc_auc_ovr_weighted_scorer = make_scorer(roc_auc_score, needs_proba=True,
                                          multi_class='ovr',
                                          average='weighted')

I think we can work around this and this point does not block the SLEP. I do not have any more concerns about the scorers.

Related to: https://github.com/scikit-learn/scikit-learn/pull/17962

thomasjpfan commented 3 years ago

kwargs are things passed to the underlying fit, whereas metadata is more explicit as what it is. I also have seen fit_params in some of our methods, which I deprecated in favor of metadata I think.

From the PR, only cross_validate and cross_val_score use an explicit props argument. Meta-estimators such as GridSearchCV use fit_params.

This SLEP is broken down into two parts:

  1. How to configure consumers to have the required metadata
  2. How to configure meta-estimators or functions to pass the metadata to the consumers.

Currently, I am reworking Part 2 and thinking of a consistent story between functions and meta-estimators. Meta-estimators can keep track of state, so they can use a request_sample_weight method to set key aliases. Functions such as cross_validate do not have state, so they need all the metadata passed in at once. I think the recommendation for third-party libraries could be: functions use metadata and everything else uses **fit_params?
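
To make that suggestion concrete, here is a hypothetical sketch of the two calling conventions; request_sample_weight and the metadata argument come from this discussion and the linked PR, not from released scikit-learn:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = make_classification(random_state=0)
sample_weight = np.ones(len(X))

# The consumer explicitly requests the metadata it needs.
log_reg = LogisticRegression().request_sample_weight(fit=True)

# Function: stateless, so all metadata is passed at once through one explicit argument.
cross_validate(log_reg, X, y, metadata={"sample_weight": sample_weight})

# Meta-estimator: stateful, so metadata arrives as **fit_params and is routed per request.
GridSearchCV(log_reg, param_grid={"C": [0.1, 1.0]}).fit(X, y, sample_weight=sample_weight)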

thomasjpfan commented 3 years ago

What is the expected behavior for fit_transform or fit_predict? I suspect we only pass in the metadata requested by fit, but that can be surprising.

adrinjalali commented 3 years ago

Functions use metadata and everything else uses **fit_params?

The point is that not all metadata is used for fit, though; one could route metadata to predict and transform as well. Hence my preference not to use fit_params.

What is the expected behavior for fit_transform or fit_predict? I suspect we only pass in the metadata requested by fit, but that can be surprising.

It depends on what the requested params are. Estimators could also request params for transform and predict, in which case that metadata would be routed to those methods as well.

thomasjpfan commented 3 years ago

The point is that not all metadata is used for fit, though; one could route metadata to predict and transform as well. Hence my preference not to use fit_params.

I see the logic for functions. For meta-estimators such as GridSearchCV, how does one call fit and route scoring_weight + fitting_weight?

weighted_acc = make_scorer(
        accuracy_score, request_metadata={'scoring_weight': 'sample_weight'})
grid = GridSearchCV(...).request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Is this how one would route the weights?
grid.fit(x, y, fitting_weight=fitting_weight, scoring_weight=scoring_weight)

Or do we want to be consistent with functions and have metadata here as well?

thomasjpfan commented 3 years ago

What should the behavior be for:

log_reg = LogisticRegression().request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Option 1. does this error?
log_reg.fit(x, y, sample_weight=my_weight)

# Option 2. does this error?
log_reg.fit(x, y, fitting_weight=my_weight)

adrinjalali commented 3 years ago

For meta-estimators such as GridSearchCV, how does one call fit and route scoring_weight + fitting_weight?

Exactly as you've written it.

As for the other question:

# This is only relevant for meta-estimators, nothing changes when the user interacts with `log_reg` directly.
log_reg = LogisticRegression().request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Option 1. does this error?
# No, this is how the user would call `fit` here.
log_reg.fit(x, y, sample_weight=my_weight)

# Option 2. does this error?
# Yes, the meta-estimators understand the routing, not estimators themselves.
log_reg.fit(x, y, fitting_weight=my_weight)

thomasjpfan commented 3 years ago

What is the API for Group CV in LogisticRegressionCV?

group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})

log_cv = (LogisticRegressionCV(cv=group_cv)
         .request_sample_weight(fit={"fitting_weight": "sample_weight"}))

# Option 1
log_cv.fit(X, y, sample_weight=sample_weight,
           metadata={"fitting_groups": "groups"})

# Option 2
log_cv.fit(X, y, sample_weight=sample_weight, groups=groups)

adrinjalali commented 3 years ago

What is the API for Group CV in LogisticRegressionCV?

# Option 2
log_cv.fit(X, y, sample_weight=sample_weight, fitting_groups=groups)

Why would it be passed through metadata? The idea is to pass everything through parameters as before when it comes to estimators and meta-estimators. Is there anything you've noticed that makes you want to pass them as metadata? I think we discussed this and somehow agreed on passing them as parameters; I don't remember the whole discussion anymore :laughing:

thomasjpfan commented 3 years ago

Why would it be passed through metadata?

To see when we allow the alias to be a **kwarg. cross_validate uses a metadata parameter for passing in groups, while for LogisticRegressionCV we pass it in through **kwargs.

In this case:

group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
         .request_sample_weight(fit={"fitting_weight": "sample_weight"}))

log_cv.fit(X, y, sample_weight=sample_weight, fitting_groups=groups)

Both consumers have aliases, but only the fitting_groups alias is used in fit. I see the underlying logic behind this, but I can also see how it can be confusing for users not familiar with the routing logic.

adrinjalali commented 3 years ago

Both consumers have aliases, but only the fitting_groups alias is used in fit. I see the underlying logic behind this, but I can also see how it can be confusing for users not familiar with the routing logic.

Yes, but this is also a rather advanced use case where the user wants to rename the metadata passed around. Most users would have something like:

group_cv = GroupCV().request_groups(split=True)
log_cv = (LogisticRegressionCV(cv=group_cv)
         .request_sample_weight(fit=True))

log_cv.fit(X, y, sample_weight=sample_weight, groups=groups)

jnothman commented 3 years ago

I admit that using this feature will sometimes be tricky for users. Perhaps there is a better solution, without a major burden on estimator implementation (which would be the case if we supported aliasing when an estimator is used locally), but we've not really found it yet.

Let's make it clear what happens in each case with:

group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
         .request_sample_weight(fit={"fitting_weight": "sample_weight"}))
pipe = make_pipeline(log_cv)
pipe2 = make_pipeline(StandardScaler().request_sample_weight(fit=True), log_cv)

| Which estimator | Passed sample_weight | Passed groups | Passed fitting_weight | Passed fitting_groups | Outcome |
| --- | --- | --- | --- | --- | --- |
| log_cv | y | n | n | y | Passes, weighted |
| log_cv | y | n | n | n | Fails since group_cv requires groups |
| log_cv | n | n | n | y | Passes, not weighted |
| log_cv | n | n | n | n | Fails since group_cv requires groups |
| log_cv | * | y | n | * | Fails: unexpected kwarg |
| log_cv | * | * | y | * | Fails: unexpected kwarg |
| pipe | y | * | * | * | Fails: unexpected kwarg |
| pipe | * | y | * | * | Fails: unexpected kwarg |
| pipe | n | n | * | n | Fails since group_cv requires groups |
| pipe | n | n | n | y | Passes, not weighted |
| pipe | n | n | y | y | Passes, weighted |
| pipe2 | y | n | y | y | Passes, both weighted |
| pipe2 | n | n | y | y | log_cv weighted, no error |
| pipe2 | y | n | n | y | scaler weighted, no error |

Is this reasonable behaviour? Can we improve on it?

(Written in a hurry, please feel free to reorder my rows so that they appear more systematic! I tried to remove some tediously redundant cases....)

adrinjalali commented 3 years ago

I vaguely remember that if a user requests something, it should be there, unless the user makes it optional:

group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
         .request_sample_weight(fit={"fitting_weight": "sample_weight"}))
pipe = make_pipeline(log_cv)
pipe2 = make_pipeline(StandardScaler().request_sample_weight(fit=True), log_cv)
pipe3 = make_pipeline(StandardScaler().request_sample_weight(fit="optional"), log_cv)
pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv))

| Which estimator | Passed sample_weight | Passed groups | Passed fitting_weight | Passed fitting_groups | Outcome |
| --- | --- | --- | --- | --- | --- |
| pipe2 | y | n | y | y | Passes, both weighted |
| pipe2 | n | n | y | y | error, scaler requires weights |
| pipe2 | y | n | n | y | error, log_cv requires weights |
| pipe3 | n | n | y | y | Passes, weighted log_cv |
| pipe4 | * | * | * | n | fails, group_cv needs groups |
| pipe4 | n | * | * | y | passes, not weighted |
| pipe4 | y | * | * | y | fails, sample_weight passed but LRCV is not clear about what it needs |

In our discussions we agreed that if there's a step in a pipeline which accepts a kwarg but the user has not explicitly set whether it needs it or not, and another step requires that kwarg and it is passed, then we should fail and ask the user to be explicit. Therefore the user should write pipe4 as:

pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv).request_sample_weight(fit=False))
# or
pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv).request_sample_weight(fit="optional"))

jnothman commented 3 years ago

I don't recall a discussion of 'optional'. It certainly hasn't found its way into the SLEP, and the syntax you've noted (request_sample_weight(fit="optional")) is problematically ambiguous.

As far as I'm concerned, if an estimator request is there, the fit kwarg may still be absent. The estimator will then raise an error if it was required. Perhaps there are more reasons, but these are the first I think of:

adrinjalali commented 3 years ago

If I recall correctly, @agramfort didn't like the idea of an estimator requesting a property and not giving an error when it's not present. He argued for things to be very explicit. This is why we ended up having to include another state, the default, to detect cases where one estimator has requested sample_weight and another has not, and the user may not realize that one of them is not working with the weights.

Then we had the issue that if the user wants to run a grid search with and without sample weights for an estimator, they can't do so under the above constraint, since in one of those cases it'd error. That's why the idea of "optional" was born: to kinda be explicit about requiring or not requiring a piece of metadata like sample_weight.

jnothman commented 3 years ago

I don't recall ever motivating this "optional" feature. I understand the request to always be optional.

request_sample_weight(fit="blah") means "when my parent estimator is calling my fit, fit_*, or partial_fit method and has a parameter called blah, it should pass it to me as sample_weight". Alex's innovation was about being more defensive around ambiguity in the absence of a request. "Optional" vs "required" would not be about reducing ambiguity but about helping users avoid errors in their code where they forget to pass a parameter.

adrinjalali commented 3 years ago

I don't recall ever motivating this "optional" feature. I understand the request to always be optional.

I'm happy if we all agree that it's always optional. I don't have a strong preference either way.

Perhaps there is a better solution, without a major burden on estimator implementation (which would be the case if we supported aliasing when an estimator is used locally), but we've not really found it yet.

We could go back to the idea of passing around X, y, metadata instead of only X, y everywhere, have estimators themselves get what they need from metadata if present, and add one or two meta-estimators to filter and rename keys in metadata if necessary for advanced cases :grimacing: It's having to be explicit that doesn't allow us to do simpler solutions.
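
A rough sketch of that alternative, purely to illustrate the idea (nothing here is API proposed by the SLEP):

class WeightedEstimator:
    def fit(self, X, y, metadata=None):
        # Every fit accepts a metadata dict; each estimator picks out the keys
        # it understands and silently ignores the rest.
        metadata = metadata or {}
        sample_weight = metadata.get("sample_weight")
        # ... fit using sample_weight if it is not None ...
        return self

A renaming meta-estimator for the advanced cases would then simply rewrite the keys of metadata before delegating to the wrapped estimator.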

For instance, we do change estimators' behaviors all the time, albeit with warnings, and users don't read or take care of those warnings. This means that for many folks out there their code runs differently between sklearn versions without them realizing it, and the world doesn't collapse :shrug:

jnothman commented 3 years ago

I'm less concerned about backwards compatibility than explicitness in user code.

jnothman commented 3 years ago

In any case I'm interested to hear @thomasjpfan's perspective on these details of the proposal.

jnothman commented 3 years ago

It's having to be explicit that doesn't allow us to do simpler solutions.

Well, no, we could be explicit and pass around all pieces of data, but we would have no way to track whether some piece of data was never used, e.g. due to a typo, and we would require adding a metadata param to the fit (etc.) method of every estimator involved directly or indirectly with this metadata...

thomasjpfan commented 3 years ago

I agree with the matrix in https://github.com/scikit-learn/enhancement_proposals/pull/52#issuecomment-821815740 and not having an "optional".

Most of my thoughts are now summarized in #55.

jnothman commented 3 years ago

Closing for now