Closed jnothman closed 3 years ago
+1
This is a stupid question, but where exactly is the proposal for the user code? Should I look at the PR? The "test cases" section of the SLEP?
Yes, under Test Cases. I will open a PR to put example usage up front.
Example is now up front, @amueller
I'm +1 for the general solution where each consumer requests what it needs. Given the issues raised in https://github.com/scikit-learn/scikit-learn/pull/16079, I think we should be more flexible in the actual implementation and not bound by the letter of SLEP006; examples are naming (metadata vs. auxiliary data or something else), setting of default values for requested metadata, and so on.
I am also +1 in general, but I would vote -1 because I think some updates are needed. My biggest concern is what third-party estimators with many fit parameters, such as `LGBMClassifier.fit`, would have to do to adopt this SLEP. Yes, we can provide a `request_all_params` in `BaseEstimator` to help with the generic case, but the SLEP seems to advocate adding a `request_*` method for every single param in `fit`. That would add 14 `request_*` methods to `LGBMClassifier`. We can also provide some nice automatic tools for this, but in the case of `LGBMClassifier`, it would bring the number of methods from 5 to 19.
One of the reasons we haven't liked `request_all_params` is that it introduces silent bugs/changes when the underlying estimator changes its parameters. This means that if an estimator doesn't support parameter X in version a.b.c but supports it in a.b.c+1, and X is passed to the pipeline and used in some other steps of the pipeline, the behavior will differ based on the version of the library the user has installed. This is something we have advocated against so far.
What about `request_fit_params(sample_weight=True, early_stopping_rounds=True, ...)`?
And how do you suggest we create this method? With some magical meta class? That's something we tried not to do since it'd be quite hard for many folks to understand, and implementing a separate method for each class would also be kinda dirty. FWIW, I wouldn't be opposed to going down the metaclass route.
To be concrete, I am thinking about the following use case:

```python
grid = GridSearchCV(
    LGBMClassifier().request_sample_weights(fit=True)
                    .request_eval_set(fit=True)
                    .request_early_stopping_rounds(fit=True)
                    .request_callbacks(fit=True),
    ...)
grid.fit(X, y,
         sample_weight=...,
         eval_set=...,
         early_stopping_rounds=...,
         callbacks=...)
```
Would the following be better?

```python
grid = GridSearchCV(
    LGBMClassifier().request_fit_params(sample_weight=True,
                                        eval_set=True,
                                        early_stopping_rounds=True,
                                        callbacks=True))
grid.fit(X, y,
         sample_weight=...,
         eval_set=...,
         early_stopping_rounds=...,
         callbacks=...)
```

By design, the SLEP would force users to list out the fit parameters twice. I was suggesting `request_all_params` to try to make this example nicer.
> FWIW, I wouldn't be opposed to going down the metaclass route.

With PEP 487 and Python 3.6, we have `__init_subclass__`, which I think would work as a mixin.
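To sketch how that could look (all names here are hypothetical, not the SLEP's actual API), a mixin could inspect `fit`'s signature in `__init_subclass__` and attach one chainable `request_*` method per fit parameter:

```python
import inspect

class RequestMethodsMixin:
    """Hypothetical mixin: uses PEP 487's __init_subclass__ to generate
    a chainable request_<param> method for each fit parameter."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        fit = cls.__dict__.get("fit")
        if fit is None:
            return
        for name in inspect.signature(fit).parameters:
            if name in ("self", "X", "y"):
                continue

            # Factory avoids the late-binding closure pitfall in the loop.
            def make_request(param):
                def request(self, fit=True):
                    # Record the request and return self so calls chain,
                    # as in the SLEP examples above.
                    if not hasattr(self, "_requests"):
                        self._requests = {}
                    self._requests[param] = fit
                    return self
                request.__name__ = f"request_{param}"
                return request

            setattr(cls, f"request_{name}", make_request(name))

class MyClassifier(RequestMethodsMixin):
    def fit(self, X, y, sample_weight=None, eval_set=None):
        return self

est = MyClassifier().request_sample_weight(fit=True).request_eval_set(fit=False)
print(est._requests)  # {'sample_weight': True, 'eval_set': False}
```

This keeps the per-parameter `request_*` surface without each third-party class having to hand-write 14 methods; whether the magic is worth the opacity is exactly the question raised above.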
@adrinjalali that's a good point re compatibility.

I'm okay with flipping it around and making it `request_fit_params`, `request_predict_params`, etc. Indeed, we could implement only `request_fit_params` initially and ignore the other methods. Yes, this may require metaclass magic, or at least a helper like `make_request_params_method('fit', ['sample_weight'])`. Or whatever.
Confusing to now call them params. Is that the right name though?
I'm okay to take this back to the drawing board for another round of editing and voting, but we do need to find a way to make that turnaround quick and consensus-led, so that the next vote is more straightforward!
From the meeting: I agree that `request_all_params` is not an option, due to compatibility concerns and not being explicit.

> Confusing to now call them params. Is that the right name though?

`request_fit_kwargs`? This would separate it from `get_params`, where the params are connected to `__init__`.

> I'm okay to take this back to the drawing board for another round of editing and voting, but we do need to find a way to make that turnaround quick and consensus-led, so that the next vote is more straightforward!
I will have time next week to update the SLEP with the recent discussions. I will also have some time to work on the implementation of the proposed ideas. My aim is to get the SLEP ready before the next meeting.
I have an API question about this SLEP. Looking at the following example:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

rng = np.random.RandomState(42)
X, y = make_classification(random_state=rng)
sample_weight = rng.random(len(X))

log_reg = LogisticRegression().request_sample_weight(fit=True)

# Sample weight is not passed in. Does this raise an error?
log_reg.fit(X, y)
```
@adrinjalali @jnothman What was the intended behavior of this interaction?
Another question: `cross_validate` uses a `metadata` kwarg, while `GridSearch.fit` uses `**kwargs`. How should a third-party estimator developer decide which one to use?
There is also an asymmetry in the API. Estimators and groups use `request_*`, while scoring uses `__init__` to configure the metadata request. Should we give scorer objects a `request_*` method too? Something like `scorer.request_sample_weight(call=True)`.

Edit: It is nicer to put the metadata request in `make_scorer`, so this is most likely a bad idea.
Having `request_metadata` in `make_scorer` makes scorers a bit strange. Currently, our scorer is "just a callable". With this SLEP the scorer is a callable with a `get_metadata_request` method. The only public API to construct one is `make_scorer`. There is no nice way to give a metadata request to a scorer without creating another scorer. This means a user would need to remember to pass in the other scorer-specific arguments, such as `greater_is_better=False`.
> What was the intended behavior of this interaction?

Yes, in this case it would error, since the user has explicitly requested `sample_weight`. This is why I was adding a "default" or "optional" option to the `{True, False}` set.
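To make those states concrete, here is a toy sketch (a made-up helper, not part of the SLEP) where each metadata name maps to `True` (required), `False` (refused), or `"optional"`:

```python
def check_metadata_request(requests, passed):
    """Toy validation: 'requests' maps metadata names to True, False, or
    "optional"; 'passed' is the metadata the user actually supplied.
    Returns the subset of 'passed' that should be routed onward."""
    routed = {}
    for name, state in requests.items():
        if state is True and name not in passed:
            raise ValueError(f"{name!r} was requested but not passed")
        if state is False and name in passed:
            raise ValueError(f"{name!r} was passed but not requested")
        if state in (True, "optional") and name in passed:
            routed[name] = passed[name]
    return routed

# A required request with nothing passed errors, mirroring the
# `log_reg.fit(X, y)` question above:
try:
    check_metadata_request({"sample_weight": True}, {})
except ValueError as exc:
    print(exc)

# An "optional" request simply drops the missing key:
print(check_metadata_request({"sample_weight": "optional"}, {}))  # {}
```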
> Another question: `cross_validate` uses a `metadata` kwarg, while `GridSearch.fit` uses `**kwargs`. How should a third-party estimator developer decide which one to use?

`kwargs` are things passed to the underlying `fit`, whereas `metadata` is more explicit about what it is. I have also seen `fit_params` in some of our methods, which I think I deprecated in favor of `metadata`.
> Having `request_metadata` in `make_scorer` makes scorers a bit strange. Currently, our scorer is "just a callable". With this SLEP the scorer is a callable with a `get_metadata_request` method. The only public API to construct one is `make_scorer`. There is no nice way to give a metadata request to a scorer without creating another scorer.

I think we've thought of scorers as immutable objects so far. That means if the user wants to change one, they need to create a new one anyway.

> This means a user would need to remember to pass in the other scorer-specific arguments, such as `greater_is_better=False`.

Could you maybe clarify?
> Could you maybe clarify?

My scoring concern is related to https://github.com/scikit-learn/scikit-learn/pull/16079#issuecomment-796054847

Let's say we have:

```python
gs = GridSearchCV(LogisticRegression(), scoring="neg_log_loss").fit(X, y)
```

If we want to turn on routing, the user also needs to know about `greater_is_better`:

```python
from sklearn.metrics import make_scorer
from sklearn.metrics import log_loss

weighted_neg_log_loss = make_scorer(log_loss, greater_is_better=False,
                                    request_metadata=["sample_weight"])
gs = GridSearchCV(LogisticRegression(), scoring=weighted_neg_log_loss).fit(X, y)
```
It gets more complicated the more kwargs the original scorer needs. The extreme case is:

```python
roc_auc_ovr_weighted_scorer = make_scorer(roc_auc_score, needs_proba=True,
                                          multi_class='ovr',
                                          average='weighted')
```

I think we can work around this, and this point does not block the SLEP. I do not have any more concerns about the scorers.
Related to: https://github.com/scikit-learn/scikit-learn/pull/17962
> kwargs are things passed to the underlying fit, whereas metadata is more explicit about what it is. I have also seen fit_params in some of our methods, which I think I deprecated in favor of metadata.

From the PR, only `cross_validate` and `cross_val_score` use an explicit `props` argument. Meta-estimators such as `GridSearchCV` use `fit_params`.
This SLEP is broken down into two parts:
Currently, I am reworking Part 2 and thinking of a consistent story between functions and meta-estimators. Meta-estimators can keep track of state, so they can use a `request_sample_weight` method to have key aliases. Functions such as `cross_validate` do not have state, so they need all the metadata passed in at once. I think the recommendation for third-party libraries can be: functions use `metadata` and everything else uses `**fit_params`?
What is the expected behavior for `fit_transform` or `fit_predict`? I suspect we only pass in the metadata requested by `fit`, but that can be surprising.
> Functions use `metadata` and everything else uses `**fit_params`?

The point is that not all metadata is used for `fit`; one could route metadata to `predict` and `transform` as well. Hence my preference not to use `fit_params`.
> What is the expected behavior for `fit_transform` or `fit_predict`? I suspect we only pass in the metadata requested by `fit`, but that can be surprising.

It depends on what the requested params are. Estimators could also request params for `transform` and `predict`, in which case those metadata would be routed to those methods as well.
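A toy sketch of that routing (the request structure and helper are hypothetical, not the SLEP's implementation): requests are keyed per method, and a compound call like `fit_transform` would see the union of the `fit` and `transform` requests:

```python
def route_metadata(requests, method, metadata):
    """Filter 'metadata' down to what 'method' requested. Compound
    methods receive the union of their components' requests."""
    if method in ("fit_transform", "fit_predict"):
        parts = method.split("_")
    else:
        parts = [method]
    allowed = set().union(*(requests.get(p, set()) for p in parts))
    return {k: v for k, v in metadata.items() if k in allowed}

# Illustrative request table: metadata names are made up.
requests = {"fit": {"sample_weight"}, "transform": {"row_groups"}}
metadata = {"sample_weight": [0.5, 2.0], "row_groups": [0, 1]}

print(route_metadata(requests, "fit", metadata))
# {'sample_weight': [0.5, 2.0]}
print(route_metadata(requests, "fit_transform", metadata))
# {'sample_weight': [0.5, 2.0], 'row_groups': [0, 1]}
```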
> The point is that not all metadata is used for `fit`; one could route metadata to `predict` and `transform` as well. Hence my preference not to use `fit_params`.

I see the logic for functions. For meta-estimators such as `GridSearchCV`, how does one call `fit` and route `scoring_weight` + `fitting_weight`?

```python
weighted_acc = make_scorer(
    accuracy_score, request_metadata={'scoring_weight': 'sample_weight'})
grid = GridSearchCV(...).request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Is this how one would route the weights?
grid.fit(X, y, fitting_weight=fitting_weight, scoring_weight=scoring_weight)
```

Or do we want to be consistent with functions and have `metadata` here as well?
What should the behavior be for:

```python
log_reg = LogisticRegression().request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Option 1. Does this error?
log_reg.fit(X, y, sample_weight=my_weight)

# Option 2. Does this error?
log_reg.fit(X, y, fitting_weight=my_weight)
```
> For meta-estimators such as `GridSearchCV`, how does one call `fit` and route `scoring_weight` + `fitting_weight`?

Exactly as you've written it.

As for the other question:

```python
# This is only relevant for meta-estimators; nothing changes when the user
# interacts with `log_reg` directly.
log_reg = LogisticRegression().request_sample_weight(fit={"fitting_weight": "sample_weight"})

# Option 1. Does this error?
# No, this is how the user would call `fit` here.
log_reg.fit(X, y, sample_weight=my_weight)

# Option 2. Does this error?
# Yes: the meta-estimators understand the routing, not the estimators themselves.
log_reg.fit(X, y, fitting_weight=my_weight)
```
What is the API for group CV in `LogisticRegressionCV`?

```python
group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
          .request_sample_weight(fit={"fitting_weight": "sample_weight"}))

# Option 1
log_cv.fit(X, y, sample_weight=sample_weight,
           metadata={"fitting_groups": groups})

# Option 2
log_cv.fit(X, y, sample_weight=sample_weight, groups=groups)
```
> What is the API for group CV in `LogisticRegressionCV`?

```python
# Option 2
log_cv.fit(X, y, sample_weight=sample_weight, fitting_groups=groups)
```

Why would it be passed through `metadata`? The idea is to pass everything through parameters, as before, when it comes to estimators and meta-estimators. Have you noticed anything that makes you want to pass them as `metadata`? I think we discussed this and somehow agreed on passing them as parameters; I don't remember the whole discussion anymore :laughing:
> Why would it be passed through metadata?

To see when we allow the alias to be a `**kwarg`. `cross_validate` uses a `metadata` parameter for passing in groups, while for `LogisticRegressionCV` we pass it in through `**kwargs`.
In this case:

```python
group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
          .request_sample_weight(fit={"fitting_weight": "sample_weight"}))
log_cv.fit(X, y, sample_weight=sample_weight, fitting_groups=groups)
```

both consumers have aliases, but only the `fitting_groups` alias is used in `fit`. I see the underlying logic behind this, but I can also see how it can be confusing for users not familiar with the routing logic.
> Both consumers have aliases, but only the `fitting_groups` alias is used in `fit`. I see the underlying logic behind this, but I can also see how it can be confusing for users not familiar with the routing logic.

Yes, but this is also a rather advanced use case where the user wants to rename the metadata passed around. Most users would have something like:

```python
group_cv = GroupCV().request_groups(split=True)
log_cv = (LogisticRegressionCV(cv=group_cv)
          .request_sample_weight(fit=True))
log_cv.fit(X, y, sample_weight=sample_weight, groups=groups)
```

I admit that using this feature will sometimes be tricky for users. Perhaps there is a better solution without a major burden on estimator implementation (which would be the case if we supported aliasing when an estimator is used locally), but we've not really found it yet.
Let's make it clear what happens in each case with:

```python
group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
          .request_sample_weight(fit={"fitting_weight": "sample_weight"}))
pipe = make_pipeline(log_cv)
pipe2 = make_pipeline(StandardScaler().request_sample_weight(fit=True), log_cv)
```

| Which estimator | Passed `sample_weight` | Passed `groups` | Passed `fitting_weight` | Passed `fitting_groups` | Outcome |
|---|---|---|---|---|---|
| log_cv | y | n | n | y | Passes, weighted |
| log_cv | y | n | n | n | Fails since group_cv requires groups |
| log_cv | n | n | n | y | Passes, not weighted |
| log_cv | n | n | n | n | Fails since group_cv requires groups |
| log_cv | * | y | n | * | Fails: unexpected kwarg |
| log_cv | * | * | y | * | Fails: unexpected kwarg |
| pipe | y | * | * | * | Fails: unexpected kwarg |
| pipe | * | y | * | * | Fails: unexpected kwarg |
| pipe | n | n | * | n | Fails since group_cv requires groups |
| pipe | n | n | n | y | Passes, not weighted |
| pipe | n | n | y | y | Passes, weighted |
| pipe2 | y | n | y | y | Passes, both weighted |
| pipe2 | n | n | y | y | log_cv weighted, no error |
| pipe2 | y | n | n | y | scaler weighted, no error |

Is this reasonable behaviour? Can we improve on it?

(Written in a hurry, please feel free to reorder my rows so that they appear more systematic! I tried to remove some tediously redundant cases...)
I vaguely remember that if a user requests something, it should be there, unless the user makes it optional:

```python
group_cv = GroupCV().request_groups(split={"fitting_groups": "groups"})
log_cv = (LogisticRegressionCV(cv=group_cv)
          .request_sample_weight(fit={"fitting_weight": "sample_weight"}))
pipe = make_pipeline(log_cv)
pipe2 = make_pipeline(StandardScaler().request_sample_weight(fit=True), log_cv)
pipe3 = make_pipeline(StandardScaler().request_sample_weight(fit="optional"), log_cv)
pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv))
```

| Which estimator | Passed `sample_weight` | Passed `groups` | Passed `fitting_weight` | Passed `fitting_groups` | Outcome |
|---|---|---|---|---|---|
| pipe2 | y | n | y | y | Passes, both weighted |
| pipe2 | n | n | y | y | Error, scaler requires weights |
| pipe2 | y | n | n | y | Error, log_cv requires weights |
| pipe3 | n | n | y | y | Passes, weighted log_cv |
| pipe4 | * | * | * | n | Fails, group_cv needs groups |
| pipe4 | n | * | * | y | Passes, not weighted |
| pipe4 | y | * | * | y | Fails, sample_weight passed but LRCV is not clear about what it needs |

In our discussions we agreed that if there's a step in a pipeline which accepts a kwarg, the user has not explicitly set whether that step needs it, and another step requires it and it is passed, then we should fail and ask the user to be explicit. Therefore the user should write `pipe4` as:

```python
pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv).request_sample_weight(fit=False))
# or
pipe4 = make_pipeline(StandardScaler().request_sample_weight(fit=True),
                      LogisticRegressionCV(cv=group_cv).request_sample_weight(fit="optional"))
```
I don't recall a discussion of "optional". It has certainly not found its way into the SLEP, and the syntax you've noted (`request_sample_weight(fit="optional")`) is problematically ambiguous.

As far as I'm concerned, even if an estimator request is there, the fit kwarg may still be absent; the estimator will then raise an error if it was required. Perhaps there are more reasons, but these are the first I think of:
If I recall correctly, @agramfort didn't like the idea of an estimator requesting a property and not raising an error when it's not present. He argued for things to be very explicit. This is why we ended up having to include another state, the default, to detect cases where one estimator has requested `sample_weight` and another has not, and the user may not realize that one of them is not working with the weights.

Then we had the issue that if a user wants to run a grid search with and without sample weights for an estimator, they can't with the above constraint, since in one of those cases it'd error. That's why the idea of `"optional"` was born: to be explicit about requiring or not requiring a metadata item like `sample_weight`.
I don't recall ever motivating this "optional" feature. I understand the request to always be optional. `request_sample_weight(fit="blah")` means "when my parent estimator is calling my `fit`, `fit_*`, or `partial_fit` method and has a parameter called `blah`, it should pass it to me as `sample_weight`". Alex's innovation was about being more defensive around ambiguity in the absence of a request. "Optional" vs "required" would not be about reducing ambiguity but about helping users avoid errors in their code where they forget to pass a parameter.
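A toy rendering of that sentence (the helper name is made up): the parent resolves the alias to the child's parameter name, and simply forwards nothing when the aliased key is absent, i.e. the request is always optional:

```python
def resolve_request(alias_map, parent_kwargs):
    """'alias_map' maps the name the parent sees to the name the child's
    method expects, e.g. {"blah": "sample_weight"}. Missing aliases are
    not forwarded rather than raising an error."""
    return {child: parent_kwargs[alias]
            for alias, child in alias_map.items()
            if alias in parent_kwargs}

# request_sample_weight(fit="blah") recorded as {"blah": "sample_weight"}:
print(resolve_request({"blah": "sample_weight"}, {"blah": [1.0, 2.0]}))
# {'sample_weight': [1.0, 2.0]}
print(resolve_request({"blah": "sample_weight"}, {}))
# {}
```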
> I don't recall ever motivating this "optional" feature. I understand the request to always be optional.

I'm happy if we all agree that it's always optional. I don't have a strong preference either way.
> Perhaps there is a better solution, without a major burden on estimator implementation (which would be the case if we supported aliasing when an estimator is used locally), but we've not really found it yet.

We could go back to the idea of passing around `X, y, metadata` instead of only `X, y` everywhere, have estimators themselves get what they need from `metadata` if present, and add one or two meta-estimators to filter and rename keys in `metadata` if necessary for advanced cases :grimacing: It's the requirement to be explicit that doesn't allow us to do simpler solutions.
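A minimal sketch of that alternative, with made-up names: every `fit` accepts a full `metadata` dict and pulls out only the keys it understands, and a small meta-estimator renames keys for the advanced aliasing cases:

```python
class WeightedModel:
    """Toy estimator: ignores metadata keys it does not understand."""
    def fit(self, X, y, metadata=None):
        self.sample_weight_ = (metadata or {}).get("sample_weight")
        return self

class RenameMetadata:
    """Toy meta-estimator that renames metadata keys before delegating."""
    def __init__(self, estimator, mapping):
        self.estimator, self.mapping = estimator, mapping

    def fit(self, X, y, metadata=None):
        metadata = metadata or {}
        renamed = {self.mapping.get(k, k): v for k, v in metadata.items()}
        return self.estimator.fit(X, y, metadata=renamed)

model = RenameMetadata(WeightedModel(), {"fitting_weight": "sample_weight"})
fitted = model.fit([[0], [1]], [0, 1], metadata={"fitting_weight": [0.5, 2.0]})
print(fitted.sample_weight_)  # [0.5, 2.0]
```

The downside, as noted below, is that an unused or misspelled key would pass through silently unless something tracks consumption.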
For instance, we do change estimators' behaviors all the time, albeit with warnings, and users don't read or take care of those warnings. This means for many folks out there their code runs differently between sklearn versions w/o them realizing it, and the world doesn't collapse :shrug:
I'm less concerned about backwards compatibility than explicitness in user code.
In any case I'm interested to hear @thomasjpfan's perspective on these details of the proposal.
> It's having to be explicit which doesn't allow us to do simpler solutions.

Well, no: we could be explicit and pass around all pieces of data, but we would have no way to track whether some piece of data was never used (e.g. due to a typo), and we would require adding a `metadata` param to the `fit` etc. method of every estimator involved directly or indirectly with this metadata...
I agree with the matrix in https://github.com/scikit-learn/enhancement_proposals/pull/52#issuecomment-821815740 and not having an "optional".
Most of my thoughts are now summarized in #55.
Closing for now
This PR proposes to "Accept" SLEP006.