scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Rewrite SLEP006 to be easier to read and vote on #55

Closed thomasjpfan closed 2 years ago

thomasjpfan commented 3 years ago

Toward https://github.com/scikit-learn/enhancement_proposals/pull/52

Here is a rendered version of the changes.

The goal of this rewrite is to reduce the length of the SLEP while still capturing the points of the original version.

  1. Introduces producers as objects that produce the metadata for the consumers.
  2. Changes the SLEP to have two methods, get_metadata_request and set_metadata_request (see the sketch below).
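
A minimal sketch of what those two methods might look like on an estimator (the dict shape here is an assumption for illustration, not a settled API):

class Estimator:
    def set_metadata_request(self, requests):
        # e.g. requests = {"fit": {"sample_weight": True}}
        self._metadata_request = requests
        return self

    def get_metadata_request(self):
        # routers (e.g. Pipeline) inspect this to decide what to pass along
        return getattr(self, "_metadata_request", {})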

@adrinjalali Can you show me a case where we need sets in get_metadata_request?

As noted in https://github.com/scikit-learn/enhancement_proposals/pull/52, I am unsure about having only cross_validate and cross_val_score accept metadata. Everything else, such as GridSearchCV, will be using **kwargs.

CC @jnothman

adrinjalali commented 3 years ago

Can you show me a case where we need sets in get_metadata_request?

This is what I had in mind:

pipe = make_pipeline(
    # route the metadata passed as `sensitive_attribute` to this step's `groups`
    Transformer().request_groups(fit='sensitive_attribute'),
    # request `sensitive_attribute` under its own name
    Predictor().request_sensitive_attribute(fit=True)
)

However, this would be masked in get_metadata_request:

pipe.get_metadata_request()["fit"]
{'sensitive_attribute': 'sensitive_attribute'}

and it is only used in the routing; we could handle routing differently from how it is done in my prototype. We can probably do without it, I think, though I'm not sure.

GaelVaroquaux commented 3 years ago

est.fit_request(sample_weight='my_sw') est.score_request(sample_weight='score_sw')

I think that this is a beautiful solution!!!

Would it be interesting to consider est.fit_request(sample_weight=True) as an equivalent of est.fit_request(sample_weight="sample_weight"), to follow the design principle of convention over configuration (we don't want to be the zope of data science)?
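
That is, the following two calls would be interchangeable (fit_request is the API proposed in this SLEP, not released scikit-learn):

from sklearn.linear_model import LogisticRegression

# True as shorthand for "request this metadata under its own name"
est = LogisticRegression().fit_request(sample_weight=True)
est = LogisticRegression().fit_request(sample_weight="sample_weight")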

thomasjpfan commented 3 years ago

Would it be interesting to consider est.fit_request(sample_weight=True) as a equivalent of est.fit_request(sample_weight="sample_weight")

For this SLEP, I think we are considering them equivalent.

thomasjpfan commented 3 years ago

I am concerned about the difference between fit_request(sample_weight=False) and fit_request(sample_weight=None).

I am still a little uncomfortable with the consumer not checking for the existence of the metadata. Was the following something that was discussed?

log_reg_true = LogisticRegression().fit_request(sample_weight=True)
# raises an error: sample_weight was requested but not passed
log_reg_true.fit(X, y)

log_reg_false = LogisticRegression().fit_request(sample_weight=False)
# raises an error: sample_weight was passed despite the request being False
log_reg_false.fit(X, y, sample_weight=sample_weight)

log_reg_none = LogisticRegression().fit_request(sample_weight=None)

# both versions are okay
log_reg_none.fit(X, y, sample_weight=sample_weight)
log_reg_none.fit(X, y)
thomasjpfan commented 3 years ago

@adrinjalali To make sure I understand correctly, is this the expected behavior?

pipe = make_pipeline(
    Transformer().request_groups(fit='sensitive_attribute'),
    Predictor().request_sensitive_attribute(fit=True)
)

pipe.get_metadata_requests()["fit"]
{"sensitive_attribute": set("sensitive_attribute", "groups"))}

# This would tell a consumer to do this:
pipe.fit(X, y, sensitive_attribute=metadata["sensitive_attribute"],
         groups=metadata["sensitive_attribute"])

This is kind of why I thought of reversing the representation:

pipe.get_metadata_requests()["fit"]
{"groups": "sensitive_attribute",
 "sensitive_attribute": "sensitive_attribute"}

The keys of this dictionary map directly to the kwargs in fit.
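
A router could then build the fit kwargs for each child directly from this mapping; a rough sketch, where metadata is assumed to hold the user-supplied values:

request = pipe.get_metadata_requests()["fit"]
# each fit kwarg maps to the user-supplied key it is aliased to
fit_kwargs = {kwarg: metadata[alias] for kwarg, alias in request.items()}
pipe.fit(X, y, **fit_kwargs)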

jnothman commented 3 years ago

I was trying to simplify the story by having metaestimators be a consumer

My intention was that a consumer is distinguished since it handles the interpretation of input metadata. This is in contrast with routers (note that LRCV is not a metaestimator but is responsible for routing to cv and scoring), whose role is to know how to pass their metadata on (and, in the case of CV routers, to know how to split that metadata, i.e. deal with its structure but not its interpretation). IMO this distinction is an important one. I'm not really into producer as a thing yet; usually the producer is the user, IMO, or it could be a Resampler. But my understanding of these things is apparently far from patent in the SLEP.
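
For example, with today's estimators (routing behaviour as proposed, not yet implemented):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GroupKFold

# LogisticRegressionCV *consumes* sample_weight (it interprets the values),
# but it *routes* groups: it only forwards groups to its GroupKFold
# splitter, dealing with their structure but not their interpretation.
lr_cv = LogisticRegressionCV(cv=GroupKFold(n_splits=3))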

jnothman commented 3 years ago

est.fit_request(sample_weight='my_sw')

I don't like using the name fit_* for a method that doesn't fit the model. I'd rather set_fit_request.

This is kind of why I thought of reversing the representation:

pipe.get_metadata_requests()["fit"]
{"groups": "sensitive_attribute",
"sensitive_attribute": "sensitive_attribute"}

I like this. It also supports the case of erroring if a supported metadata element is passed but not requested.

In https://github.com/scikit-learn/enhancement_proposals/pull/55#issuecomment-830863662:

fit_request(sample_weight=False): producer should pass sample_weight to the consumer.

I think this is a typo and should say "should not pass"

you shall not pass

I am still a little uncomfortable with the consumer not checking for the existence of the metadata.

The current approach was designed to place no additional implementation burden upon the consumer, beyond what would be facilitated, ideally, by some kind of mixin that would create the request_sample_weight method. Do we want to place that burden on estimator developers? Could we roll it into _validate_data?

We could make it optional for an estimator to check that its input is consistent with its request...?
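
Something along these lines, as a rough sketch (names and mechanics purely illustrative, not a settled design):

class RequestSampleWeightMixin:
    def request_sample_weight(self, *, fit=None):
        # record the request; routers read it back via the metadata request
        self._metadata_request = {"fit": {"sample_weight": fit}}
        return self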

adrinjalali commented 3 years ago

This is kind of why I thought of reversing the representation:

pipe.get_metadata_requests()["fit"]
{"groups": "sensitive_attribute",
 "sensitive_attribute": "sensitive_attribute"}

The keys of this dictionary map directly to the kwargs in fit.

Yeah I'm fine with this.

I don't like using the name fit_* for a method that doesn't fit the model. I'd rather set_fit_request.

I think fit_requests (as opposed to fit_request) is quite clear though

I am concerned about the difference between fit_request(sample_weight=False) and fit_request(sample_weight=None).

To correct this:

  • fit_request(sample_weight=False): producer should not pass sample_weight to the consumer.
  • fit_request(sample_weight=None): producer may or may not pass sample_weight to the consumer.

thomasjpfan commented 3 years ago

@adrinjalali I am confused about the semantics of None. Can you provide a code example where the behavior of False and None differs?

adrinjalali commented 3 years ago

@adrinjalali I am confused about the semantics of None. Can you provide a code example where the behavior of False and None differs?

# implicit None for sample_weight on SVC
make_pipeline(
    StandardScaler().fit_requests(sample_weight=True),
    SVC()
).fit(X, y, sample_weight)
# raises: sample_weight is provided but not explicitly set as required or not for some of the estimators

make_pipeline(
    StandardScaler().fit_requests(sample_weight=True),
    SVC().fit_requests(sample_weight=False)
).fit(X, y, sample_weight)
# passes
jnothman commented 3 years ago

I am confused about the semantics of None.

FWIW, I don't think it needs to be represented by None (another sentinel will do), but it's the default behaviour when no request has been set for that key.
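
For instance, something like (one possible spelling, not part of the SLEP):

# module-level sentinel meaning "no request has been set for this key"
UNREQUESTED = object()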

I think fit_requests (as opposed to fit_request) is quite clear though

It's an improvement, but I think not so easy for users to remember; and generally methods are better when they start with verbs (although a method that returns self can sometimes start with a preposition, e.g. with_fit_params). But everything I come up with is quite a long name, such as request_for_fit...

thomasjpfan commented 3 years ago

request_for_fit is only 4 chars longer than fit_requests, which I think is okay. I updated the PR to use request_for_fit, and have make_scorer keep request_metadata (for now).
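
With the rename, the earlier example calls would read (a sketch of the updated, still-hypothetical API):

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().request_for_fit(sample_weight=True)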

The current approach was designed to place no additional implementation burden upon the consumer,

In the SLEP's current form, we are putting all the burden on the producer, which makes it a little harder to create meta-estimators. My sense is that there are more consumers out there than meta-estimators, so I think this is an okay trade-off.

jnothman commented 3 years ago

I continue to find "producer" confusing. The role of the meta-estimator here is to route or distribute or provide, not to produce.

It's not just a matter of meta-estimators in the wild being less common; it's also that developing a meta-estimator already has quite a high bar to entry, so the relative increase in burden is much smaller than if we told estimator developers they needed to jump through more hoops to support sample weights in a Pipeline. Non-compliant sample_weight implementations would also dissuade users from using Pipeline. So adding to the consumer implementation burden is, IMO, a bad idea.

thomasjpfan commented 3 years ago

I continue to find "producer" confusing. The role of the meta-estimator here is to route or distribute or provide, not to produce.

What do you think of "router"?

jnothman commented 3 years ago

I like "router" or "distributor". I had intended to use "router" earlier in the SLEP but I felt at some point that more terminology was not helpful. I appreciate your feedback that such terminology is helpful!

thomasjpfan commented 2 years ago

@glemaitre I need to update this SLEP to keep track of the developments in https://github.com/scikit-learn/scikit-learn/pull/21284