Closed: thomasjpfan closed this 2 years ago
Can you show me a case where we need sets in get_metadata_request?
This is what I had in mind:

    pipe = make_pipeline(
        Transformer().request_groups(fit='sensitive_attribute'),
        Predictor().request_sensitive_attribute(fit=True)
    )

However, this would be masked in get_metadata_request:

    pipe.get_metadata_request()["fit"]
    {'sensitive_attribute': 'sensitive_attribute'}
The set is only used in the routing, and we can handle routing in a way other than how it is done in my prototype. We can probably do without it, I think; not sure.
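For illustration, here is a minimal sketch (purely hypothetical, not the prototype's actual code) of how a set-valued merge keeps the information that the flat view above masks; merge_requests and the per-step request dicts are assumptions:

```python
# Hypothetical sketch of why a flat {alias: alias} view can mask information
# that a set-valued view keeps. Each step's request maps its own fit
# parameter -> the user-facing key it should be fed from:
transformer_request = {"groups": "sensitive_attribute"}
predictor_request = {"sensitive_attribute": "sensitive_attribute"}

def merge_requests(*step_requests):
    """Merge per-step requests, keyed by the user-facing key.

    The set value records every destination parameter fed from that key;
    a plain string value could only record one of them.
    """
    merged = {}
    for request in step_requests:
        for param, user_key in request.items():
            merged.setdefault(user_key, set()).add(param)
    return merged

merged = merge_requests(transformer_request, predictor_request)
# merged == {"sensitive_attribute": {"groups", "sensitive_attribute"}}
```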
est.fit_request(sample_weight='my_sw')
est.score_request(sample_weight='score_sw')
I think that this is a beautiful solution!!!
Would it be interesting to consider est.fit_request(sample_weight=True) as an equivalent of est.fit_request(sample_weight="sample_weight"), to follow the design principle of convention over configuration (we don't want to be the Zope of data science)?
Would it be interesting to consider est.fit_request(sample_weight=True) as an equivalent of est.fit_request(sample_weight="sample_weight")
For this SLEP, I think we are considering them equivalent.
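That equivalence could be sketched as a tiny normalization step (normalize_request is a hypothetical helper for illustration, not part of the SLEP):

```python
def normalize_request(**requests):
    """Hypothetical helper: treat True as "route under the parameter's own
    name", following convention over configuration."""
    return {param: (param if alias is True else alias)
            for param, alias in requests.items()}

normalize_request(sample_weight=True)     # {'sample_weight': 'sample_weight'}
normalize_request(sample_weight='my_sw')  # {'sample_weight': 'my_sw'}
```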
I am concerned about the difference between fit_request(sample_weight=False) and fit_request(sample_weight=None).

fit_request(sample_weight=False): producer should pass sample_weight to the consumer.
fit_request(sample_weight=None): producer may or may not pass sample_weight to the consumer.

I am still a little uncomfortable with the consumer not checking for the existence of the metadata. Was the following something that was discussed?
    log_reg_true = LogisticRegression().fit_request(sample_weight=True)
    # raises an error
    log_reg_true.fit(X, y)

    log_reg_false = LogisticRegression().fit_request(sample_weight=False)
    # raises an error
    log_reg_false.fit(X, y, sample_weight=sample_weight)

    log_reg_none = LogisticRegression().fit_request(sample_weight=None)
    # both versions are okay
    log_reg_none.fit(X, y, sample_weight=sample_weight)
    log_reg_none.fit(X, y)
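The check being asked about could be sketched roughly like this (check_request_consistency is a hypothetical name, not proposed API; True requires the metadata, False forbids it, None allows either):

```python
def check_request_consistency(request, fit_kwargs):
    """Hypothetical consumer-side check for the behaviour shown above."""
    for param, status in request.items():
        provided = param in fit_kwargs
        if status is True and not provided:
            raise ValueError(f"{param} was requested but not passed")
        if status is False and provided:
            raise ValueError(f"{param} was passed but not requested")
        # status None (or a string alias): nothing to check here

# Mirrors the "both versions are okay" case for None:
check_request_consistency({"sample_weight": None}, {})
check_request_consistency({"sample_weight": None}, {"sample_weight": [1.0]})
```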
@adrinjalali To make sure I understand correctly, is this the intended behavior:
    pipe = make_pipeline(
        Transformer().request_groups(fit='sensitive_attribute'),
        Predictor().request_sensitive_attribute(fit=True)
    )
    pipe.get_metadata_requests()["fit"]
    {"sensitive_attribute": {"sensitive_attribute", "groups"}}

    # This would tell a consumer to do this:
    pipe.fit(X, y, sensitive_attribute=metadata["sensitive_attribute"],
             groups=metadata["sensitive_attribute"])
This is kind of why I thought of reversing the representation:

    pipe.get_metadata_requests()["fit"]
    {"groups": "sensitive_attribute",
     "sensitive_attribute": "sensitive_attribute"}

The keys of this dictionary map directly to the kwargs in fit.
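A minimal sketch of why that is convenient: with the reversed mapping, a caller can build the kwargs for fit with one lookup per key (the data here is purely illustrative):

```python
# Reversed representation: {fit kwarg -> user-provided metadata key}
request = {"groups": "sensitive_attribute",
           "sensitive_attribute": "sensitive_attribute"}
metadata = {"sensitive_attribute": [0, 1, 0]}  # illustrative user metadata

# Build the kwargs for fit directly from the mapping:
fit_kwargs = {kwarg: metadata[source] for kwarg, source in request.items()}
# fit_kwargs == {"groups": [0, 1, 0], "sensitive_attribute": [0, 1, 0]}
```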
I was trying to simplify the story by having metaestimators be a consumer.

My intention was that a consumer is distinguished since it handles the interpretation of input metadata. This is in contrast with routers (note that LRCV is not a metaestimator but is responsible for routing to cv and scoring), whose role is to know how to pass their metadata on (and, in the case of CV routers, to know how to split that metadata, i.e. deal with its structure but not its interpretation). IMO this distinction is an important one. I'm not really into producer as a thing yet; usually the producer is the user, IMO, or it could be a Resampler. But my understanding of these things is apparently far from patent in the SLEP.
est.fit_request(sample_weight='my_sw')

I don't like using the name fit_* for a method that doesn't fit the model. I'd rather set_fit_request.
This is kind of why I thought of reversing the representation:

    pipe.get_metadata_requests()["fit"]
    {"groups": "sensitive_attribute", "sensitive_attribute": "sensitive_attribute"}

I like this. It's also good to support the case of erroring if a supported metadata element is passed but not requested.
In https://github.com/scikit-learn/enhancement_proposals/pull/55#issuecomment-830863662:

fit_request(sample_weight=False): producer should pass sample_weight to the consumer.

I think this is a typo and should say "should not pass".
I am still a little uncomfortable with the consumer not checking for the existence of the metadata.
The current approach was designed to place no additional implementation burden upon the consumer, beyond what would be facilitated, ideally, by some kind of mixin that would create the request_sample_weight method. Do we want to place that burden on estimator developers? Could we roll it into _validate_data?
We could make it optional for an estimator to check that its input is consistent with its request...?
This is kind of why I thought of reversing the representation:

    pipe.get_metadata_requests()["fit"]
    {"groups": "sensitive_attribute", "sensitive_attribute": "sensitive_attribute"}

The keys of this dictionary map directly to the kwargs in fit.

Yeah I'm fine with this.
I don't like using the name fit_* for a method that doesn't fit the model. I'd rather set_fit_request.

I think fit_requests (as opposed to fit_request) is quite clear though
I am concerned about the difference between fit_request(sample_weight=False) and fit_request(sample_weight=None).

fit_request(sample_weight=False): producer should pass sample_weight to the consumer.
fit_request(sample_weight=None): producer may or may not pass sample_weight to the consumer.

To correct this:

fit_request(sample_weight=False): producer will NOT pass sample_weight to the consumer, and will error if it is provided but not requested by any consumer.
fit_request(sample_weight=None): producer will NOT pass sample_weight to the consumer, and will error if the user provides sample_weight and another consumer requests it. The user should be explicit and set sample_weight to False here.

@adrinjalali I am confused with the semantics of None. Can you provide a code example where the behavior of False and None differs?
@adrinjalali I am confused with the semantics of None. Can you provide a code example where the behavior of False and None differs?
    # implicit None for sample_weight on SVC
    make_pipeline(
        StandardScaler().fit_requests(sample_weight=True),
        SVC()
    ).fit(X, y, sample_weight=sample_weight)
    # raises: sample_weight is provided but not explicitly set as required
    # or not for some of the estimators

    make_pipeline(
        StandardScaler().fit_requests(sample_weight=True),
        SVC().fit_requests(sample_weight=False)
    ).fit(X, y, sample_weight=sample_weight)
    # passes
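A rough sketch of the router-side logic these two examples imply (resolve_routing is a hypothetical name; a key missing from a step's request is treated as an implicit None):

```python
def resolve_routing(step_requests, provided_keys):
    """Hypothetical router-side resolution of True / False / None.

    step_requests: {step_name: {param: True | False | None}}; a missing
    param counts as an implicit None. Raises if any provided key meets an
    (implicit or explicit) None, mirroring the pipeline example above.
    """
    for key in provided_keys:
        for step, request in step_requests.items():
            if request.get(key) is None:
                raise ValueError(
                    f"{key} is provided but not explicitly set as "
                    f"required or not for {step}")
    # Route each provided key only to the steps that requested it:
    return {step: [k for k in provided_keys if request[k]]
            for step, request in step_requests.items()}

resolve_routing({"scaler": {"sample_weight": True},
                 "svc": {"sample_weight": False}}, ["sample_weight"])
# {'scaler': ['sample_weight'], 'svc': []}
```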
I am confused with the semantics of None.
FWIW, I don't think it needs to be represented by None (another sentinel will do), but it's the default behaviour when no request has been set for that key.
I think fit_requests (as opposed to fit_request) is quite clear though

It's an improvement, but I think not so easy for users to remember; and generally methods are better when they start with verbs (although a method that returns self can sometimes start with a preposition, e.g. with_fit_params). But everything I come up with is quite a long name, such as request_for_fit...
request_for_fit is only 4 chars longer than fit_requests, which I think is okay. I updated the PR to use request_for_fit, and have make_scorer keep request_metadata (for now).
The current approach was designed to place no additional implementation burden upon the consumer,
In the SLEP's current form, we are putting all the burden on the producer, which makes it a little harder to create meta-estimators. My sense is that there are more consumers out there than meta-estimators, so I think this is an okay trade-off.
I continue to find "producer" confusing. The role of the meta estimator here is to route or distribute or provide, not to produce.
It's not just a matter of meta-estimators in the wild being less common; it's also that developing a meta-estimator already has quite a high bar to entry, so the relative increase in burden is much smaller than if we told estimator developers they needed to jump through more hoops to support sample weights in a Pipeline. Non-compliant sample_weight implementations would also dissuade users from Pipeline. So adding to consumer implementation burden is, imo, a bad idea.
I continue to find "producer" confusing. The role of the meta estimator here is to route or distribute or provide, not to produce.
What do you think of "router"?
I like "router" or "distributor". I had intended to use "router" earlier in the SLEP but I felt at some point that more terminology was not helpful. I appreciate your feedback that such terminology is helpful!
@glemaitre I need to update this SLEP to keep track of the developments in https://github.com/scikit-learn/scikit-learn/pull/21284
Toward https://github.com/scikit-learn/enhancement_proposals/pull/52
Here is a rendered version of the changes.
The goal of this rewrite is to reduce the length of the SLEP while still capturing the points of the original version.
get_metadata_request and set_metadata_request.

@adrinjalali Can you show me a case where we need sets in get_metadata_request?

As noted in https://github.com/scikit-learn/enhancement_proposals/pull/52, I am unsure about having only cross_validate and cross_val_score accepting metadata. Everything else, such as GridSearchCV, will be using **kwargs.

CC @jnothman