Closed jnothman closed 4 years ago
This PR is filling in several TODOs in SLEP006, with the aim of having it ready for voting in Sept 2020, in conjunction with refinements to https://github.com/scikit-learn/scikit-learn/pull/16079.
Do we merge this so that the full SLEP is rendered on rtd?
Sure, it's easier to follow up on the rest of the issues if we have this up there anyway.
Thanks for doing this SLEP and proposing associated implementations! It's certainly a challenging topic and I don't have good visibility on all the implications/constraints. My comments are mostly on usability with pipelines as outlined in https://github.com/scikit-learn/scikit-learn/issues/18159, in particular:
```python
pipe = make_pipeline(
    StandardScaler().request_sample_weight(fit='sample_weight'),
    LogisticRegression().request_sample_weight(fit='sample_weight')
)
```
which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights. And since one could argue that sample weights should be passed to all estimators (https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-674971708), the unfortunate reason for this verbosity is mostly to preserve backward compatibility, as opposed to really needing fine-grained control of where sample weights are sent at fit time.
```python
est.fit(X, y, prop={'groups': groups, 'weight': weight})
```
goes kind of against the idea of a common scikit-learn API for estimators, as this would create two types of objects: pipeline-like/meta-estimator objects and the rest. Some would be fitted with `est.fit(X, y, prop={'weight': weight})` and others with `est.fit(X, y, sample_weight=weight)`, which goes against the idea that, say, `StandardScaler() + Ridge()` will behave the same as `Ridge(normalize=True)`, or that `CountVectorizer() + TfidfTransformer()` = `TfidfVectorizer()`. Granted, these don't support `sample_weight` at the moment, but they very well could.
I would be interested in exploring a mixture of the options proposed in https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-684863020 and option 4. For sample weights, that would be:
- Pass `sample_weight` to Pipeline, ColumnTransformer and FeatureUnion; pass it on to all estimators, and fail otherwise.
- Use `StandardScaler().set_estimator_tag(fit_sample_weight=None)` to allow customization if needed.

So for sample weight that would be a little similar to option 4, but less verbose and with a reasonable default. I'm not yet sure how this would generalize to other sample props; maybe keeping a `props` argument for the others is fine.
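A minimal sketch of that default behaviour, with hypothetical `ToyPipeline`/`ToyEstimator` classes and a `fit_sample_weight` tag (none of these names are actual scikit-learn API): the pipeline forwards `sample_weight` to every step by default, fails if a step cannot accept it, and a per-step tag opts a step out.

```python
# Hypothetical sketch, not scikit-learn API: a pipeline that forwards
# sample_weight to every step by default, fails if a step cannot accept
# it, and lets a per-estimator tag opt a step out of the routing.

class ToyEstimator:
    def __init__(self, supports_sample_weight=True):
        self.supports_sample_weight = supports_sample_weight
        self.fit_sample_weight = True  # tag: receive weights by default

    def fit(self, X, y, sample_weight=None):
        if sample_weight is not None and not self.supports_sample_weight:
            raise TypeError(
                f"{type(self).__name__} got an unexpected sample_weight")
        self.received_weight_ = sample_weight
        return self


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y, sample_weight=None):
        for est in self.steps:
            if sample_weight is not None and est.fit_sample_weight:
                # Routed by default; raises if the step lacks support.
                est.fit(X, y, sample_weight=sample_weight)
            else:
                est.fit(X, y)
        return self
```

With this default, `ToyPipeline([scaler, clf]).fit(X, y, sample_weight=w)` needs no per-step request, and only the opt-out case requires touching an individual estimator.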
https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-675176012

> Why treat sample_weight specially?
At least for fit, estimators/pipelines that support sample weights need to verify invariance properties (the `check_sample_weights_invariance(..., kind="zeros")` common check), so that should constrain where they need to be passed in a pipeline. Also, sample weights are probably the most common sample prop, and it would be nice to be able to use them without much added code.
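As a minimal illustration of the `kind="zeros"` invariance the common check verifies (shown here on a toy weighted mean, not an actual estimator): giving a sample zero weight must be equivalent to dropping it.

```python
# Toy illustration of the kind="zeros" sample-weight invariance: an
# estimator supporting sample_weight must treat a zero-weighted sample
# exactly as if it were absent from the data.

def weighted_mean(values, weights):
    total = sum(w * v for v, w in zip(values, weights))
    return total / sum(weights)


values = [1.0, 2.0, 10.0]
weights = [1.0, 1.0, 0.0]  # last sample has zero weight

# Must equal the unweighted mean of the first two samples only.
assert weighted_mean(values, weights) == weighted_mean(values[:2], [1.0, 1.0])
```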
Another option for the verbosity issue is to have a utility function, something like `propagate_metadata(pipeline, 'all', 'sample_weights')`.
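A rough sketch of what such a helper could look like, using toy `Pipe`/`Step` stand-ins and an assumed per-estimator `request_metadata` method (none of these names exist in scikit-learn):

```python
# Hypothetical sketch of a propagate_metadata helper: walk a pipeline's
# named steps and record a metadata request on each selected one.
# Pipe, Step and request_metadata are toy stand-ins, not scikit-learn API.

class Step:
    def __init__(self):
        self.requested = set()

    def request_metadata(self, key):
        self.requested.add(key)
        return self


class Pipe:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs


def propagate_metadata(pipeline, which, key):
    """Request metadata `key` on the selected steps ('all' or a set of names)."""
    for name, est in pipeline.steps:
        if which == 'all' or name in which:
            est.request_metadata(key)
    return pipeline
```

A single call such as `propagate_metadata(pipe, 'all', 'sample_weight')` would then replace one `request_sample_weight` call per step.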
> Some will be fitted with `est.fit(X, y, prop={'weight': weight})` and others with `est.fit(X, y, sample_weight=weight)`
No, I'd see this as choosing to deprecate `est.fit(X, y, sample_weight=weight)` support, and always use `prop`.
> ```python
> pipe = make_pipeline(
>     StandardScaler().request_sample_weight(fit='sample_weight'),
>     LogisticRegression().request_sample_weight(fit='sample_weight')
> )
> ```
> which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights.
I'd now see the more likely syntax as `StandardScaler().request_sample_weight(fit=True)` or `StandardScaler(request_sample_weight=True)`.
I appreciate that this could be more usable still. Making `sample_weight` requested by default only works, as far as I'm concerned (see https://github.com/scikit-learn/enhancement_proposals/pull/43#discussion_r476466176), if every estimator (and perhaps scorer) requests `sample_weight` and then raises an error when it does not actually support it: the user would need to explicitly turn it off every time `sample_weight` is used and some component does not support it. That does not seem a reasonable approach for other props at this point.
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html