Closed jnothman closed 4 years ago
This PR is filling in several TODOs in SLEP006, with the aim of having it ready for voting in Sept 2020, in conjunction with refinements to https://github.com/scikit-learn/scikit-learn/pull/16079.
Do we merge this so that the full SLEP is rendered on rtd?
Sure, it's easier to follow up on the rest of the issues if we have this up there anyway.
Thanks for doing this SLEP and proposing associated implementations! It's certainly a challenging topic and I don't have good visibility on all the implications/constraints. My comments are mostly on usability with pipelines as outlined in https://github.com/scikit-learn/scikit-learn/issues/18159, in particular:
```python
pipe = make_pipeline(
    StandardScaler().request_sample_weight(fit='sample_weight'),
    LogisticRegression().request_sample_weight(fit='sample_weight')
)
```
which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights. And since one could argue that sample weights should be passed to all estimators (https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-674971708), the unfortunate reason for this verbosity is mostly to preserve backward compatibility, as opposed to really needing fine-grained control of where sample weights are sent at fit time.
```python
est.fit(X, y, prop={'groups': groups, 'weight': weight})
```
goes kind of against the idea of a common scikit-learn API for estimators, as this would create two types of objects: pipeline-like/meta-estimator objects and the rest. Some would be fitted with `est.fit(X, y, prop={'weight': weight})` and others with `est.fit(X, y, sample_weight=weight)`, which goes against the idea that, say, `StandardScaler() + Ridge()` will behave the same as `Ridge(normalize=True)`, or that `CountVectorizer() + TfidfTransformer()` = `TfidfVectorizer()`. Granted, these don't support `sample_weight` at the moment, but they very well could.
I would be interested in exploring a mixture of the options proposed in https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-684863020 and option 4. For sample weights, that would be:
- Pass `sample_weight` to Pipeline, ColumnTransformer and FeatureUnion; pass it on to all estimators, and fail otherwise.
- Use `StandardScaler().set_estimator_tag(fit_sample_weight=None)` to allow customization if needed.

So for sample weight that would be a little similar to option 4, but less verbose and with a reasonable default. I'm not yet sure how this would generalize to other sample props; maybe keeping a `props` argument for the others is fine.
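A minimal sketch of that default behaviour, with hypothetical `ToyPipeline`/`ToyEstimator` classes and a `fit_sample_weight` tag (none of these names are actual scikit-learn API): the pipeline forwards `sample_weight` to every step by default, fails if a step cannot accept it, and a per-step tag opts a step out.

```python
# Hypothetical sketch, not scikit-learn API: a pipeline that forwards
# sample_weight to every step by default, fails if a step cannot accept
# it, and lets a per-estimator tag opt a step out of the routing.

class ToyEstimator:
    def __init__(self, supports_sample_weight=True):
        self.supports_sample_weight = supports_sample_weight
        self.fit_sample_weight = True  # tag: receive weights by default

    def fit(self, X, y, sample_weight=None):
        if sample_weight is not None and not self.supports_sample_weight:
            raise TypeError(
                f"{type(self).__name__} got an unexpected sample_weight")
        self.received_weight_ = sample_weight
        return self


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y, sample_weight=None):
        for est in self.steps:
            if sample_weight is not None and est.fit_sample_weight:
                # Routed by default; raises if the step lacks support.
                est.fit(X, y, sample_weight=sample_weight)
            else:
                est.fit(X, y)
        return self
```

With this default, `ToyPipeline([scaler, clf]).fit(X, y, sample_weight=w)` needs no per-step request, and only the opt-out case requires touching an individual estimator.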
https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-675176012

> Why treat sample_weight specially?
At least for fit, estimators/pipelines that support sample weights need to verify invariance properties (the `check_sample_weights_invariance(..., kind="zeros")` common check), so that should constrain where they need to be passed in a pipeline. Also, sample weights are probably the most common sample prop, and it would be nice to be able to use them without much added code.
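As a minimal illustration of the `kind="zeros"` invariance the common check verifies (shown here on a toy weighted mean, not an actual estimator): giving a sample zero weight must be equivalent to dropping it.

```python
# Toy illustration of the kind="zeros" sample-weight invariance: an
# estimator supporting sample_weight must treat a zero-weighted sample
# exactly as if it were absent from the data.

def weighted_mean(values, weights):
    total = sum(w * v for v, w in zip(values, weights))
    return total / sum(weights)


values = [1.0, 2.0, 10.0]
weights = [1.0, 1.0, 0.0]  # last sample has zero weight

# Must equal the unweighted mean of the first two samples only.
assert weighted_mean(values, weights) == weighted_mean(values[:2], [1.0, 1.0])
```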
Another option for the verbosity issue is to have a utility function, something like `propagate_metadata(pipeline, 'all', 'sample_weights')`.
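A rough sketch of what such a helper could look like, using toy `Pipe`/`Step` stand-ins and an assumed per-estimator `request_metadata` method (none of these names exist in scikit-learn):

```python
# Hypothetical sketch of a propagate_metadata helper: walk a pipeline's
# named steps and record a metadata request on each selected one.
# Pipe, Step and request_metadata are toy stand-ins, not scikit-learn API.

class Step:
    def __init__(self):
        self.requested = set()

    def request_metadata(self, key):
        self.requested.add(key)
        return self


class Pipe:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs


def propagate_metadata(pipeline, which, key):
    """Request metadata `key` on the selected steps ('all' or a set of names)."""
    for name, est in pipeline.steps:
        if which == 'all' or name in which:
            est.request_metadata(key)
    return pipeline
```

A single call such as `propagate_metadata(pipe, 'all', 'sample_weight')` would then replace one `request_sample_weight` call per step.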
> Some will be fitted with `est.fit(X, y, prop={'weight': weight})` and others with `est.fit(X, y, sample_weight=weight)`
No, I'd see this as choosing to deprecate `est.fit(X, y, sample_weight=weight)` support, and always use `prop`.
> ```python
> pipe = make_pipeline(
>     StandardScaler().request_sample_weight(fit='sample_weight'),
>     LogisticRegression().request_sample_weight(fit='sample_weight')
> )
> ```
> which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights.
I'd now see the more likely syntax as `StandardScaler().request_sample_weight(fit=True)` or `StandardScaler(request_sample_weight=True)`.
I appreciate that this could be more usable still. Making `sample_weight` requested by default only works, as far as I'm concerned (see https://github.com/scikit-learn/enhancement_proposals/pull/43#discussion_r476466176), if every estimator (and perhaps scorer) requests `sample_weight` and then raises an error when it does not actually support it: the user would need to explicitly turn it off every time `sample_weight` is used and some component does not support it. That does not seem a reasonable approach for other props at this point.
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html