scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rational for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
47 stars 34 forks source link

Completing sample props SLEP006 #43

Closed jnothman closed 4 years ago

jnothman commented 4 years ago

https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html

jnothman commented 4 years ago

This PR is filling in several TODOs in SLEP006, with an idea to having it ready for voting in Sept 2020, in conjunction with refinements to https://github.com/scikit-learn/scikit-learn/pull/16079.

jnothman commented 4 years ago

Do we merge this so that the full SLEP is rendered on rtd?

adrinjalali commented 4 years ago

Sure, it's easier to follow up on the rest of the issues if we have this up there anyway.

rth commented 4 years ago

Thanks for doing this SLEP and proposing associated implementations! It's certainly a challenging topic and I don't have a good visibility on all the implications/constraints. My comments are mostly on usability with pipelines as outlined in https://github.com/scikit-learn/scikit-learn/issues/18159, in particular,

I would interested in exploring a mixture of option proposed in https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-684863020 and option 4. For sample weight that would be,

So for sample weight that would be a little similar to option 4 but less verbose and with reasonable default. I'm not yet sure how this would generalize to other sample props, maybe keeping a props argument for others is fine.

https://github.com/scikit-learn/scikit-learn/issues/18159#issuecomment-675176012 Why treat sample_weight specially?

At least for fit, estimators / pipelines that support sample weight need to verify invariance properties (check_sample_weights_invariance(..., kind="zeros") common check). So that should constrain where they need to be passed in a pipeline. Also they are probably the most common sample props, and it would be nice to be able to use them without much added code.

adrinjalali commented 4 years ago

Another option for the verbosity issue is to have a utility function to say something like propagate_metadata(pipeline, 'all', 'sample_weights')

jnothman commented 4 years ago

Some will be fitted with est.fit(X, y, prop={'weight': weight}) and others with est.fit(X, y, sample_weight=True)

No, I'd see this as choosing to deprecate est.fit(X, y, sample_weight=True) support, and always use prop.

pipe = make_pipeline( StandardScaler().request_sample_weight(fit='sample_weight'), LogisticRegression().request_sample_weight(fit='sample_weight') ) which can become quite verbose for complex pipelines, particularly assuming that more and more estimator are going to support sample weight.

I'd now see more likely syntax as StandardScaler().request_sample_weight(fit=True) or StandardScaler(request_sample_weight=True).

I appreciate that this could be more usable still. Making sample_weight requested by default only works, as far as I'm concerned (see https://github.com/scikit-learn/enhancement_proposals/pull/43#discussion_r476466176) if every every estimator (and perhaps scorer) requests sample_weight and then raises an error if it does not actually support it: the user needs to explicitly turn it off every time sample_weight is used and some component does not support it. It does not seem a reasonable approach for other props at this point.