scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP005: Resampler API #15

Open glemaitre opened 5 years ago

glemaitre commented 5 years ago

Here is the SLEP regarding the Outlier Rejection API

amueller commented 5 years ago

I was expecting "the fit_resample SLEP" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers.

Do you have a real-world application where this would help? ;)

GaelVaroquaux commented 5 years ago

Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself. It's the addition of the method fit_resample that calls for the SLEP.

glemaitre commented 5 years ago

OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the implementation of the pipeline with the limitations. @orausch made a good suggestion.

glemaitre commented 5 years ago

Do you have a real-world application where this would help? ;)

@amueller This is indeed a really good point which should be demonstrated in the outlier PR.

chkoar commented 5 years ago

Will resamplers be applied in transform? I got confused.

glemaitre commented 5 years ago

Will resamplers be applied in transform? I got confused.

Nope, resamplers will implement fit_resample and will not implement fit_transform or transform.
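
For illustration, a minimal resampler might look like this (the NaNRejector name and exact return value are illustrative assumptions, not the final API):

import numpy as np
from sklearn.base import BaseEstimator

class NaNRejector(BaseEstimator):
    """Illustrative resampler that rejects samples containing NaN."""

    def fit_resample(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        mask = ~np.isnan(X).any(axis=1)
        # Only fit_resample is defined; there is no transform or fit_transform.
        return X[mask], y[mask]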

chkoar commented 5 years ago

I wonder because of this https://github.com/scikit-learn/enhancement_proposals/pull/15/commits/4ecc51bc00f113f8be9b6c4aad8ee477e3c9a02c#diff-fa5683f6b9de871ce2af02905affbdaaR80

orausch commented 5 years ago

I wonder because of this 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80

Sorry, that is a mistake on my part. They should not be applied on transform.

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

chkoar commented 5 years ago

EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.

Yes, this makes sense.

orausch commented 5 years ago

Currently, my PR applies resamplers on transform (and so does imblearn). It seems that it is not trivial to change the behavior so that, for fit_transform, resamplers are only applied during the fit, and not during the transform. Currently, the pipeline is efficient in the sense that each transformer is only called once during a fit_transform call.

To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of what samples were added/removed by the resamplers in the pipeline.

If we want to make transform not apply resamplers, we would have to address this problem in the pipeline.

This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good use case? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for fit_predict).

glemaitre commented 5 years ago

So I updated the SLEP to focus more on the fit_resample discussion.

I realised (even if it is obvious) that outlier rejection is unsupervised while resampling for balancing is supervised (and for binary/multiclass classification) AFAIK. In the latter case, resamplers will need to validate the targets and define an API for driving the resampling (i.e. sampling_strategy in imblearn).

Should this API choice be discussed within the SLEP as well, or is it more specific to one type of resampler and can be handled later on?

glemaitre commented 5 years ago

Currently, my PR applies resamplers on transform (and so does imblearn).

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487

We are skipping the resampler during transform.

glemaitre commented 5 years ago

@orausch I am a bit confused with your comment.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit followed by transform, or fit_transform.

For transform on a pipeline, we will call transform of all estimators but the last, skipping the ones with fit_resample. Finally, we call transform of the last estimator.

Could you explain where exactly you think we are applying the resampling? Maybe I am missing something.
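
For illustration, the delegation being described might look roughly like this (a sketch only, assuming a Pipeline-like list of (name, estimator) steps; this is not the actual Pipeline code):

def pipeline_fit_transform(steps, X, y):
    # Fit path: resamplers reduce (X, y); other steps are fit-transformed.
    Xt, yt = X, y
    for name, est in steps[:-1]:
        if hasattr(est, "fit_resample"):
            Xt, yt = est.fit_resample(Xt, yt)
        else:
            Xt = est.fit_transform(Xt, yt)
    # The last step is fitted and then asked to transform (or fit_transform).
    return steps[-1][1].fit_transform(Xt, yt)

def pipeline_transform(steps, X):
    # Transform path: steps implementing fit_resample are skipped entirely.
    Xt = X
    for name, est in steps:
        if not hasattr(est, "fit_resample"):
            Xt = est.transform(Xt)
    return Xt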

orausch commented 5 years ago

We are skipping the resampler during transform.

Sorry, I missed that. I think there is a problem regardless.

When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit followed by transform, or fit_transform.

If we do this, we have inconsistent behavior between pipe.fit_transform(X, y) and pipe.fit(X, y).transform(X). For example consider:

X = [A, B, C] # A, B, C are feature vectors 
y = [a, b, c]
pipe = make_pipeline(removeB, mult2) # removeB is a resampler that will remove B, mult2 is a transformer

Then pipe.fit_transform(X, y) gives [2A, 2C], but pipe.fit(X, y).transform(X) gives [2A, 2B, 2C].

chkoar commented 5 years ago

At least in the imbalanced setting, you will usually have either a classifier or a resampler as the last step, not a transformer. I suppose the same happens in outlier rejection. In your example it makes sense to resample in the pipeline's transform as well, right? But what if you transform again with some X2? Does it make sense to resample X2 when you only need the transformation? It seems like you need to resample in transform only when you pass the same fitted data again, in order not to break the contract. Thoughts?

orausch commented 5 years ago

Some options to address this:

Or we can implement the behavior that, for fit_transform, we apply resamplers on fit but not on transform. Here I can also see two options.

Let me know if I missed something.

glemaitre commented 5 years ago

@amueller @jnothman @adrinjalali @GaelVaroquaux @agramfort Any thoughts about the issue raised by @orausch https://github.com/scikit-learn/enhancement_proposals/pull/15#issuecomment-468968032

adrinjalali commented 5 years ago

I don't think there's one solution which works for all use cases; here are two real-world examples (and I hope they're convincing):

  1. In the context of FAT-ML, assume the following pipeline:
    • resample to tackle class imbalance
    • mutate the data to make it more "fair" (which may or may not touch y)
    • usual transformers
    • an estimator

In the above pipeline, during fit, I'd like the first two steps to be on, and during predict, I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically skipped during predict.

  2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them. The pipeline would look like:
    • exclude outliers
    • usual transformers
    • predict faulty behavior

In contrast to the previous use case, I'd now like the first step to be always on, because I need to ignore those data points in my analysis.

Also, we should consider the fact that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two use cases, would justify a parameter like resample_on_transform for the pipeline, to tune how the pipeline treats the resamplers among its steps. I'm not sure if it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.
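
A sketch of what that could look like (resample_on_transform is only the parameter name floated above and does not exist in scikit-learn's Pipeline; OutlierRejector and RandomUnderSampler stand in for arbitrary resamplers, so this snippet is illustrative rather than runnable):

# Hypothetical API sketch only -- Pipeline does not accept resample_on_transform.
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

always_on = Pipeline([("reject", OutlierRejector())],
                     resample_on_transform=True)    # resample at fit and transform
fit_only = Pipeline([("balance", RandomUnderSampler())],
                    resample_on_transform=False)    # resample at fit time only
model = make_pipeline(always_on, fit_only, StandardScaler(), SVC())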

I haven't managed to completely think these different cases through, but I hope I could convey the idea.

amueller commented 5 years ago

I'm not sure I buy the second use case. Why is the outlier removal in the pipeline? I would assume it's some simple heuristic like too big a change from yesterday or a value way out of the range. I haven't heard anyone use outlier detection models in practice for something like that.

I guess you could argue that estimating the range should probably be done on the training set only in cross-validation. Though I would argue in this use case you want a human to review the ranges and not determine them automatically.

Actually, a use-case that I think is more realistic is a user tagging obvious outliers, so it's a supervised classification problem, and you use a classifier to reject outliers before they go to the main processing pipeline. I'm not sure if we want to support this use-case - in the end you could always just add another class to the main problem that is "outlier".

amueller commented 5 years ago

Given the controversies we had about inconsistencies between fit_transform and fit.transform, I feel like disallowing fit_transform on a pipeline might be an option. But that's kind of dangerous: some meta-estimators may instead call fit.transform assuming they are the same. Also it will create issues when nesting pipelines, though I don't see a reason why we would want to do that here (I can make up use cases but I'm fine with an implementation that doesn't allow those for now).

orausch commented 5 years ago

I'm not sure if we want to support this use-case - in the end you could always just add another class to the main problem that is "outlier".

Although sklearn already includes novelty detection models, it would be impossible to support this under the current API either way. This is because fit_resample is stateless (as in it fits and resamples on the same input data). Supporting this correctly would require a separate resample verb.

amueller commented 5 years ago

@orausch sorry I don't follow. The simple solution to the situation I described is just a multi-class problem and there is no resampling necessary.

orausch commented 5 years ago

Sorry, I was referring to implementing novelty rejection under the resampler API (usecase 2 of @adrinjalali's comment)

orausch commented 5 years ago

Let me rephrase to clarify:

2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them.

This is not possible under the proposed resampler API. fit_resample is currently stateless, in that it fits and resamples on the same data, and returns the resampled result. If I understand this use case correctly, this "novelty rejection" would require you to be able to separately fit the resampler on the train set, and call resample whenever you predict. Since we only have a fit_resample method, this is not possible.
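
To make the contrast concrete, here is a sketch of the kind of two-step estimator that use case would need (the separate fit/resample split is hypothetical, and is precisely what the proposed single fit_resample call does not offer):

import numpy as np

class HypotheticalNoveltyRejector:
    """Illustration only: learns what 'normal' looks like at fit time and can
    later filter any batch, which requires a resample step separate from fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.low_, self.high_ = np.percentile(X, [1, 99], axis=0)
        return self

    def resample(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mask = ((X >= self.low_) & (X <= self.high_)).all(axis=1)
        if y is None:
            return X[mask], None
        return X[mask], np.asarray(y)[mask]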

jnothman commented 5 years ago

I think the predictive maintenance example is one for hierarchical classification with outlier detectors, not merely for outlier removal during training.

I think the SLEP should go through the series of methods on a Pipeline and note the proposed behaviour and the implications for Resampler/Pipeline design.

jnothman commented 5 years ago

@orausch, @chkoar I think this deserves a little bit of ironing out, then some beefing up with examples. Can we see it progress?

jnothman commented 5 years ago

I am now wondering whether the constraints here are too confusing, and whether the usability gain of being able to do this with the existing Pipeline (and avoid some nesting) is worth much. Instead we could offer a specialised meta-estimator:

from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
from sklearn.utils.metaestimators import if_delegate_has_method


class ResampledTrainer(MetaEstimatorMixin, BaseEstimator):
    def __init__(self, resample, predictor):
        self.resample = resample
        self.predictor = predictor

    def fit(self, X, y=None, **kwargs):
        # Resample the training data (and sample props), then fit the
        # predictor on the resampled data only.
        X, y, kwargs = self.resample(X, y, **kwargs)
        self.predictor_ = clone(self.predictor).fit(X, y, **kwargs)
        return self

    @if_delegate_has_method(delegate='predictor_')
    def predict(self, X, **predict_params):
        return self.predictor_.predict(X, **predict_params)

    @if_delegate_has_method(delegate='predictor_')
    def predict_proba(self, X):
        return self.predictor_.predict_proba(X)

    @if_delegate_has_method(delegate='predictor_')
    def predict_log_proba(self, X):
        return self.predictor_.predict_log_proba(X)

    @if_delegate_has_method(delegate='predictor_')
    def decision_function(self, X):
        return self.predictor_.decision_function(X)

    @if_delegate_has_method(delegate='predictor_')
    def score(self, X, y=None):
        return self.predictor_.score(X, y)

    @property
    def fit_transform(self):
        # Fit on the resampled data, but transform the full input passed in.
        def fit_transform(X, y=None, **kwargs):
            self.fit(X, y, **kwargs)
            return self.predictor_.transform(X)
        return fit_transform

    @property
    def fit_predict(self):
        # Fit on the resampled data, but predict on the full input passed in.
        def fit_predict(X, y=None, **kwargs):
            self.fit(X, y, **kwargs)
            return self.predictor_.predict(X)
        return fit_predict

    @property
    def classes_(self):
        return self.predictor_.classes_

Here, resample would be a callable which takes (X, y, **kw) and returns (X, y, kw). This is not entirely unlike score_func in univariate feature selection, which is similarly called only at fit time. Note that such functions can be parametrised for use with set_params by defining set/get_params, for which I provide a prototype generic wrapper in https://gist.github.com/jnothman/ba46247a36d375136a6662d1b1ef4c6d. We presume that there is no interesting "model" or statistics stored by the resampler, which may make it harder to diagnose issues.
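
For instance, a resample callable matching that signature might look like this (a minimal sketch; the under-sampling strategy and the name undersample_majority are illustrative, not part of the proposal):

import numpy as np

def undersample_majority(X, y, random_state=0, **kw):
    """Randomly drop samples until every class is as frequent as the rarest one."""
    rng = np.random.RandomState(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    # Any sample-aligned props passed as kwargs must be sliced the same way.
    kw = {key: np.asarray(value)[keep] for key, value in kw.items()}
    return X[keep], y[keep], kw

It could then be plugged into the meta-estimator above, e.g. ResampledTrainer(undersample_majority, LogisticRegression()).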

jnothman commented 5 years ago

On the other hand we could require resample there to be an estimator defining fit_resample, which would make it easier for users to exploit the resampling capabilities of existing estimators, such as outlier detectors and KMeans for sample reduction. Basically, introducing ResampledTrainer would get around the confusion of adding this functionality to Pipeline.
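
Usage might then look roughly like this (assuming a ResampledTrainer variant that accepts an estimator and calls its fit_resample, and assuming IsolationForest gained a fit_resample that drops the samples it flags as outliers — neither exists today; X_train, y_train and X_test are placeholders):

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Hypothetical: relies on IsolationForest growing a fit_resample method that
# returns (X, y) with the detected outliers removed.
clf = ResampledTrainer(IsolationForest(), LogisticRegression())
clf.fit(X_train, y_train)    # the predictor is fitted on the inlier subset only
pred = clf.predict(X_test)   # prediction sees all of X_test; no resampling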

jnothman commented 5 years ago

I think this has a chance of making it into 0.22 unlike slep006... I'm happy to take over its drafting if @glemaitre, @chkoar, @orausch would like me to.

orausch commented 5 years ago

I have time to help out with this.

I think this deserves a little bit of ironing out, then some beefing up with examples.

What kinds of examples are you looking for?

jnothman commented 5 years ago

Let's start with making the description of behaviour in a Pipeline (if we don't just want to start with a specialised ResampledTrainer) more precise, and describe the return value of fit_resample, and then we can review if examples and use cases would help make those things more concrete.

chkoar commented 5 years ago

Let's start with making the description of behaviour in a Pipeline (if we don't just want to start with a specialised ResampledTrainer) more precise, and describe the return value of fit_resample

I believe that this part is the most crucial in the whole process, since everyone has a different opinion and different expectations of the behaviour, so let's agree on this.

jnothman commented 5 years ago

I think a ResampledTrainer means we can put off modifying Pipeline until some later stage. Otherwise, if we want a table of Pipeline methods, it deserves more detail, describing how the Pipeline delegates to estimators in the head (which here I'm defining as "all but the last step") and tail:

For example: when calling pipe.fit_transform(X, y)

Note this disregards kwargs so far. It assumes the Pipeline starts with Xt, yt = X, y.

We can also note a requirement that fit_resample not change the order or meaning of features.

jnothman commented 5 years ago

But I'm really leaning away from modifying Pipeline in this iteration. Defining the fit_resample verb is enough for the SLEP, giving examples in outlier detectors, KMeans and a new Resample (or SimpleResampler) estimator, alongside an implementation of sklearn.compose.ResampledTrainer
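
To illustrate the KMeans-for-sample-reduction idea, fit_resample might hypothetically behave like this (KMeans has no such method today; the majority-vote labelling and the KMeansReducer name are assumptions):

import numpy as np
from sklearn.cluster import KMeans

class KMeansReducer(KMeans):
    """Hypothetical sketch: replace the training set by the cluster centres,
    each labelled with the majority class of its cluster (integer class labels
    and non-empty clusters assumed)."""

    def fit_resample(self, X, y):
        labels = self.fit_predict(X)
        y = np.asarray(y)
        centres = self.cluster_centers_
        y_res = np.array([
            np.bincount(y[labels == k]).argmax() for k in range(self.n_clusters)
        ])
        return centres, y_res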

glemaitre commented 5 years ago

But I'm really leaning away from modifying Pipeline in this iteration

I really like the idea of the meta-estimator for training. The changes in the Pipeline are not trivial and it might be wise to have a meta-estimator specialized for the training use case. @ogrisel was proposing exactly this implementation this morning before being aware of @jnothman's proposal.

@glemaitre, @chkoar, @orausch would like me to.

I would like to dedicate some time to it, so we can team up. Let me know if you want me to draft, review, or address some comments.

ogrisel commented 5 years ago

For the imbalanced classification problem I was about to suggest a solution very similar to @jnothman's suggestion from https://github.com/scikit-learn/enhancement_proposals/pull/15#issuecomment-504797660 but that would further implement baseline resampling strategies based on relative class frequencies in y (e.g. under-sampling of the majority classes or over-sampling of the minority classes).

I would be very much in favor of exploring the meta-estimator approach before adding complexity to the estimator API and the pipeline implementation.

ogrisel commented 5 years ago

Similarly, BaggingClassifier could also be slightly improved to deal with imbalanced classes by resampling based on y frequencies according to some common baseline strategies.

I think this would give us a very good generic solution to tackle the imbalanced classification problem.

jnothman commented 5 years ago

For rebalancing based on relative class frequencies in y, we still have

1454...

jnothman commented 5 years ago

And I don't think it's necessary for the slep.

ogrisel commented 5 years ago

And I don't think it's necessary for the slep.

My point is: do we really need a SLEP to add a new fit_resample public API that impacts the pipeline if the main use cases (fitting models with various resampling strategies at train time only, e.g. for imbalanced classes) are already addressed by a bunch of meta-estimators?

adrinjalali commented 5 years ago

My point is: do we really need a SLEP to add a new fit_resample public API that impacts the pipeline if the main use cases (fitting models with various resampling strategies at train time only, e.g. for imbalanced classes) are already addressed by a bunch of meta-estimators?

I think that means the transformers work on the whole data, which may not be what one wants. We may want to apply the resampler before the other transformers.

jnothman commented 5 years ago

Ummm ... I think the introduction of fit_resample could be regarded as a change to API principles. I think it would be wise to slep it and I don't think there will be much difficulty reaching consensus.

jnothman commented 5 years ago

Not sure what you mean, Adrin. I think the ResampledTrainer is pretty straightforward in that it can be a transformer but will transform the entire input with a model trained on resampled data. Its estimator can be a pipeline too.

adrinjalali commented 5 years ago

Its estimator can be a pipeline too.

True.

jnothman commented 5 years ago

A further decision point is whether a resampler must provide fit

amueller commented 5 years ago

I think I agree with @ogrisel about exploring the meta-estimator approach. I was against doing the API extension for a while, but then kind of changed my mind and started using imbalanced-learn in class. Now I'm not entirely sure any more. I think we really should do a side-by-side comparison of what the user code would look like with either approach.

One thing that would potentially be quite ugly is if the sampling is somewhere in the middle of the pipeline, because then it would be

make_pipeline(..., ResamplingMeta(make_pipeline(...)))

which would make it hard to iterate over the steps. I'm not sure how common the use-case is, though. I guess you want to scale the data before applying Birch?

make_pipeline(StandardScaler(),
              ResamplingMeta(make_pipeline(SelectKBest(), SVC()), resample=Birch()))

is harder to work with than

make_pipeline(StandardScaler(), Birch(), SelectKBest(), SVC())

Is that a realistic use-case, though? Doing

make_pipeline(StandardScaler(), SelectKBest(),
              ResamplingMeta(SVC(), resample=Birch()))

might be sufficient? And in many cases the resampling doesn't require preprocessing so you could do

ResamplingMeta(make_pipeline(StandardScaler(), SelectKBest(), SVC()),
               resample='minority')

which is not that bad either.

Do we want to apply the 80/20 rule?

jnothman commented 5 years ago

I agree that there could be usability benefits to enabling this in a Pipeline, but it requires relatively messy code and hard-to-explain semantics e.g. in the case of fit_transform. In a Pipeline, a resampler looks like a transformer but behaves unlike one. We also haven't resolved the handling of sample props output from fit_resample in a Pipeline.

A meta-estimator is unambiguous. It is also expedient, and we could transition to a Pipeline solution in the future once we have a clear and functional fit_resample API. At this point, the specialised meta-estimator is my preferred approach towards releasing this functionality.

orausch commented 5 years ago

Another advantage of the Pipeline approach is easy chaining of resamplers. Here's a common usecase I can imagine:

clf = ResampledTrainer(
    NaNRejector(), # removes samples containing NaN
    ResampledTrainer(RandomUnderSampler(),
        make_pipeline(StandardScaler(), SGDClassifier()))
)

vs

clf = make_pipeline(NaNRejector(), RandomUnderSampler(), StandardScaler(), SGDClassifier())

jnothman commented 5 years ago

I don't think chaining resamplers is such a common use-case. I don't really think there's any question that it's easier to write code with a Pipeline. But it creates some maintenance and documentation headaches, and some ambiguities (the scope of where you use the resampled data and where you use the original data is a bit unclear).

Btw, @orausch, I wasn't expecting you to merge my PR to your branch quite so readily, for what it's worth. I was proposing what it would look like as an alternative.

We could go with @amueller's suggestion to consider both paths. But I'm still tempted to take the expedient approach of clearly defining fit_resample and providing a meta-estimator until we can make Pipeline work in a satisfying way.

glemaitre commented 5 years ago

IMO, we are facing the same complexity as with the stacker. Having the meta-estimator will already answer the common need and even complex ones (with some cumbersome code), and we don't constrain or change any API rule, which would need time to converge.

We should probably frame these API changes, but I think that having the meta-estimator first will be best.