Open glemaitre opened 5 years ago
I was expecting "the fit_resample SLEP" to be about imbalanced data. I think subsampling is a much more common issue than trying to automatically remove outliers.
Do you have a real-world application where this would help? ;)
Indeed, as I was telling Guillaume IRL, I think that this SLEP should be about fit_resample rather than outlier rejection, and outlier rejection should be listed as an application. I am not convinced that proposing a new mixin calls for a SLEP in itself; it's the addition of the fit_resample method that calls for the SLEP.
OK, so I will modify the SLEP to make it about resamplers. Basically, I should keep the part about the pipeline implementation and its limitations. @orausch made a good suggestion.
Do you have a real-world application where this would help? ;)
@amueller This is indeed a really good point which should be demonstrated in the outlier PR.
Will resamplers be applied in transform? I got confused.
Nope, resamplers will implement fit_resample and will not implement fit_transform or transform.
I was wondering because of this: 4ecc51b#diff-fa5683f6b9de871ce2af02905affbdaaR80
Sorry, that is a mistake on my part. They should not be applied on transform.
EDIT: and accordingly, during fit_transform, resamplers are only applied during the fit, and not the transform.
Yes, this makes sense.
Currently, my PR applies resamplers on transform (and so does imblearn). It seems that it is not trivial to change the behavior so that, for fit_transform, resamplers are only applied during the fit, and not during the transform. Currently, the pipeline is efficient in the sense that each transformer is only called once during a fit_transform call.
To get the behavior described above, a naive implementation would have to call each transformer after the first resampler twice: once for the fit path, where we apply resamplers, and once for the transform path, where we don't. It seems to me that, in order to do it efficiently, we would need some knowledge of what samples were added/removed by the resamplers in the pipeline.
If we want to make transform not apply resamplers, we would have to address this problem in the pipeline.
This brings me to the next thought: does it even make sense to have resamplers in a transformer-only pipeline? Is there a good usecase? One choice would be to simply disallow this behavior (similarly to how we disallow resamplers for fit_predict).
So I updated the SLEP toward a discussion more focused on fit_resample.
I realised (even if it is obvious) that outlier rejection is unsupervised while resampling for balancing is supervised (and for binary/multiclass classification), AFAIK. In the latter case, the resampler will need to validate the targets and define an API for driving the resampling (i.e. sampling_strategy in imblearn).
Should this API choice be discussed within the SLEP as well, or is it more specific to one type of resampler and can be handled later on?
Currently, my PR applies resamplers on transform (and so does imblearn).
https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py#L487
We are skipping the resampler during transform.
@orausch I am a bit confused with your comment.
When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit.transform or fit_transform.
For transform on a pipeline, we will call transform of all estimators but the last, skipping the ones with fit_resample. Finally, we call transform of the last estimator.
Could you explain where you think we are applying the resampling exactly? Maybe I am missing something.
We are skipping the resampler during transform.
Sorry, I missed that. I think there is a problem regardless.
When calling fit_transform on a pipeline, we will call fit_transform or fit_resample of all estimators but the last. The last will then call fit.transform or fit_transform.
If we do this, we have inconsistent behavior between pipe.fit_transform(X, y) and pipe.fit(X, y).transform(X). For example, consider:
X = [A, B, C] # A, B, C are feature vectors
y = [a, b, c]
pipe = make_pipeline(removeB, mult2) # removeB is a resampler that will remove B, mult2 is a transformer
Then pipe.fit_transform(X, y) gives [2A, 2C], but pipe.fit(X, y).transform(X) gives [2A, 2B, 2C].
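For concreteness, here is a minimal NumPy sketch of the two call paths, with plain functions standing in for the hypothetical removeB resampler and mult2 transformer (none of this is existing scikit-learn API):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])  # rows play the role of A, B, C
y = np.array([0, 1, 0])

def fit_resample(X, y):
    keep = np.array([True, False, True])  # removeB: drop the second sample
    return X[keep], y[keep]

def transform(X):
    return 2 * X  # mult2

# pipe.fit_transform(X, y): resampler applied before the transformer
Xr, yr = fit_resample(X, y)
print(transform(Xr))  # [[2.], [6.]]         -> 2A, 2C

# pipe.fit(X, y).transform(X): resampler skipped on the transform path
print(transform(X))   # [[2.], [4.], [6.]]   -> 2A, 2B, 2C
```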
At least in the imbalanced-setting use case, you will usually have either a classifier or a resampler at the last step, not a transformer. I suppose the same happens in outlier rejection. In your example it makes sense to resample also in the pipeline's transform, right? But what if you transform again with an X2? Does it make sense to resample on X2 when you only need the transformation? It seems like you need to resample in transform only when you will pass the same fitted data again, in order not to break the contract. Thoughts?
Some options to address this:

- Disallow fit_transform on a pipeline containing resamplers (like we do with fit_predict).
- Accept that transform applies resamplers (then the approach from imblearn works).
- Or we can implement the behavior that, for fit_transform, we apply resamplers on fit, but not on transform. Here I can also see two options:
  - do it inefficiently (call fit and then transform);
  - make return_idx an optional kwarg to fit_resample (see the sketch below). If the resampler supports returning idx, we do it efficiently; if not, we call fit, then transform.

Let me know if I missed something.
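To illustrate the return_idx idea from the last option (purely a hypothetical kwarg, not an agreed API), a toy under-sampler might look like this:

```python
import numpy as np

class ToyRandomUnderSampler:
    """Toy sketch only; return_idx is a hypothetical kwarg, not an agreed API."""

    def fit_resample(self, X, y, return_idx=False):
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        rng = np.random.default_rng(0)
        # keep n_min samples per class
        idx = np.sort(np.concatenate([
            rng.choice(np.flatnonzero(y == c), n_min, replace=False)
            for c in classes
        ]))
        if return_idx:
            return X[idx], y[idx], idx
        return X[idx], y[idx]

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
Xr, yr, idx = ToyRandomUnderSampler().fit_resample(X, y, return_idx=True)
# idx records which original rows survived, which is the knowledge a pipeline
# would need in order to reuse fitted downstream transformers on the
# transform path without rerunning them.
```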
@amueller @jnothman @adrinjalali @GaelVaroquaux @agramfort Any thoughts about the issue raised by @orausch https://github.com/scikit-learn/enhancement_proposals/pull/15#issuecomment-468968032
I don't think there's one solution which works for all usecases, here are two real world examples (and I hope they're convincing):

1. In the context of FAT-ML, assume the following pipeline:

- resample to tackle class imbalance
- mutate the data for the purpose of a more "fair" data (which may or may not touch y)
- usual transformers
- an estimator

In the above pipeline, during fit, I'd like the first two steps to be on, and during predict, I'd like them to be off, which we can do if the first two steps are resamplers and they're automatically out during the predict.

2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them. The pipeline would look like:

- exclude outliers
- usual transformers
- predict faulty behavior

In contrast to the previous usecase, now I'd like the first step to be always on, cause I need to ignore those data points from my analysis.

Also, we should consider the fact that once we have this third type of model, we'd have at least three types of pipelines, i.e. estimator, transformer, and resampler pipelines. I feel like this fact, plus the above two usecases, would justify a parameter like resample_on_transform for the pipeline, to tune the behavior of the pipeline regarding these resamplers as its steps. I'm not sure if it completely solves our issues, but it may.

For instance, if the user wants to mix these behaviors, they can put different resamplers into different pipelines, set the resample_on_transform of each pipeline appropriately, and include those pipelines in their main pipeline.

I haven't managed to completely think these different cases through, but I hope I could convey the idea.
I'm not sure I buy the second use case. Why is the outlier removal in the pipeline? I would assume it's some simple heuristic like too big a change from yesterday or a value way out of the range. I haven't heard anyone use outlier detection models in practice for something like that.
I guess you could argue that estimating the range should probably be done on the training set only in cross-validation. Though I would argue in this use case you want a human to review the ranges and not determine them automatically.
Actually a use-case that I think is more realistic is a user tagging obvious outliers so it's a supervised classification problem and you use a classifier to reject outliers before they go to the main processing pipeline. I'm not sure if we want to support this use-case - in the end you could always just add another class to the main problem that is "outlier".
Given the controversies we had about inconsistencies between fit_transform and fit.transform, I feel like disallowing fit_transform on a pipeline might be an option. But that's kind of dangerous: some meta-estimators may instead call fit.transform assuming they are the same. Also it will create issues when nesting pipelines, though I don't see a reason why we would want to do that here (I can make up usecases but I'm fine with an implementation that doesn't allow those for now).
I'm not sure if we want to support this use-case - in the end you could always just add another class to the main problem that is "outlier".
Although sklearn already includes novelty detection models, it would be impossible to support this under the current API either way. This is because fit_resample is stateless (as in it fits and resamples on the same input data). Supporting this correctly would require a separate resample verb.
@orausch sorry I don't follow. The simple solution to the situation I described is just a multi-class problem and there is no resampling necessary.
Sorry, I was referring to implementing novelty rejection under the resampler API (usecase 2 of @adrinjalali's comment)
Let me rephrase to clarify:
2. I get periodic data from a bunch of sensors installed in a factory, and I need to do predictive maintenance. Sometimes the data is clearly off the charts and I know from the domain knowledge that I can and should safely ignore them.
This is not possible under the proposed resampler API. fit_resample is currently stateless, in that it fits and resamples on the same data and returns the resampled result. If I understand this usecase correctly, this "novelty rejection" would require you to be able to separately fit the resampler on the train set, and call resample whenever you predict. Since we only have a fit_resample method, this is not possible.
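A toy sketch of the distinction (the class and its separate resample method are illustrative assumptions, not part of the proposal):

```python
import numpy as np

class ToyNoveltyRejector:
    """Toy illustration only: the separate `resample` verb below is exactly
    what the proposed API does *not* provide."""

    def fit(self, X, y=None):
        # learn a crude per-feature range on the training data
        self.low_ = np.percentile(X, 1, axis=0)
        self.high_ = np.percentile(X, 99, axis=0)
        return self

    def resample(self, X, y=None):
        # drop rows outside the learned range; NOT part of the proposed API
        mask = ((X >= self.low_) & (X <= self.high_)).all(axis=1)
        return (X[mask], None) if y is None else (X[mask], y[mask])

    def fit_resample(self, X, y=None):
        # the proposed, "stateless" usage: fit and resample on the same data
        return self.fit(X, y).resample(X, y)
```

Use case 2 would need fit on the training data followed by resample on every later batch, which fit_resample alone cannot express.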
I think the predictive maintenance example is one for hierarchical classification with outlier detectors, not merely for outlier removal during training.
I think the SLEP should go through the series of methods on a Pipeline and note the proposed behaviour and the implications for Resampler/Pipeline design.
@orausch, @chkoar I think this deserves a little bit of ironing out, then some beefing up with examples. Can we see it progress?
I am now wondering whether the constraints here are too confusing, and whether the usability gain of being able to do this with the existing Pipeline (and avoid some nesting) is worth much. Instead we could offer a specialised meta-estimator:
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
from sklearn.utils.metaestimators import if_delegate_has_method


class ResampledTrainer(MetaEstimatorMixin, BaseEstimator):
    def __init__(self, resample, predictor):
        self.resample = resample
        self.predictor = predictor

    def fit(self, X, y=None, **kwargs):
        # resample is a callable returning the resampled data and sample props
        X, y, kwargs = self.resample(X, y, **kwargs)
        self.predictor_ = clone(self.predictor).fit(X, y, **kwargs)
        return self

    @if_delegate_has_method(delegate='predictor_')
    def predict(self, X, **predict_params):
        return self.predictor_.predict(X, **predict_params)

    @if_delegate_has_method(delegate='predictor_')
    def predict_proba(self, X):
        return self.predictor_.predict_proba(X)

    @if_delegate_has_method(delegate='predictor_')
    def predict_log_proba(self, X):
        return self.predictor_.predict_log_proba(X)

    @if_delegate_has_method(delegate='predictor_')
    def decision_function(self, X):
        return self.predictor_.decision_function(X)

    @if_delegate_has_method(delegate='predictor_')
    def score(self, X, y=None):
        return self.predictor_.score(X, y)

    @property
    def fit_transform(self):
        def fit_transform(X, y, **kwargs):
            self.fit(X, y, **kwargs)
            # transform the full, unresampled input with the fitted predictor
            return self.predictor_.transform(X)
        return fit_transform

    @property
    def fit_predict(self):
        def fit_predict(X, y, **kwargs):
            self.fit(X, y, **kwargs)
            # predict on the full, unresampled input with the fitted predictor
            return self.predictor_.predict(X)
        return fit_predict

    @property
    def classes_(self):
        return self.predictor_.classes_
Here, resample would be a callable which takes (X, y, **kw) and returns (X, y, kw). This is not entirely unlike score_func in univariate feature selection, which is similarly called only at fit time. Note that such functions can be parametrised for use with set_params by defining set/get_params, for which I provide a prototype generic wrapper in https://gist.github.com/jnothman/ba46247a36d375136a6662d1b1ef4c6d. We presume that there is no interesting "model" or statistics stored by the resampler, which may make it harder to diagnose issues.
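For concreteness, a minimal sketch of such a callable, here simply dropping rows that contain NaN (the name drop_nan_rows is made up for illustration):

```python
import numpy as np

def drop_nan_rows(X, y, **kw):
    """Resample callable in the shape described above: takes (X, y, **kw)
    and returns (X, y, kw), here removing samples that contain NaN."""
    X = np.asarray(X)
    mask = ~np.isnan(X).any(axis=1)
    # assume props are sample-aligned, e.g. sample_weight
    kw = {key: np.asarray(value)[mask] for key, value in kw.items()}
    y = None if y is None else np.asarray(y)[mask]
    return X[mask], y, kw
```

Something like ResampledTrainer(drop_nan_rows, LogisticRegression()) would then fit the classifier only on complete rows.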
On the other hand, we could require resample there to be an estimator defining fit_resample, which would make it easier for users to exploit the resampling capability of existing estimators, such as outlier detectors and KMeans for sample reduction. Basically, introducing ResampledTrainer would get around the confusion of adding this functionality to Pipeline.
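For example, an existing outlier detector could be given the proposed verb with a thin wrapper; this is only a sketch assuming fit_resample keeps the detected inliers (OutlierRejector is not an existing class):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class OutlierRejector(IsolationForest):
    """Sketch: expose an outlier detector through the proposed fit_resample verb."""

    def fit_resample(self, X, y=None):
        # fit_predict returns +1 for inliers and -1 for outliers
        mask = self.fit_predict(X) == 1
        return (X[mask], None) if y is None else (X[mask], np.asarray(y)[mask])
```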
I think this has a chance of making it into 0.22 unlike slep006... I'm happy to take over its drafting if @glemaitre, @chkoar, @orausch would like me to.
I have time to help out with this.
I think this deserves a little bit of ironing out, then some beefing up with examples.
What kinds of examples are you looking for?
Let's start with making the description of behaviour in a Pipeline (if we don't just want to start with a specialised ResampledTrainer) more precise, and describe the return value of fit_resample, and then we can review if examples and use cases would help make those things more concrete.
Let's start with making the description of behaviour in a Pipeline (if we don't just want to start with a specialised ResampledTrainer) more precise, and describe the return value of fit_resample
I believe that this part is the most crucial in the whole process, since everyone has a different opinion and different expectations of the behaviour, so let's agree on this.
I think a ResampledTrainer means we can get away with modifying Pipeline at some later stage. Otherwise, if we want a table of Pipeline methods, it deserves more detail, describing how the Pipeline delegates to estimators in the head (which here I'm defining as "all but the last step") and tail, distinguishing a head est with fit_resample, any other head est, and the final est.

For example, when calling pipe.fit_transform(X, y):

- head est with fit_resample: Xt, yt = est.fit_resample(Xt, yt)
- any other head est: Xt, yt = est.fit_transform(Xt, yt)
- final est: est.fit(Xt, yt); return self.transform(X, y)

Note this disregards kwargs so far. It assumes the Pipeline starts with Xt, yt = X, y.
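One way to read those rows, sketched as plain Python (an assumption about the intended semantics, not the SLEP's wording nor scikit-learn's implementation; here yt is taken to pass through ordinary transformers unchanged):

```python
def pipeline_fit_transform(head, last, X, y):
    """head: all steps but the last; last: the final estimator."""
    Xt, yt = X, y
    # fit path: resamplers may change both Xt and yt, transformers only Xt
    for est in head:
        if hasattr(est, "fit_resample"):
            Xt, yt = est.fit_resample(Xt, yt)
        else:
            Xt = est.fit_transform(Xt, yt)  # yt passes through unchanged
    last.fit(Xt, yt)
    # transform path over the original X: resamplers are skipped, so the
    # result matches pipe.fit(X, y).transform(X)
    Xout = X
    for est in head:
        if not hasattr(est, "fit_resample"):
            Xout = est.transform(Xout)
    return last.transform(Xout)
```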
We can also note a requirement that fit_resample not change the order or meaning of features.
But I'm really leaning away from modifying Pipeline in this iteration. Defining the fit_resample verb is enough for the SLEP, giving examples in outlier detectors, KMeans and a new Resample (or SimpleResampler) estimator, alongside an implementation of sklearn.compose.ResampledTrainer.
But I'm really leaning away from modifying Pipeline in this iteration
I really like the idea of the meta-estimator for the training. The changes in the Pipeline are not trivial and it might be wise to have a meta-estimator specialized for the use case of training. @ogrisel was proposing exactly this implementation this morning, before being aware of @jnothman's proposal.
I'm happy to take over its drafting if @glemaitre, @chkoar, @orausch would like me to.
I would like to dedicate some time to it. So we can team up. Let me know if you want me to draft, review, or address some comments.
For the imbalanced classification problem I was about to suggest a solution very similar to @jnothman's suggestion from https://github.com/scikit-learn/enhancement_proposals/pull/15#issuecomment-504797660 but that would further implement baseline resampling strategies based on relative class frequencies in y (e.g. under-sampling of the majority classes or over-sampling of the minority classes).
I would be very much in favor of exploring the meta-estimator approach before complexifying the estimator API and pipeline implementation.
Similarly, BaggingClassifier could also be slightly improved to deal with imbalanced classes by resampling based on y frequencies according to some common baseline strategies.
I think this would give us a very good generic solution to tackle the imbalanced classification problem.
For rebalancing based on relative class frequencies in y, we still have
And I don't think it's necessary for the slep.
And I don't think it's necessary for the slep.
My point is: do we really need a SLEP to add a new fit_resample public API that impacts the pipeline if the main usecases (fitting models with various resampling strategies at train time only, e.g. for imbalanced classes) are already addressed by the use of a bunch of meta-estimators?
My point is: do we really need a SLEP to add a new fit_resample public API that impacts the pipeline if the main usecases (fitting models with various resampling strategies at train time only, e.g. for imbalanced classes) are already addressed by the use of a bunch of meta-estimators?
I think that means the transformers work on the whole data, which may not be what one wants. We may want to apply the resampler before the other transformers.
Ummm ... I think the introduction of fit_resample could be regarded as a change to API principles. I think it would be wise to slep it and I don't think there will be much difficulty reaching consensus.
Not sure what you mean, Adrin. I think the ResampledTrainer is pretty straightforward in that it can be a transformer but will transform the entire input with a model trained on resampled data. Its estimator can be a pipeline too.
Its estimator can be a pipeline too.
True.
A further decision point is whether a resampler must provide fit
I think I agree with @ogrisel about exploring the meta-estimator approach. I was against doing the API extension for a while, but then kind of changed my mind and started using imbalanced-learn in class. Now I'm not entirely sure any more. I think we really should do a side-by-side comparison of what the user code would look like with either approach.
One thing that would potentially be quite ugly is if the sampling is somewhere in the middle of the pipeline, because then it would be
make_pipeline(..., ResamplingMeta(make_pipeline(...)))
which would make it hard to iterate over the steps. I'm not sure how common the use-case is, though. I guess you want to scale the data before applying Birch?
make_pipeline(StandardScaler(),
              ResamplingMeta(make_pipeline(SelectKBest(), SVC()), resample=Birch()))
is harder to work with than
make_pipeline(StandardScaler(), Birch(), SelectKBest(), SVC())
Is that a realistic use-case, though? Doing
make_pipeline(StandardScaler(), SelectKBest(),
              ResamplingMeta(SVC(), resample=Birch()))
might be sufficient? And in many cases the resampling doesn't require preprocessing so you could do
ResamplingMeta(make_pipeline(StandardScaler(), SelectKBest(), SVC()),
               resample='minority')
which is not that bad either.
Do we want to apply the 80/20 rule?
I agree that there could be usability benefits to enabling this in a Pipeline, but it requires relatively messy code and hard-to-explain semantics e.g. in the case of fit_transform. In a Pipeline, a resampler looks like a transformer but behaves unlike one. We also haven't resolved the handling of sample props output from fit_resample in a Pipeline.
A meta-estimator is unambiguous. It is also expedient, and we could transition to a Pipeline solution in the future once we have a clear and functional fit_resample API. At this point, the specialised meta-estimator is my preferred approach towards releasing this functionality.
Another advantage of the Pipeline approach is easy chaining of resamplers. Here's a common usecase I can imagine:
clf = ResampledTrainer(
    NaNRejector(),  # removes samples containing NaN
    ResampledTrainer(RandomUnderSampler(),
                     make_pipeline(StandardScaler(), SGDClassifier()))
)
vs
clf = make_pipeline(NaNRejector(), RandomUnderSampler(), StandardScaler(), SGDClassifier())
I don't think chaining resamplers is such a common use-case. I don't really think there's any question that it's easier to write code with a Pipeline. But it creates some maintenance and documentation headaches, and some ambiguities (the scope of where you use the resampled data and where you use the original data is a bit unclear).
Btw, @orausch, I wasn't expecting you to merge my PR to your branch quite so readily, for what it's worth. I was proposing what it would look like as an alternative.
We could go with @amueller's suggestion to consider both paths. But I'm still tempted to take the expedient approach of clearly defining fit_resample and providing a meta-estimator until we can make Pipeline work in a satisfying way.
IMO, we have the same complexity as with the stacker. Having the meta-estimator will already answer the common need and even complex ones (with some cumbersome code), but we don't restrict or change any API rule, which would need time to converge.
We should probably frame these API changes, but I think that having the meta-estimator at first will be best.
Here is the SLEP regarding the Outlier Rejection API