scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

WIP: transformers that modify y #2

Closed GaelVaroquaux closed 5 years ago

GaelVaroquaux commented 9 years ago

This is almost ready for early discussion:

Best way to look at the proposal is: https://github.com/GaelVaroquaux/enhancement_proposals/blob/transform_y/slep001/discussion.rst

amueller commented 9 years ago

I think you should state more clearly that this is for use within a chain or processing sequence of estimators. Doing many of these things is possible "by hand"; the point is that you don't want to write custom connecting logic.

amueller commented 9 years ago

Just a note: this misses the part that @agramfort wanted to write: user-story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

It would probably be easy to come up with some outlier or resampling examples, but more "real-life" ones would be cool.

amueller commented 9 years ago

Overall this looks great and I mostly agree with the summary. The "should this be a new class" discussion is touched upon only briefly, and so is the "should the number of return arguments vary" discussion.

amueller commented 9 years ago

ping @jnothman [no rush] @glouppe

GaelVaroquaux commented 9 years ago

I think you should state more clearly that this is for use within a chain or processing sequence of estimators. Doing many of these things is possible "by hand"; the point is that you don't want to write custom connecting logic.

Thanks. Addressed.

GaelVaroquaux commented 9 years ago

Just a note: this misses the part that @agramfort wanted to write: user-story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

ping @agramfort

GaelVaroquaux commented 9 years ago

I just realized that this proposal adds conceptual complexity (and thus a burden for the user):

We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):

fit_pipe
transform_pipe
transform

The question is: at fit time, in a pipeline, which has priority over which, if more than one of these methods is present. In my head it is pretty clear (it's in the order listed above). But it might be confusing for the user.

We need both fit_pipe and transform_pipe to exist (and be different) to tackle things such as outlier detection. However, in most cases, we probably do not need transform_pipe.
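
For concreteness, a minimal sketch of an outlier-removal step under this scheme (interim names; IsolationForest is purely illustrative and stands in for any outlier detector):

from sklearn.ensemble import IsolationForest

class OutlierRemover:
    def fit_pipe(self, X, y):
        # fit time: drop detected outliers from both X and y
        keep = IsolationForest().fit_predict(X) == 1
        return X[keep], y[keep]

    def transform_pipe(self, X):
        # predict/transform time: identity, every test sample passes through
        return X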

code-of-kpp commented 9 years ago

Hello!

I used to deal with such things too.

My current approach (which should work with stable sklearn) is to add some ABCs and monkey patch Pipeline to use them.

An ABC is an abstract base class (it has hooks for isinstance calls):

>>> lab_enc = LabelEncoder()
...
>>> isinstance(lab_enc, ABC_Y_Transformer)
True

Pipeline is patched so that if an estimator step passes the isinstance check with ABC_Y_Transformer, then the step's fit and transform methods are called with y instead of X.

My current solution for such an ABC: hardcode all sklearn (and my own) classes in ABC_Y_Transformer.__instancecheck__ so that the isinstance(obj, ABC_Y_Transformer) call becomes obj.__class__ in predefined_set_of_classes.
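
A minimal sketch of that trick (the class names and the hardcoded set are illustrative):

from abc import ABCMeta
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# illustrative hardcoded set of classes whose fit/transform should get y
_Y_TRANSFORMER_CLASSES = {LabelEncoder, LabelBinarizer}

class _YTransformerMeta(ABCMeta):
    def __instancecheck__(cls, obj):
        # isinstance(obj, ABC_Y_Transformer) becomes a simple set lookup
        return obj.__class__ in _Y_TRANSFORMER_CLASSES

class ABC_Y_Transformer(metaclass=_YTransformerMeta):
    pass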

This way I can put label transformers and X transformers in a single pipeline.

The same logic can be applied to other cases, where X and y should be modified together.

Instead of this hacky ABC, sklearn can have two additional mixins - LabelTransformerMixin and XYTransformerMixin. And Pipeline could be patched the same way as with the ABC.

This way, the modifications to the sklearn code will be small, the new version of sklearn will be (almost) compatible with previous ones, and the overall API changes are small.

Modifications needed: add the mixins, add the parent class to the current label transformers, and make Pipeline respect the new mixins.

mblondel commented 9 years ago

Could you elaborate on the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it's never used in a pipeline.

agramfort commented 9 years ago

The code I'd like to write:

est = make_pipeline(ClassBalancer(), SVC())  # balance classes by subsampling before calling SVC
est = make_pipeline(DataAugmentator(), RandomForest())  # data augmentation followed by RF
est = make_pipeline(BirchCoresets(), GMM())  # coresets and GMM for unsupervised learning
est = make_pipeline(BirchCoresets(), Classifier())  # coresets and then supervised learning, which could do partial_fit if Classifier has it

Then for out-of-core / file-based usage:

est = make_pipeline(FilesToXy(), Classifier())

FilesToXy could do a partial or full "fit" conversion from files to X, y pairs, possibly with sample props.

Amend if you see more.

GaelVaroquaux commented 9 years ago

Instead of this hacky ABC, sklearn can have two additional mixins - LabelTransformerMixin and XYTransformerMixin.

What you are suggesting goes in the same direction as option B. However, we have a design policy, which is to rely as much as possible on interfaces rather than inheritance (or typing). The reason is that it makes things more explicit: given the name of a method (which I see in the code I am typing or reading), I should be able to infer what a class does. Hence the need to introduce new methods.

GaelVaroquaux commented 9 years ago

FilesToXy could do a partial or full "fit" conversion from files to X, y pairs, possibly with sample props.

Can you read my discussion (part 2.2.2, bullet 2) on why I think this is hard / impossible to integrate in the model selection framework, and tell me if you understand / agree?

GaelVaroquaux commented 9 years ago

Could you elaborate on the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it's never used in a pipeline.

I have added a note on these issues at the beginning of option B. Tell me if it is clear.

ogrisel commented 9 years ago

I think we need to write down how we would implement some example pipelines using the proposed API:

Note: pipeline 1 should ideally support batch and out-of-core fitting. In the out-of-core case, I am afraid that the stochastic under-sampler needs to be aware of whether it is transforming data for fitting downstream estimators (in which case it needs to actually perform under-sampling) or whether all the downstream estimators have already been fitted, in which case it should behave like a pass-through (identity function) to make actual predictions.

Maybe the pipeline instance could be 'smart' and understand that it needs to disable the under-sampler at prediction time. However, we might face the same problem with other models, such as a jitter / nudger used for sample augmentation that is only meaningful at fit time.

amueller commented 9 years ago

We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):

fit_pipe
transform_pipe
transform

The question is: at fit time, in a pipeline, which has priority over which, if more than one of these methods is present. In my head it is pretty clear (it's in the order listed above). But it might be confusing for the user.

I think we should not have objects that have both transform and transform_pipe. And I think every object that has fit_pipe should also have transform_pipe. So an object either has fit and transform, or fit_pipe and transform_pipe. Then the semantics are pretty clear to me: fit_pipe is called during fit, and transform_pipe is called during predict, transform and score.
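
A sketch of what that dispatch could look like inside Pipeline (hypothetical helpers, not actual scikit-learn code):

def _fit_step(step, X, y):
    # fit time: steps with fit_pipe may change n_samples and return (X, y)
    if hasattr(step, "fit_pipe"):
        return step.fit_pipe(X, y)
    step.fit(X, y)
    return step.transform(X), y

def _transform_step(step, X):
    # predict/transform/score time
    if hasattr(step, "transform_pipe"):
        return step.transform_pipe(X)
    return step.transform(X)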

amueller commented 9 years ago

Fun question: does the pipeline itself have fit_pipe and transform_pipe?

amueller commented 9 years ago

@podshumok we want to modify both X and y in the pipeline.

amueller commented 9 years ago

@ogrisel I feel like we should leave partial_fit out of this. I think it is pretty unrelated: the solution for partial_fit will not depend on whether the pipeline calls transform or transform_pipe.

kastnerkyle commented 9 years ago

There was a pretty serious discussion on reddit about this type of stuff (and tying in multi-core) a few days ago. Just a pointer for reference/future: a sample of opinions about what issues might be important to users.

kastnerkyle commented 9 years ago

@amueller as far as objects not having both transform and transform_pipe goes - how do we handle legacy support, e.g. PCA? Or do you mean just ignoring transform on the objects inside this OOC pipeline?

code-of-kpp commented 9 years ago

What you are suggesting goes in the same direction as option B. However, we have a design policy, which is to rely as much as possible on interfaces rather than inheritance (or typing).

Then maybe it is better to have a method indicating whether we transform data, labels, or both, instead of adding multiple new methods for actual fitting and/or transforming.

@podshumok we want to modify both X and y in the pipeline.

Yep, I covered this point. The reason I focused on y only is that such transformers are already in sklearn, but they are not usable with pipelines.

mblondel commented 9 years ago

It's not clear to me yet why the contract "n_samples doesn't change" is so useful. If we have a way to detect incorrect pipeline usage and do informative error reporting, maybe this contract is not so useful.

Assuming it's ok to change n_samples, how about the following method names:

(I don't fully comprehend the issues at stake here so sorry if my questions / comments are irrelevant)

mblondel commented 9 years ago

Another pattern not discussed so far is:

The pipeline needs a way to know that the step concerns y, and the last step is currently not covered in pipelines. However, I am not sure whether there are other use cases beyond simple y scaling.

mblondel commented 9 years ago

Bearing in mind that objects could be used outside of pipelines, I think fit_filter is a better name than fit_pipe.

amueller commented 9 years ago

@mblondel I agree that "scaling y and scaling back after prediction" is a useful pattern. It is currently not included in this proposal, as a pipeline cannot operate on the prediction of a previous step (only the last step can use "predict"). We could pretend the prediction is a new X, but that is ugly.

datnamer commented 9 years ago

As an end user, my dream would be to just wrap a Blaze data source (like a sqlite db) and have it work in some sort of self-updating reactive out-of-core pipeline.

Odo and blaze have some chunking/streaming functionality that might work for this.

I've seen this sort of thing mentioned before by other users (can't recall the exact post, whether in a PR or the google group)... so it at least should have a +1 :)

code-of-kpp commented 8 years ago

Is it dead?

amueller commented 8 years ago

no, it's unfinished ;)

kingjr commented 8 years ago

We're facing some difficulties in MNE-Python related to the present proposal, and @GaelVaroquaux suggested that it may be useful to describe our use cases so as to inform your decision. I hope this won't be too irrelevant.

1. Initial steps need all samples, subsequent steps need a subset of samples

We often need to apply an estimator on a sliding window over X of shape (n_samples, n_features), where the samples are consecutive "time samples" that are locally correlated with one another. We're thus constructing a feature space from adjacent samples. Consequently, we're trying to fit a single coef vector of shape (n_time_samples * n_features) which predicts a y vector (this is analogous to auto-regressive models). This is a non-conventional design in itself, but the part relevant to the present issue comes next.

Specifically, to construct such a rolling time window, we need all samples. However, in the following steps, many samples happen to be detrimental to the fitting (typically, many samples have all their features = 0). We thus need to exclude these samples from the subsequent steps in the pipeline.
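
A rough sketch of these two stages (helper names and shapes are illustrative, not MNE-Python code):

import numpy as np

def rolling_window_features(X, width):
    # needs all consecutive samples: stack `width` adjacent time samples
    n_samples, n_features = X.shape
    out = np.zeros((n_samples, width * n_features))
    for lag in range(width):
        out[lag:, lag * n_features:(lag + 1) * n_features] = X[:n_samples - lag]
    return out

def drop_empty_rows(X, y):
    # subsequent steps should not see samples whose features are all zero
    keep = ~np.all(X == 0, axis=1)
    return X[keep], y[keep]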

2. Semi-supervision

It is not entirely clear to me how/whether the current proposal suggests informing each step of the pipeline about whether each value of y was known or has been modified by a previous step.

Specifically, one may need to do:

pipe = make_pipeline_xy(
      EstimatorFittedOnKnownYOnly(transform_values=np.nan),  # replaces y == np.nan with predictions
      EstimatorFittedOnModifiedY(weight_y_known=.80),  # gives higher weight to known than to modified y
)

3. Scoring

I'm unclear about what the 'scoring' consequences of each alternative of the proposal will be.

So far, I've systematically written meta-estimators to handle these cases, but it's true that they tend to become a bit nested, and difficult to handle.

glemaitre commented 8 years ago

During the development of imbalanced-learn, we actually confronted this issue, since we need to modify both X and y.

After discussing briefly with @GaelVaroquaux, he redirected us to the right discussion thread. We therefore thought that it could be a good idea to present succinctly the API choices that we made. Our aim is threefold: (i) we would like to be sure that our API will follow the choice of the scikit-learn community when such transformers get into the toolbox; (ii) we would like to transfer all implemented algorithms which meet the inclusion criteria of scikit-learn; and (iii) we would like to continue implementing "cutting-edge" methods which would be complementary to scikit-learn.

1. Base class

1.1 SamplerMixin

We decided to create a new mixin called SamplerMixin, since all these methods aim at resampling X and y. The estimator is made of 3 methods: fit, sample, and fit_sample.
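
A sketch of that interface (the method bodies are illustrative, not the actual imbalanced-learn code):

class SamplerMixin:
    def fit(self, X, y):
        # compute the statistics needed for resampling
        return self

    def sample(self, X, y):
        # return a resampled (X_res, y_res); n_samples may change
        raise NotImplementedError

    def fit_sample(self, X, y):
        # convenience shortcut: fit, then sample
        return self.fit(X, y).sample(X, y)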

1.2 BaseBinarySampler vs. BaseMulticlassSampler

Since some of the algorithms are not designed to work with multiple classes, we created 2 base classes inheriting from SamplerMixin to check the target type and potentially raise warnings. This is permissive for the moment, in order to stay compatible with the scikit-learn base estimator.

2. Pipeline object

We modified the scikit-learn Pipeline to take into account any sampler in a pipeline. The sampling is only performed at fit time and not at predict time. This appears to be what one would actually want when correcting a balancing problem: change the balance at training time but test on an imbalanced dataset. Since TransformerMixin and SamplerMixin are distinct, we didn't run into trouble with point 2.2.2.2-2, but at the cost of a new base class.

I hope I haven't forgotten anything. @chkoar @dvro correct me if this is the case.

dvro commented 8 years ago

@glemaitre I'd also add that the imbalanced-learn implementation enables us to work with different approaches. Given a set of samples S = (X, y):

  1. Undersampling/Prototype Selection: returns a new set of samples S', with S' ⊆ S.
  2. Undersampling/Prototype Generation: returns a new set of samples S', with |S'| < |S|.
  3. Oversampling/Selection: returns a new set of samples S', with |S'| > |S|.
  4. Oversampling/Generation: returns a new set of samples S', with S ⊆ S'.
  5. Relabeling: returns the same set of samples with modified labels, S'.

As you can see, some of these transformations could be implemented by returning a list of indices, so that X, y = X[idx, :], y[idx]. However, other transformations actually require returning a new X and y (Prototype Generation and Relabeling). Therefore, (the way I see it) a fit_filter would not work in this second case.
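
To illustrate the difference (purely illustrative helpers):

import numpy as np

def undersample_indices(y, random_state=0):
    # selection-type methods can return indices into the original data
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    return np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_min, replace=False)
        for c in classes
    ])  # caller then does X[idx, :], y[idx]

def class_mean_prototypes(X, y):
    # generation-type methods must return new arrays: the prototypes
    # are not rows of the original X, so indices cannot express them
    classes = np.unique(y)
    X_new = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return X_new, classes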

That being said, I really like @GaelVaroquaux's fit_modify implementation, which also works in these 5 cases. Let us know how we can help get this released.

jnothman commented 7 years ago

I think we have a few distinct cases that should be handled separately (if at all).

Data reduction

Resampling and outlier removal at fit time can be done with CV splitters (or by transforming sample_weight, which is another question...). They're not a particularly interesting case. Other data reduction, e.g. via BIRCH, is interesting, and yet it makes sense for all these cases to share an interface and be supported by Pipeline.

I think for all of these cases imbalanced-learn has got it more-or-less right. Sampler objects should not do anything at predict/transform time, only at fit time. From the perspective of Pipeline, only fit_sample matters. @glemaitre, do you have good use-cases for having a separate sample method? I can imagine its use in out-of-core learning where we want to maintain outlier statistics for a first batch, but again pipeline does not support this case.

Until we find a compelling use-case, I think fit_sample and [fit_]transform on the same object should be disallowed by Pipeline. I am inclined to implement the imbalanced-learn API.

[Edit 2 days later: I discovered I proposed much the same at https://gist.github.com/jnothman/274710f945e311697466]

Target transformation

Transforming targets, as in "scaling y and scaling back after prediction" mentioned above, seems a better fit for a meta-estimator, as it:

I think we should have a ?TransformedTargetPredictor:

from sklearn.base import BaseEstimator, clone

class TransformedTargetPredictor(BaseEstimator):
    def __init__(self, estimator, func, inverse_func=None):
        self.estimator = estimator
        self.func = func
        self.inverse_func = inverse_func

    def fit(self, X, y=None, **kwargs):
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, self.func(y), **kwargs)
        return self

    def predict(self, X):
        y = self.estimator_.predict(X)
        if self.inverse_func is None:
            return y
        return self.inverse_func(y)

This would at least establish a best practice for such things, allowing people to build similar transformations (for what use case, I don't know) dependent on X, y.
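
For instance, a hypothetical usage (X_train, y_train, X_test stand in for any regression data):

import numpy as np
from sklearn.linear_model import Ridge

reg = TransformedTargetPredictor(Ridge(), func=np.log1p, inverse_func=np.expm1)
reg.fit(X_train, y_train)     # Ridge is fit on log1p(y_train)
y_pred = reg.predict(X_test)  # predictions mapped back to the original scale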

Data conversion

This is the challenging case, as meta-estimators do not feel appropriate. However, it seems that we're breaking the intuition of our pipeline/scoring design if we allow this kind of transformation within a Pipeline. The contract that the output of a Pipeline's predict should look like the y that is passed into its fit seems an essential one to me. I'm therefore not convinced that it deserves sharing the same API as resampling. Hmm.

glemaitre commented 7 years ago

@glemaitre, do you have good use-cases for having a separate sample method?

I would say no. fit_sample alone seems more appropriate, as you suggested.

jnothman commented 7 years ago

Btw, an implementation and example of using a meta-CV splitter for resampling is at https://gist.github.com/jnothman/6bddbbcca71bdf9fd37e8495d70b42e8. Again, the main problem is that it does not support other training modifications such as data reduction or insertion of label noise.

jnothman commented 7 years ago

FWIW, https://github.com/dvro/scikit-protopy/blob/master/protopy has an equivalent of "fit_[re]sample" called "reduce_data".

chkoar commented 7 years ago

@jnothman yes. @dvro is the author of scikit-protopy, and he has implemented some of the imbalanced-learn algorithms according to the fit_sample interface.

jnothman commented 7 years ago

One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model over a large collection of texts which maximises text classification performance with a small amount of labelled data (https://github.com/scikit-learn/scikit-learn/issues/8468). In this case, the transformers need the unlabelled data at fit time, but the classifier does not, nor is the unlabelled data needed at transform/predict time.

A potential, if perhaps unparadigmatic, solution may combine the following:

import numpy as np
from sklearn.base import BaseEstimator

class DiscardUnlabeled(BaseEstimator):
    def fit_resample(self, X, y):
        # keep only the labelled samples (y == -1 marks "unlabelled")
        return X[y != -1], y[y != -1]

class SemiSupervisedSplit:
    def __init__(self, base_cv):
        self.base_cv = base_cv

    def split(self, X, y):
        has_label = y != -1
        has_label_idx = np.flatnonzero(has_label)
        no_label_idx = np.flatnonzero(~has_label)
        for train_idx, test_idx in self.base_cv.split(X[has_label_idx], y[has_label_idx]):
            # train on all unlabelled samples plus the labelled training fold;
            # test only on labelled samples
            yield (np.concatenate([no_label_idx, has_label_idx[train_idx]]),
                   has_label_idx[test_idx])
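
A hypothetical check that the splitter behaves as intended (synthetic data; y == -1 marks unlabelled samples, and DiscardUnlabeled would sit in the pipeline before the classifier):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.RandomState(0).randn(100, 5)
y = np.r_[np.zeros(20), np.ones(20), -np.ones(60)].astype(int)

cv = SemiSupervisedSplit(StratifiedKFold(4))
for train_idx, test_idx in cv.split(X, y):
    assert np.all(y[test_idx] != -1)   # test folds contain only labelled data
    assert np.any(y[train_idx] == -1)  # train folds include the unlabelled pool
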
amueller commented 7 years ago

I feel more and more that the most pragmatic solution might by now be to not introduce a new method, but instead to allow X to be a DataFrame that also contains the labels. That doesn't solve everything, but it makes for a much more natural workflow for most people.

GaelVaroquaux commented 7 years ago

I feel more and more that the most pragmatic solution might by now be to not introduce a new method, but instead to allow X to be a DataFrame that also contains the labels. That doesn't solve everything, but it makes for a much more natural workflow for most people.

I see a few limitations:

  • What if X is sparse (or anything other than columns, e.g. free text)?

  • What if y is more complex than a vector? (multi-label, for instance)

  • I worry a lot about data leakage in cross-validation. Scikit-learn is quite good at preventing that (and people keep complaining that they cannot do their weird cross-validation thing that has implicit leakage :) ).

amueller commented 7 years ago

Sparse X and text can be handled by DataFrames, though I haven't recently checked how good the sparse support is. Multilabel y could be handled by a hierarchical index. I agree that separating X and y less than our current API does might have downsides. More flexibility allows people to screw up more. Can you give an example of the cross-validation issue? The .split method has access to y, so you should be able to do anything.

GaelVaroquaux commented 7 years ago

Sparse X and text can be handled by DataFrames, though I haven't recently checked how good the sparse support is.

Not great, AFAIK, but I'd love to be proven wrong.

Multilabel y could be handled by a hierarchical index.

It makes things much harder and much more clumsy. Pandas is powerful, but hard to use.

Can you give an example of the cross-validation?

Mixing X and y in a transformer.

amueller commented 7 years ago

Text was never a problem iirc. Sparse data requires a copy and looks a bit half-baked: http://pandas.pydata.org/pandas-docs/stable/sparse.html

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

GaelVaroquaux commented 7 years ago

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

My students keep making errors that are very, very hard to debug (I do too). They are handy, but lead to voodoo-style code. I don't think that it's a good basis for a scikit-learn API.

jnothman commented 7 years ago

I don't see how this fixes things, for many cases. One major feature of pipeline is that it trains the final predictor to output predictions comparable to the evaluation ground truth. How would this be ensured if a dataframe is used?
