I think you should state more clearly that this is for use within a chain or processing sequence of estimators. Doing many of these things is possible "by hand"; the point is that you don't want to write custom connecting logic.
Just a note: this is missing the part that @agramfort wanted to write, the user-story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.
It would probably be easy to come up with some outlier or resampling examples, but more "real life" ones would be cool.
Overall this looks great and I mostly agree with the summary. The "should this be a new class" discussion is touched upon only briefly, and so is the "should the number of return arguments vary" discussion.
ping @jnothman [no rush] @glouppe
I think you should state more clearly that this is for use within a chain or processing sequence of estimators. Doing many of these things is possible "by hand"; the point is that you don't want to write custom connecting logic.
Thanks. Addressed.
Just a note: this is missing the part that @agramfort wanted to write, the user-story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.
ping @agramfort
I just realized that this proposal is adding a conceptual complexity (and thus burden for the user):
We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):
fit_pipe
transform_pipe
transform
The question is: at fit time, in a pipeline, which one takes priority over the others if more than one of these methods is present? In my head it is pretty clear (it's in the order listed above). But it might be confusing for the user.
We need both fit_pipe and transform_pipe to exist (and be different) to tackle things such as outlier detection. However, in most cases, we probably do not need transform_pipe.
Hello!
I used to deal with such things too.
My current approach (which should work with stable sklearn) is to add some ABCs and monkey-patch Pipeline to use them.
An ABC is an abstract base class (it has hooks for isinstance calls):
>>> lab_enc = LabelEncoder()
...
>>> isinstance(lab_enc, ABC_Y_Transformer)
True
Pipeline is patched so that if an estimator step passes the isinstance check with ABC_Y_Transformer, then the fit and transform methods of that step will be called with Y instead of X.
My current solution for such an ABC: hardcode all sklearn (and my own) classes in ABC_Y_Transformer.__instancecheck__, so that the isinstance(obj, ABC_Y_Transformer) call becomes obj.__class__ in predefined_set_of_classes.
This way I can put label transformers and X transformers in a single pipeline.
The same logic can be applied to other cases, where X and Y should be modified together.
Instead of this hacky ABC, sklearn can have two additional mixins, LabelTransformerMixin and XYTransformerMixin, and Pipeline could be patched the same way as with the ABC.
This way, the modifications to the sklearn code will be small, new versions of sklearn will be (almost) compatible with previous ones, and the overall API changes are small.
Modifications needed: add the mixins, add the parent class to the current label transformers, and make Pipeline respect the new mixins.
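For illustration, a minimal sketch of what such a hardcoded instance check could look like (the set of classes and the names are just an assumption; the monkey-patched Pipeline itself is omitted):

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

class _YTransformerCheck(type):
    # hardcoded set of classes whose fit/transform should receive Y instead of X
    _y_transformer_classes = {LabelEncoder, LabelBinarizer}

    def __instancecheck__(cls, obj):
        # isinstance(obj, ABC_Y_Transformer) becomes a set-membership test
        return obj.__class__ in cls._y_transformer_classes

class ABC_Y_Transformer(metaclass=_YTransformerCheck):
    """Marker class: isinstance(step, ABC_Y_Transformer) flags y-transformers."""

A patched Pipeline would then check each step with isinstance(step, ABC_Y_Transformer) to decide whether to pass y instead of X to that step.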
Could you elaborate on the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it's never used in a pipeline.
The code I'd like to write is:
est = make_pipeline(ClassBalancer(), SVC())  # balance classes by subsampling before calling SVC
est = make_pipeline(DataAugmentator(), RandomForest())  # data augmentation followed by RF
est = make_pipeline(BirchCoresets(), GMM())  # coresets and GMM for unsupervised
est = make_pipeline(BirchCoresets(), Classifier())  # coresets and then supervised, which could do partial_fit if Classifier has it
then for out-of-core / file-based usage:
est = make_pipeline(FilesToXy(), Classifier())
FilesToXy could do, with a partial or full "fit", the conversion from files to X, y pairs, eventually with sample props.
Amend if you see more.
Instead of this hacky ABC, sklearn can have two additional mixins - LabelTransformerMixin and XYTransformerMixin.
What you are suggesting goes in the same direction as option B. However, we have a design policy of relying as much as possible on interfaces rather than inheritance (or typing). The reason is that it makes things more explicit: given the name of a method (which I see from the code I am typing or reading), I should be able to infer what a class does. Thus the need to introduce new methods.
FilesToXy could do, with a partial or full "fit", the conversion from files to X, y pairs, eventually with sample props.
Can you read my discussion (part 2.2.2, bullet 2) on why I think this is hard / impossible to integrate into the model selection framework, and tell me if you understand / agree?
I have added a note on these issues at the beginning of option B. Tell me if it is clear.
I think we need to write down how we would implement some example pipelines using the proposed API:
Note: pipeline 1 should ideally support batch and out-of-core fitting. In the out-of-core case, I am afraid that the stochastic under-sampler needs to be aware of whether it is transforming data for fitting downstream estimators (in which case it needs to actually perform under-sampling) or whether all the downstream estimators have already been fitted, in which case it should behave like a pass-through (identity function) to make actual predictions.
Maybe the pipeline instance could be 'smart' and understand that it needs to disable the under-sampler at prediction time. However, we might face the same problem with other models, such as a jitter / nudger used for sample augmentation that is only meaningful at fit time.
We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):
fit_pipe
transform_pipe
transform
The question is: at fit time, in a pipeline, which one takes priority over the others if more than one of these methods is present? In my head it is pretty clear (it's in the order listed above). But it might be confusing for the user.
I think we should not have objects that have both transform and transform_pipe. And I think every object that has fit_pipe should also have transform_pipe.
So an object either has fit and transform, or fit_pipe and transform_pipe. Then the semantics are pretty clear to me: fit_pipe is called during fit, and transform_pipe is called during predict, transform and score.
Fun question: does the pipeline itself have fit_pipe and transform_pipe?
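For concreteness, a rough sketch of how a pipeline could dispatch under these semantics (fit_pipe / transform_pipe are the interim names above; this is an illustration, not the proposed implementation):

def pipeline_fit(steps, X, y):
    *transformers, (_, final) = steps          # steps: list of (name, estimator) pairs
    for _, step in transformers:
        if hasattr(step, "fit_pipe"):
            X, y = step.fit_pipe(X, y)         # fit-time path: may change X, y and n_samples
        else:
            X = step.fit_transform(X, y)
    final.fit(X, y)

def pipeline_predict(steps, X):
    *transformers, (_, final) = steps
    for _, step in transformers:
        if hasattr(step, "transform_pipe"):
            X = step.transform_pipe(X)         # predict/transform/score-time path
        else:
            X = step.transform(X)
    return final.predict(X)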
@podshumok we want to modify both X and y in the pipeline.
@ogrisel I feel like we should leave partial_fit out of this. I think it is pretty unrelated. I think the solution for partial_fit will not depend on whether the pipeline calls transform or transform_pipe.
There was a pretty serious discussion on reddit about this type of stuff (and tying in multi-core) a few days ago. Just a pointer for reference/future, a sample of opinions about what issues might be important to users.
@amueller as far as objects not having both transform and transform_pipe - how do we handle legacy support, e.g. PCA? Or do you mean just ignoring transform on the objects inside this OOC pipeline?
What you are suggesting goes in the same direction as option B. However, we have a design policy of relying as much as possible on interfaces rather than inheritance (or typing).
Then maybe it is better to have a method indicating whether we transform data, labels, or both, instead of adding multiple new methods for actual fitting and/or transforming.
@podshumok we want to modify both X and y in the pipeline.
Yep, I covered this point. The reason I focused on y only is that such transformers are already in sklearn, but they are not usable with pipelines.
It's not clear to me yet why the contract "n_samples doesn't change" is so useful. If we have a way to detect incorrect pipeline usage and do informative error reporting, maybe this contract is not so useful.
Assuming it's ok to change n_samples, how about the following method names:
(I don't fully comprehend the issues at stake here so sorry if my questions / comments are irrelevant)
Another pattern not discussed so far is scaling y before fitting and scaling the predictions back afterwards.
The pipeline needs a way to know that the step concerns y, and the last step (scaling predictions back) is currently not covered in pipelines. However, I am not sure whether there are other use cases beyond simple y scaling.
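For concreteness, this pattern done "by hand" today might look as follows (Ridge and StandardScaler are just an assumed example; the point is the connecting logic that a pipeline currently cannot express):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_train, y_train = np.random.rand(100, 3), np.random.rand(100) * 1000.0
X_test = np.random.rand(10, 3)

# scale y before fitting, scale the predictions back afterwards
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
est = Ridge().fit(X_train, y_scaled)
y_pred = y_scaler.inverse_transform(est.predict(X_test).reshape(-1, 1)).ravel()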
Bearing in mind that objects could be used outside of pipelines, I think fit_filter is a better name than fit_pipe.
@mblondel I agree that "scaling y and scaling back after prediction" is a useful pattern. It is currently not included in this proposal, as a pipeline cannot operate on the prediction of a previous step (only the last step can use "predict"). We could pretend the prediction is a new X, but that is ugly.
As an end user, my dream would be to just wrap a Blaze data source (like a sqlite db) and have it work in some sort of self-updating, reactive, out-of-core pipeline.
Odo and Blaze have some chunking/streaming functionality that might work for this.
I've seen this sort of thing mentioned before by other users (can't recall the exact post, whether in a PR or the google group)... so it at least should have a +1 :)
Is it dead?
no, it's unfinished ;)
We're facing some difficulties in MNE-Python related to the present proposal, and @GaelVaroquaux suggested that it may be useful to describe our use cases so as to motivate your decision. I hope this won't be too irrelevant.
We often need to apply an estimator on a sliding window over X of shape (n_samples, n_features), where the samples are consecutive "time samples" and are locally correlated with one another. We're thus constructing a feature space by using adjacent samples. Consequently, we're trying to fit a single coef vector of shape (n_time_samples * n_features) which predicts a y vector (this is analogous to auto-regressive models). This is a non-conventional design in itself, but the relevant part with respect to the present issue comes next.
Specifically, to construct such a rolling time window, we need all samples. However, in the following steps, many samples happen to be detrimental to the fitting (e.g. typically, many samples have all their features = 0). We thus need to exclude these samples from the subsequent steps in the pipeline.
It is not entirely clear to me how/whether the current proposal suggests informing each step of the pipeline about whether each value of y was known, or has been modified by a previous step.
Specifically, one may need to do:
pipe = make_pipeline_xy(
    EstimatorFittedOnKnownYOnly(transform_values=np.nan),  # modified y == np.nan by prediction
    EstimatorFittedOnModifiedY(weight_y_known=.80),  # gives higher weights to known than modified y
)
I'm unclear about what the 'scoring' consequences of each alternative of the proposal will be.
So far, I've systematically written meta-estimators to handle these cases, but it's true that they tend to become a bit nested, and difficult to handle.
During the development of imbalanced-learn, we actually confronted this issue, since we need to modify both X and y.
After discussing briefly with @GaelVaroquaux, he redirected us to the right discussion thread. Therefore, we thought that it could be a good idea to present succinctly the API choices that we made. Subsequently, our aim is threefold: (i) we would like to be sure that our API will follow the choice of the scikit-learn community when such transformers get into the toolbox; (ii) we would like to transfer all implemented algorithms which meet the inclusion criteria of scikit-learn; and (iii) we would like to continue implementing "cutting-edge" methods which would be complementary to scikit-learn.
SamplerMixin
We decided to create a new mixin called SamplerMixin, since all of its methods are aimed at resampling X and y. The estimator is made of 3 methods:
fit: we use this method for checking and for computing the statistics which are needed at sample time. In fact, all of this processing could be moved into sample for the current algorithms, which comes back to the discussion at 2.2.2.1.
sample: we use this method to perform the resampling.
fit_sample: fit followed by sample.
BaseBinarySampler vs. BaseMulticlassSampler
Since some of the algorithms are not designed to work with multiple classes, we created 2 base classes inheriting from SamplerMixin to check the target type and potentially raise warnings. This is permissive for the moment, to be compatible with the scikit-learn base estimator.
We modified the scikit-learn Pipeline to take into account any sampler in a pipeline. The sampling is only performed at fit time and not at predict time. It appears that this is what one would actually want to do when correcting a balancing problem: change the balance at training time, but test on an imbalanced dataset.
Since TransformerMixin and SamplerMixin are distinct, we did not get into trouble with point 2.2.2.2-2, but at the cost of a new base class.
I hope that I didn't forget anything. @chkoar @dvro correct me if this is the case.
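To make the description above concrete, here is a minimal sketch of how such a sampler could look (illustrative only: the method names follow the description above, while RandomUnderSampler's logic is a simplified assumption, not the imbalanced-learn code):

import numpy as np
from sklearn.base import BaseEstimator

class SamplerMixin:
    """Samplers resample X and y together, and only act at fit time."""

    def fit_sample(self, X, y):
        # fit computes the statistics needed at sample time,
        # sample performs the actual resampling
        return self.fit(X, y).sample(X, y)

class RandomUnderSampler(SamplerMixin, BaseEstimator):
    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.target_count_ = counts.min()  # statistic needed by sample
        return self

    def sample(self, X, y):
        # keep target_count_ samples from each class
        idx = np.concatenate([
            np.random.choice(np.flatnonzero(y == c), self.target_count_, replace=False)
            for c in self.classes_
        ])
        return X[idx], y[idx]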
@glemaitre I'd also add that the imbalanced-learn implementation enables us to work with different approaches. Given a set of samples S = (X, y):
As you can see, some of these transformations could be implemented by returning a list of indices, so that X, y = X[idx, :], y[idx]. However, other transformations actually require returning a new X and y (Prototype Generation and Relabeling). Therefore, (the way I see it) a fit_filter would not work in this second case.
That being said, I really like @GaelVaroquaux's fit_modify implementation, which also works in these 5 cases. Let us know how we can help get this released.
I think we have a few distinct cases that should be handled separately (if at all).
Resampling and outlier removal at fit time can be done with CV splitters (or by transforming sample_weight, which is another question...). They're not a particularly interesting case. Other data reduction, e.g. via BIRCH, is interesting, and yet it makes sense for all these cases to share an interface and be supported by Pipeline.
I think for all of these cases imbalanced-learn has got it more-or-less right. Sampler objects should not do anything at predict/transform time, only at fit time. From the perspective of Pipeline, only fit_sample matters. @glemaitre, do you have good use-cases for having a separate sample method? I can imagine its use in out-of-core learning where we want to maintain outlier statistics for a first batch, but again, Pipeline does not support this case.
Until we find a compelling use-case, I think fit_sample and [fit_]transform on the same object should be disallowed by Pipeline. I am inclined to implement the imbalanced-learn API.
[Edit 2 days later: I discovered I proposed much the same at https://gist.github.com/jnothman/274710f945e311697466]
Transforming targets, as in "scaling y and scaling back after prediction" mentioned above, seems a better fit for a meta-estimator, as it:
I think we should have a ?TransformedTargetPredictor:
from sklearn.base import BaseEstimator, clone

class TransformedTargetPredictor(BaseEstimator):
    def __init__(self, estimator, func, inverse_func=None):
        self.estimator = estimator
        self.func = func
        self.inverse_func = inverse_func

    def fit(self, X, y=None, **kwargs):
        self.estimator_ = clone(self.estimator)
        # fit the wrapped estimator on the transformed targets
        self.estimator_.fit(X, self.func(y), **kwargs)
        return self

    def predict(self, X):
        y = self.estimator_.predict(X)
        if self.inverse_func is None:
            return y
        # map predictions back to the original target space
        return self.inverse_func(y)
This would at least establish a best practice for such things, allowing people to build similar transformations (for what use case, I don't know) dependent on X, y.
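For instance, a possible usage of the class sketched above (Ridge and a log transform of the targets are just an assumed example):

import numpy as np
from sklearn.linear_model import Ridge

# fit on log1p(y), return predictions in the original target space
reg = TransformedTargetPredictor(Ridge(), func=np.log1p, inverse_func=np.expm1)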
This is the challenging case, as meta-estimators do not feel appropriate. However, it seems that we're breaking the intuition of our pipeline/scoring design if we allow this kind of transformation within a Pipeline. The contract that the output of a Pipeline's predict should look like the y that is passed into its fit seems an essential one to me. I'm therefore not convinced that it deserves sharing the same API as resampling. Hmm.
@glemaitre, do you have good use-cases for having a separate sample method?
I would say no. A fit_sample seems more appropriate, as you suggested.
Btw, an implementation and example of using a meta-CV splitter for resampling is at https://gist.github.com/jnothman/6bddbbcca71bdf9fd37e8495d70b42e8. Again, the main problem is that it does not support other training modifications such as data reduction or insertion of label noise.
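Roughly, the idea could be sketched as follows (an illustration of the approach using an under-sampling example, not the gist's actual code):

import numpy as np
from sklearn.model_selection import StratifiedKFold

class UnderSamplingSplitter:
    """Wrap a base CV splitter and under-sample the majority classes in each training fold."""

    def __init__(self, base_cv=None, random_state=0):
        self.base_cv = base_cv if base_cv is not None else StratifiedKFold(5)
        self.random_state = random_state

    def split(self, X, y, groups=None):
        rng = np.random.RandomState(self.random_state)
        y = np.asarray(y)
        for train_idx, test_idx in self.base_cv.split(X, y, groups):
            classes, counts = np.unique(y[train_idx], return_counts=True)
            n_min = counts.min()
            kept = np.concatenate([
                rng.choice(train_idx[y[train_idx] == c], n_min, replace=False)
                for c in classes
            ])
            yield kept, test_idx  # resampled training fold, untouched test fold

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.base_cv.get_n_splits(X, y, groups)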
FWIW, https://github.com/dvro/scikit-protopy/blob/master/protopy has an equivalent of "fit_[re]sample" called "reduce_data"
@jnothman yes. @dvro is the author of scikit-protopy and he has implemented some of the imbalanced-learn algorithms according to the fit_sample interface.
One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model over a large collection of texts which maximises text classification performance with a small amount of labelled data (https://github.com/scikit-learn/scikit-learn/issues/8468). In this case, the transformers need the unlabelled data at fit time, but the classifier does not, nor do they at transform/predict time.
A potential, if perhaps unparadigmatic, solution may combine the following:
import numpy as np
from sklearn.base import BaseEstimator

class DiscardUnlabeled(BaseEstimator):
    def fit_resample(self, X, y):
        # drop the unlabelled samples (marked with y == -1) at fit time
        return X[y != -1], y[y != -1]

class SemiSupervisedSplit:
    def __init__(self, base_cv):
        self.base_cv = base_cv

    def split(self, X, y):
        has_label = y != -1
        has_label_idx = np.flatnonzero(has_label)
        no_label_idx = np.flatnonzero(~has_label)
        for train_idx, test_idx in self.base_cv.split(X[has_label_idx], y[has_label_idx]):
            # train on all unlabelled samples plus the labelled training fold,
            # test only on labelled samples
            yield (np.concatenate([no_label_idx, has_label_idx[train_idx]]),
                   has_label_idx[test_idx])
I feel more and more that the most pragmatic solution might by now be to not introduce a new method, but instead to allow X to be a DataFrame that also contains the labels. That doesn't solve everything, but it makes for a much more natural workflow for most people.
I feel more and more that the most pragmatic solution might by now be to not introduce a new method, but instead to allow X to be a DataFrame that also contains the labels. That doesn't solve everything, but it makes for a much more natural workflow for most people.
I see a few limitations:
What if X is sparse (or anything other than columns, e.g. free text)?
What if y is more complex than a vector? (multi-label, for instance)
I worry a lot about data leakage in cross-validation. Scikit-learn is quite good at preventing that (and people keep complaining that they cannot do their weird cross-validation thing that has implicit leakage :) ).
Sparse X and text can be handled by DataFrames, though I haven't recently checked on how good the sparse support is. Multilabel y could be handled by a hierarchical index. I agree that separating X and y less than our current API does might have downsides. More flexibility allows people to screw up more. Can you give an example of the cross-validation issue? The .split method has access to y, so you should be able to do anything.
Sparse X and text can be handled by DataFrames, though I haven't recently checked on how good the sparse support is.
Not great, AFAIK, but I'd love to be proven wrong.
Multilabel y could be handled by a hierarchical index.
It makes things much harder and much more clumsy. Pandas is powerful, but hard to use.
Can you give an example of the cross-validation?
Mixing X and y in a transformer.
Text was never a problem iirc. Sparse data requires a copy and looks a bit half-baked: http://pandas.pydata.org/pandas-docs/stable/sparse.html
I don't think hierarchical indices are clumsy, though they require some understanding of pandas.
I don't think hierarchical indices are clumsy, though they require some understanding of pandas.
My students keep making errors that are very, very hard to debug (I do too). They are handy, but lead to voodoo-style code. I don't think that it's a good basis for an API for scikit-learn.
I don't see how this fixes things for many cases. One major feature of Pipeline is that it trains the final predictor to output predictions comparable to the evaluation ground truth. How would this be ensured if a DataFrame is used?
This is almost ready for early discussion:
The best way to look at the proposal is: https://github.com/GaelVaroquaux/enhancement_proposals/blob/transform_y/slep001/discussion.rst