Closed: bmritz closed this issue 8 years ago
@chkoar You should be able to offer some more insight on the error itself.
However, I don't see the utility of the example. Is it only a dummy example? Embedding the pipeline in another pipeline is equivalent to having a single linear one in the example that you gave.
@chkoar Is the issue due to the fact that a pipeline has both fit_transform and fit_sample during the pre_transform step? -> check there
The actual code I posted is a toy example, but I do use embedded pipelines fairly regularly.
I usually use embedded pipelines when trying out different preprocessing schemes and different feature selections. For example, say I wanted to try out MinMaxScaler, StandardScaler, and no scaling on a subset of features. I'd set up a pipeline that uses preprocessing.FunctionTransformer to subset the columns, and then, inside a for loop, wrap that pipeline into another pipeline whose second step is the scaler for that iteration of the loop. This way I can create many pipelines off of a "base" pipeline and keep track of them fairly easily.
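The pattern described above can be sketched as follows. This is a minimal illustration using plain sklearn; the column indices and the particular list of scalers are assumptions for the example, not from the thread.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler

# "Base" pipeline: subset a couple of columns via FunctionTransformer.
subset_cols = [0, 2]  # hypothetical columns of interest
base = Pipeline([
    ('subset', FunctionTransformer(lambda X: X[:, subset_cols])),
])

# Wrap the base pipeline in an outer pipeline, once per scaling scheme.
pipelines = {}
for name, scaler in [('minmax', MinMaxScaler()),
                     ('std', StandardScaler()),
                     ('none', FunctionTransformer(lambda X: X))]:
    pipelines[name] = Pipeline([('base', base), ('scale', scaler)])

X = np.arange(12, dtype=float).reshape(4, 3)
for name, pipe in pipelines.items():
    # Each variant keeps only the two subset columns, then applies its scaler.
    print(name, pipe.fit_transform(X).shape)
```

Every variant shares the same "base" step object, which is what makes it easy to keep track of many pipelines built off one preprocessing core.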
@bmritz That makes sense. I was using a list of the objects and creating the pipelines inside loops for the same thing.
Thanks for reporting.
It is more complicated than one may expect, because we reworked sklearn's Pipeline object to reuse linear transformations, as @glemaitre mentioned.
We expect samplers or transformers in the Pipeline, as stated in the docstring, and we should warn the user about this.
Exchanging if with elif in L130 solves the problem in fit.
With the current design the samplers work in the training phase only, so the sample method is actually a training method. It requires the target y. @glemaitre With the old API, where we call transform in samplers without parameters, it would be easier in this case, I think.
@bmritz Did you try to put the sampler in its own step outside of the nested Pipeline?
Yes, that is what I ended up doing. Because I wanted to understand the effect of resampling on my final model, I resampled outside the pipeline and then created two pipelines off of two training sets, one un-resampled and one resampled.
It makes sense that resamplers work only in the training phase; for validation or test data there would be no need to resample, so I see your logic there.
"I resampled outside the pipeline"
@bmritz If I am not missing something, in your case I would create an IdentityResampler that samples exactly what it is given. Then I would use grid search and validation curves to see how the model performs by varying the resampler parameter of the pipeline:
Pipeline(steps=[
    ('std', preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)),
    ('resampler', SMOTE(k=5, kind='regular', m=10, n_jobs=-1, out_step=0.5,
                        random_state=None, ratio='auto')),
    ('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                                 metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                                 weights='uniform'))
]).fit(X, y)
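The "vary the resampler parameter" idea relies on the fact that a whole pipeline step can be treated as a grid-search parameter. A minimal sketch of that mechanism, using plain sklearn transformers as stand-ins for the samplers (with imbalanced-learn one would list e.g. SMOTE() and an identity resampler instead; the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=3))])

# The entire 'scale' step is a grid parameter, just as 'resampler' would be.
grid = GridSearchCV(pipe,
                    {'scale': [StandardScaler(), MinMaxScaler()]},
                    cv=3)

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
y = rng.randint(0, 2, 30)
grid.fit(X, y)
print(grid.best_params_['scale'])  # whichever step object scored best
```

The same construction would let a validation curve compare a real sampler against a no-op baseline.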
@dvro, @glemaitre Do we need an IdentityResampler?
It could make sense to add one for consistency.
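For illustration, such an IdentityResampler could be as small as the following. This is a hypothetical class, not part of imbalanced-learn: it exposes a sampler-like API but returns the data untouched, so it can serve as a "no resampling" baseline when varying the resampler step.

```python
import numpy as np

class IdentityResampler:
    """Sampler-like object that returns X, y unchanged (hypothetical sketch)."""

    def fit(self, X, y):
        # Nothing to learn; returning self matches the estimator convention.
        return self

    def sample(self, X, y):
        # The "resampled" data is just the input data.
        return X, y

    def fit_sample(self, X, y):
        return self.fit(X, y).sample(X, y)

X = np.arange(6).reshape(3, 2)
y = np.array([0, 1, 0])
X_res, y_res = IdentityResampler().fit_sample(X, y)
print(X_res is X and y_res is y)  # True: nothing was resampled
```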
@bmritz like this
@chkoar Can you label this issue so we know when to address it, if needed?
@glemaitre done in #166
If I create a "hierarchical pipeline" (a pipeline where one step is another pipeline), then the Pipeline raises an AttributeError on .fit(), because it sees the imblearn.pipeline.Pipeline object as having a .fit_transform() attribute and thus routes it through .fit_transform(), where it then tries to call .fit_transform() on an imblearn object within one of the inner steps.
The following will reproduce the error:
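The original reproduction snippet is not preserved in this thread. As a hedged sketch of the shape being described, the nesting itself is legal with sklearn's Pipeline; the report is that substituting imblearn.pipeline.Pipeline and including a sampler step triggered the AttributeError on fit:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One step of the outer pipeline is itself a pipeline ("hierarchical pipeline").
inner = Pipeline([('std', StandardScaler())])
outer = Pipeline([('inner', inner),
                  ('knn', KNeighborsClassifier(n_neighbors=3))])

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.randint(0, 2, 20)

outer.fit(X, y)                 # works with sklearn's Pipeline
print(outer.predict(X).shape)   # the reported bug appears with imblearn's Pipeline
```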
FYI, the .predict() method on the pipeline also raises an exception with embedded pipelines, because it skips over the nested pipeline step without transforming: it sees that the Pipeline step has a fit_sample attribute.