scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Oversampling from sequential data #23

Closed victorhcm closed 8 years ago

victorhcm commented 8 years ago

Hey there! I would like to know how I should handle sequential data, that is, where the design matrix X has shape (n_samples, sequence_len, n_feature_dims) and the target vector Y contains the class labels (y_i ∈ {0, 1, 2, ..., K}), with shape (n_samples,).

What I'm currently doing is ignoring the second dimension (the sequence length) by selecting a fixed sequence position, such as X[:, 0, :]. This yields a matrix of shape (n_samples, n_feature_dims). However, I'm not sure this is the correct way to proceed, because once I oversample it is no longer possible to tell which sequence a sample belongs to.

Do you know of any workarounds?

fmfn commented 8 years ago

Very interesting question. Unfortunately I am afraid none of the methods available here were built with sequential data in mind.

I believe the most straightforward solution would be to flatten the design matrix along the second dimension. That, however, assumes all sequences have the same length, and it severely increases the dimensionality of the dataset, which poses a challenge to distance-based resampling methods.
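For illustration, a minimal NumPy sketch of that flattening (toy shapes; as noted, it assumes every sequence has the same length):

```python
import numpy as np

# Toy data with the shapes from the question: (n_samples, sequence_len, n_feature_dims)
X = np.random.randn(100, 20, 8)
n_samples, seq_len, n_feats = X.shape

# Flatten along the second dimension: each sequence becomes one long feature vector.
X_flat = X.reshape(n_samples, seq_len * n_feats)   # shape (100, 160)
```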

I imagine you must be working on an RNN-like model. If you could pre-train it in an unsupervised fashion (max margin, maybe?), then you could use a hidden, flat representation to feed the over-samplers.

victorhcm commented 8 years ago

Thanks for your response, @fmfn!

So simple, I didn't think of reshaping it. Also, great suggestion about using the hidden layer to feed the over-samplers. I'll try that in the future if the other solution doesn't work as expected.

I'm currently reshaping and trying it with SMOTE, using borderline1 and a ratio of 0.4, but it doesn't find any samples in danger:

Determining classes statistics...
6 classes detected: {1: 42, 2: 259, 3: 208, 4: 16, 5: 6, 6: 72}
Finding the 10 nearest neighbours... done!
There are no samples in danger. No borderline synthetic samples created.

Do you know what may be causing that?
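For reference, a rough sketch of this kind of run with the current imbalanced-learn API (BorderlineSMOTE; the verbose log above came from an older version that used SMOTE's kind/ratio arguments, so treat the exact call and the toy data here as assumptions):

```python
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

rng = np.random.RandomState(0)
# Toy stand-ins for the flattened design matrix and the imbalanced labels.
X_flat = rng.randn(300, 40)
y = rng.choice([1, 2, 3], size=300, p=[0.1, 0.6, 0.3])

# borderline-1 SMOTE: m_neighbors controls the "danger" check, k_neighbors the interpolation.
smote = BorderlineSMOTE(kind='borderline-1', k_neighbors=5, m_neighbors=10, random_state=0)
X_res, y_res = smote.fit_resample(X_flat, y)

# Compare class counts before and after resampling.
print("before:", np.bincount(y)[1:], "after:", np.bincount(y_res)[1:])
```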

fmfn commented 8 years ago

Hum, you are really pushing this package to its limits (which is awesome btw).

First, bSMOTE was not designed with the multi-class case in mind (sorry). I imagine you will have to do it manually, following a one-vs-all scheme. Something like:

z1 = y[y == 5]  # oversample with x, z1
z2 = y[y == 4]
...

The way you choose to approach it can be different, though; this is uncharted territory.
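A hedged sketch of that one-vs-all idea (the helper name is made up, and BorderlineSMOTE is the class in current imbalanced-learn rather than whatever version was in use in this thread):

```python
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

def oversample_one_vs_rest(X, y, target_class, random_state=0):
    """Oversample `target_class` against all the other classes lumped together."""
    y_bin = (y == target_class).astype(int)   # 1 = target minority class, 0 = everything else
    smote = BorderlineSMOTE(kind='borderline-1', random_state=random_state)
    X_res, y_bin_res = smote.fit_resample(X, y_bin)
    # The over-samplers append synthetic rows after the original ones, so everything past
    # len(X) should be new synthetic data for `target_class` (worth double-checking on your version).
    X_new = X_res[len(X):]
    X_aug = np.vstack([X, X_new])
    y_aug = np.concatenate([y, np.full(len(X_new), target_class)])
    return X_aug, y_aug
```

Calling it once per minority class (e.g. the classes with 16 and 6 samples above) reproduces the loop sketched in the comment.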

As for samples in danger:

A minority sample is in danger if at least half of its nearest neighbours belong to the majority class. The exception is a minority sample whose nearest neighbours are all from the majority class; in that case it is considered noise.
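In code, the rule reads roughly like this (an illustrative sketch of the logic, not the package's internal implementation):

```python
def danger_status(neighbour_labels, majority_label):
    """Classify one minority sample from the labels of its m nearest neighbours."""
    m = len(neighbour_labels)
    n_majority = sum(1 for label in neighbour_labels if label == majority_label)
    if n_majority == m:
        return "noise"     # every neighbour belongs to the majority class
    elif n_majority >= m / 2:
        return "danger"    # at least half of the neighbours are majority samples
    else:
        return "safe"      # the sample sits comfortably inside its own class
```

For example, danger_status([0, 0, 1, 0, 1], majority_label=0) returns "danger", since three of the five neighbours are majority samples.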

victorhcm commented 8 years ago

Haha, sorry about that :)

I think it will be straightforward to make it one-vs-all. I'll give it a try.

Thanks for explaining samples in danger. Googling it now, I see I didn't do my homework properly before asking you; the definition is in the repository and in the paper.

victorhcm commented 8 years ago

I was able to make it work using the one-against-all approach. One thing I noticed is that my minority class is so small that SMOTE with borderline1 and borderline2 didn't detect any samples in danger and considered some of them noise. SVM SMOTE was able to detect some samples in danger and to generate some synthetic samples.

Just one thing remains, though. I think the problem with doing it one-vs-all is that the number of samples in the majority class becomes even larger, which I guess makes it harder to find samples in danger (or may lead to every minority sample being treated as noise). Maybe using only one majority class (the largest one) together with the targeted minority class would give better results.
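A rough sketch of that last idea (hypothetical helper, again assuming the current imbalanced-learn BorderlineSMOTE API):

```python
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

def oversample_vs_largest_majority(X, y, minority_class, random_state=0):
    """Pair the target minority class with only the single largest class before oversampling."""
    classes, counts = np.unique(y, return_counts=True)
    largest = classes[np.argmax(counts)]
    mask = np.isin(y, [minority_class, largest])
    smote = BorderlineSMOTE(kind='borderline-1', random_state=random_state)
    # Resample only the two-class subset; the other classes are left untouched.
    return smote.fit_resample(X[mask], y[mask])
```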

fmfn commented 8 years ago

I was able to make it work using the one-against-all approach. One thing I noticed is that my minority class is so small that SMOTE with borderline1 and borderline2 didn't detect any samples in danger and considered some of them noise. SVM SMOTE was able to detect some samples in danger and to generate some synthetic samples.

bSMOTE is definitely limited by the representation of your data (the vector space it is embedded in). In a way, not finding samples in danger is a good thing: it means a simple KNN classifier might be able to separate that class. Now, whether or not that is going to generalize well...

Just one thing remains, though. I think the problem with doing it one-vs-all is that the number of samples in the majority class becomes even larger, which I guess makes it harder to find samples in danger (or may lead to every minority sample being treated as noise). Maybe using only one majority class (the largest one) together with the targeted minority class would give better results.

That's a very good point, you are probably right.