Closed victorhcm closed 8 years ago
Very interesting question. Unfortunately I am afraid none of the methods available here were built with sequential data in mind.
I believe the most straightforward solution would be to flatten the design matrix along the second dimension. That, however, assumes all sequences have the same length, and it severely increases the dimensionality of the dataset, which poses a challenge to distance-based resampling methods.
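For what it's worth, the flattening itself is a one-line reshape (the shapes below are made up for illustration):

```python
import numpy as np

# Hypothetical shapes: 100 sequences of length 20, 8 features per timestep.
n_samples, seq_len, n_features = 100, 20, 8
X = np.random.randn(n_samples, seq_len, n_features)

# Flatten each sequence into a single row so the over-samplers
# see an ordinary 2-D design matrix.
X_flat = X.reshape(n_samples, seq_len * n_features)
print(X_flat.shape)  # (100, 160)
```

The operation is lossless as long as you remember the original `(seq_len, n_features)` shape, so each row can be reshaped back into a sequence afterwards.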
I imagine you must be working on a RNN-like model. If you could pre-train it in a unsupervised fashion (max margin maybe?), then you could use a hidden, flat, representation to feed the over-samplers.
Thanks for your response, @fmfn!
So simple, I didn't think of reshaping it. Also, great suggestion about using the hidden layer to feed the over-samplers. I'll try that in the future if the other solution doesn't work as expected.
I'm currently reshaping and trying it with SMOTE, using `borderline1` and ratio 0.4, but it doesn't find samples in danger.
```
Determining classes statistics... 6 classes detected: {1: 42, 2: 259, 3: 208, 4: 16, 5: 6, 6: 72}
Finding the 10 nearest neighbours... done!
There are no samples in danger. No borderline synthetic samples created.
```
Do you know what may be causing that?
Hum, you are really pushing this package to its limits (which is awesome btw).
First, bSMOTE was not designed with the multi-class case in mind (sorry). I imagine you will have to do it manually, following a one-vs-all scheme. Something like:

```python
z1 = (y == 5).astype(int)  # binary labels: class 5 vs. the rest, aligned with the rows of X
# oversample with X, z1
z2 = (y == 4).astype(int)
# ...
```
The way you choose to approach it can be different though, this is uncharted territory.
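A minimal sketch of the one-vs-all relabeling step (the data and the `binary_targets` name here are made up for illustration):

```python
import numpy as np

# Toy labels mirroring the class statistics above (counts shrunk for brevity);
# classes 4 and 5 are the small ones we want to oversample.
y = np.array([1] * 4 + [2] * 10 + [3] * 8 + [4] * 2 + [5] * 1 + [6] * 3)
X = np.random.randn(len(y), 5)

binary_targets = {}
for cls in (4, 5):
    # One-vs-all: label the target class 1 and everything else 0,
    # keeping the labels aligned with the rows of X.
    binary_targets[cls] = (y == cls).astype(int)

print({cls: int(z.sum()) for cls, z in binary_targets.items()})  # {4: 2, 5: 1}
```

Each `(X, binary_targets[cls])` pair can then be handed to the binary over-sampler, collecting the synthetic rows per class.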
As for samples in danger:
A minority sample is in danger if more than half of its nearest neighbours belong to the majority class. The exception being a minority sample for which all its nearest neighbours are from the majority class, in which case it is considered noise.
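For concreteness, the rule can be reproduced by hand with a brute-force k-NN pass (a rough sketch with made-up data and `k`, not the package's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: a dense majority cluster (label 0) plus three
# minority points (label 1), one of them buried inside the cluster.
X_maj = rng.normal(0.0, 1.0, size=(50, 2))
X_min = np.array([[0.0, 0.0], [3.0, 3.0], [1.2, 1.0]])
X = np.vstack([X_maj, X_min])
y = np.array([0] * 50 + [1] * 3)

k = 5
status = {}
for i in np.where(y == 1)[0]:
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                      # exclude the point itself
    neighbours = np.argsort(dist)[:k]
    n_majority = int((y[neighbours] == 0).sum())
    if n_majority == k:
        status[i] = "noise"               # all neighbours are majority
    elif n_majority > k / 2:
        status[i] = "danger"              # more than half are majority
    else:
        status[i] = "safe"
```

The minority point sitting at the centre of the majority cluster ends up classified as noise, which is exactly the failure mode described in the log above.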
Haha, sorry about that :)
I think it will be straightforward to make it one-vs-all. I'll give it a try.
Thanks for explaining samples in danger. Googling it now, I see that I didn't do my homework properly before asking you, as the definition is in the repository and in the paper.
I was able to make it work using the one-against-all approach. One thing I noticed is that my minority class is so small that SMOTE with `borderline1` and `borderline2` didn't detect any samples in danger and considered some of them as noise. SVM SMOTE was able to detect some in danger and to draw some samples.
Just one thing remaining, though. I think the problem of doing it one-vs-all is that the number of samples of the majority class will become even higher, which I guess makes it harder to find samples in danger (or it may consider any minority sample as noise). Maybe using only one majority class (the larger one) along with the targeted minority class may provide better results.
> I was able to make it work using the one-against-all approach. One thing I noticed is that my unbalanced class is so small that SMOTE with `borderline1` and `borderline2` didn't detect any samples in danger and considered some of them as noise. SVM SMOTE was able to detect some in danger and to draw some samples.
bSMOTE is definitely limited by the representation of your data (vector space it is embedded in). On the one hand, not finding samples in danger is a good thing, it means a simple KNN classifier might be able to separate that class. Now, whether or not that is going to generalize well...
> Just one thing remaining, though. I think the problem of doing it one-vs-all is that the number of samples of the majority class will become even higher, which I guess makes it harder to find samples in danger (or it may consider any minority sample as noise). Maybe using only one majority class (the larger one) along with the targeted minority class may provide better results.
That's a very good point, you are probably right.
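That pairing is just a mask over the two classes. A small sketch (class labels echo the statistics above, counts made up):

```python
import numpy as np

# Toy stand-ins for the real data; class 2 is the largest majority
# class and class 5 the targeted minority class, as in the stats above.
y = np.array([2] * 30 + [5] * 3 + [3] * 20)
X = np.random.randn(len(y), 4)

# Keep only the largest majority class and the targeted minority class,
# then hand this reduced binary problem to the over-sampler.
mask = np.isin(y, (2, 5))
X_pair, y_pair = X[mask], y[mask]
print(X_pair.shape[0])  # 33
```

Shrinking the majority side this way should make it easier for the danger test to fire, since fewer majority points crowd each minority sample's neighbourhood.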
Hey there! I would like to know how I should handle sequential data, that is, the design matrix `X` is of the form `(n_samples, sequence_len, n_feature_dims)` and the target vector `Y` comprises the class labels (`Y_i ∈ {0, 1, 2, ..., K}`), shaped as `(n_samples,)`.

What I'm currently doing is ignoring the second dimension (the sequence length) by selecting a fixed sequence position, such as `X[:, 0, :]`, which yields a matrix shaped as `(n_samples, n_feature_dims)`. However, I'm not sure this is the correct way to proceed, because once I sample, it will not be possible to know which sequence a sample belongs to.

Do you know of any workarounds?