Closed joaquinvanschoren closed 8 years ago
nope, smote does not handle multiclass at the moment.
i checked the original paper for a few secs; i am very sure they don't discuss this case there either.
i think extensions are possible, although i also think this is "effort", both intellectually and w.r.t. code.
what performance measure do you optimize?
We want to optimize Weighted Kappa. Does that matter?
just wanted to know. does mlr support this?
so, my answer i think is: mlr supports simple threshold optimization at the moment for multiclass. that might help. and class-reweighting. that might also help.
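To make the class-reweighting idea concrete, here is a minimal sketch in plain Python. It is not mlr's implementation; the function name `inverse_frequency_weights` and the weighting rule (weight = n / (k * class size), a common inverse-frequency heuristic) are assumptions for illustration only.

```python
from collections import Counter


def inverse_frequency_weights(y):
    """Return {label: weight}, with rarer classes getting larger weights.

    Hypothetical helper: uses the heuristic n / (k * n_c), where n is the
    number of samples, k the number of classes and n_c the class size.
    """
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * size) for label, size in counts.items()}


labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 3
weights = inverse_frequency_weights(labels)
# Each class contributes the same total weight (n / k) to a weighted loss.
print(weights)
```

With this rule every class carries the same total weight, so a learner that accepts per-observation weights no longer favours the majority class simply because it is larger.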
for stuff like smote somebody (like you) would need to work on an extension. we have too many issues to solve at the moment. but this topic does interest me. i will also make a note for a student project.
more thoughts on this:
I guess an extension of smote could be doable and simple. We currently support a single numeric factor that tells us how much to oversample the minority class. We could allow passing a named vector here with k-1 entries, one per class other than the majority class. Each entry would specify how much to oversample that class. We then just add a loop to smote.
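The loop described above could look roughly like the following sketch. It is plain Python rather than mlr code, and everything in it is an assumption for illustration: the names `smote_one_class` and `smote_multiclass`, the dict `rates` standing in for the named vector, and the convention that a rate of 2 doubles a class. The interpolation step is a simplified numeric-only SMOTE.

```python
import random
from collections import Counter


def smote_one_class(points, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a sampled
    point and one of its k nearest neighbours from the same class."""
    rng = rng or random.Random(0)
    if len(points) < 2 or n_new <= 0:
        return []
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Same-class candidates, nearest first (squared Euclidean distance).
        others = [p for p in points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def smote_multiclass(X, y, rates, k=5, seed=0):
    """Oversample every class listed in `rates` (label -> factor >= 1)."""
    rng = random.Random(seed)
    X_out, y_out = list(X), list(y)
    for label, rate in rates.items():  # the added loop: one pass per class
        members = [x for x, lab in zip(X, y) if lab == label]
        n_new = int(round((rate - 1) * len(members)))
        for x_new in smote_one_class(members, n_new, k, rng):
            X_out.append(x_new)
            y_out.append(label)
    return X_out, y_out


X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [5, 6], [9, 9], [9, 8]]
y = ["a"] * 4 + ["b"] * 3 + ["c"] * 2
X_new, y_new = smote_multiclass(X, y, rates={"b": 2, "c": 3})
print(sorted(Counter(y_new).items()))  # [('a', 4), ('b', 6), ('c', 6)]
```

Classes missing from `rates` are left untouched, which matches the idea of a vector with k-1 entries: the majority class simply has no entry.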
you would still need to do a PR for this.
I've talked to people who solve it this way: say you have 3 classes, what you do is:
You can do this for any number of classes, you just run SMOTE (c-1) times. Does that make sense?
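One reading of "run SMOTE (c-1) times" is to pair each minority class with the majority class and oversample it until the two are roughly the same size. The sketch below only computes the per-class oversampling factors that such a scheme would use; the function name `balancing_rates` and the choice to balance fully up to the majority size are assumptions, not something stated in this thread.

```python
from collections import Counter


def balancing_rates(y):
    """Return {label: factor} for the c-1 non-majority classes, where
    factor = majority size / class size (1.0 means no oversampling)."""
    counts = Counter(y)
    majority_label, majority_size = counts.most_common(1)[0]
    return {
        label: majority_size / size
        for label, size in counts.items()
        if label != majority_label
    }


labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
print(balancing_rates(labels))  # {'b': 2.0, 'c': 3.0}
```

Each entry corresponds to one SMOTE run, so a problem with c classes yields c-1 runs. If two classes tie for the largest size, one of them is treated as the majority and the other gets a factor of 1.0.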
i need some context here:
what is this:
a) a pure feature request (that we should implement it), or
b) would somebody, you or someone from your side, help with this?
and also:
1) do you simply need this in a project, so you get some "results", or
2) do you want to write a paper on the multiclass case?
@joaquinvanschoren,
Can you give an example of how you applied SMOTE to multiple classes?
SMOTE can be applied to multiclass problems by iterating over pairs of classes (the majority class and one minority class at a time) and balancing each minority class against the majority. As mentioned above, this takes (c-1) runs. This works for non-sequential data. But for sequential tasks we need to at least keep the sequence intact within a neighbourhood. For example, with a bag-of-words over the words of a paragraph this is not a problem. But performing SMOTE c-1 times leaves us with the problem of keeping the sequence intact (at least to some extent). Has anyone tried to work through such a problem?
Another idea is to generate noise based on the minibatch distribution over minority samples to oversample them. But there is no standard implementation for such tasks.
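A very rough sketch of the noise idea, under several assumptions: the "minibatch" is just the list of minority samples passed in, the distribution is summarised by per-feature standard deviations, and new samples are existing ones plus scaled Gaussian noise. The function name `jitter_oversample` and the `scale` parameter are made up for illustration, and this does not address the sequence problem.

```python
import random
import statistics


def jitter_oversample(minority_points, n_new, scale=0.1, seed=0):
    """Create n_new noisy copies of minority samples.

    Noise is Gaussian with a per-feature standard deviation of
    scale * (spread of that feature within the given batch).
    """
    rng = random.Random(seed)
    columns = list(zip(*minority_points))
    spreads = [statistics.pstdev(col) for col in columns]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        synthetic.append(
            [v + rng.gauss(0.0, scale * s) for v, s in zip(base, spreads)]
        )
    return synthetic


batch = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
for point in jitter_oversample(batch, n_new=4):
    print(point)
```

Features that are constant within the batch receive no noise, since their spread is zero.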
We're trying to use smote to handle an imbalanced classification problem (also see mlr-mbo issue 213)
However, SMOTE only seems to handle binary problems right now. Is there any reason why it should not work on multiclass problems? Is this an easy extension to make? Thanks!