Closed — glemaitre closed this issue 6 years ago
Are we talking about multilabel or multioutput/multiclass?
Those are always confusing. An example will speak for itself (but it should be a multilabel case encoding a multiclass one):
[[0 0 1]
[1 0 0]
[0 1 0]]
is a multilabel-indicator type encoding the following:
[[2]
[0]
[1]]
I wouldn't call it multilabel. It is a binarized version of the target, right? I am -1 for adding that logic inside the algorithms. We could use the LabelBinarizer for that, no?
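To illustrate the point above: scikit-learn's `LabelBinarizer` round-trips between the multiclass column and the indicator matrix from the example (a small sketch, using the same toy labels as in the thread):

```python
from sklearn.preprocessing import LabelBinarizer

# The multiclass target from the example above.
y_multiclass = [2, 0, 1]

lb = LabelBinarizer()
y_indicator = lb.fit_transform(y_multiclass)
# Classes are ordered [0, 1, 2], so the rows become:
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]
print(y_indicator)

# And the indicator matrix converts back to the multiclass target.
print(lb.inverse_transform(y_indicator))  # [2 0 1]
```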
@chkoar I think that @glemaitre refers to provide the same support for y as scikit-learn does ( see here )
Well, shouldn't multi-label be:
[[0,1,1],
[1,0,0],
[0,1,0],
[1,0,1],
[1,0,1],
...]
Because the version mentioned by @glemaitre appears, as stated by @chkoar, to be a binarized version of a multi-class problem. But the difference between multi-class and multi-label is that multi-class only allows the assignment of a single class to the target instance, whereas in a multi-label case it can be an arbitrary number of class assignments.
For an implementation one might consider the label powerset transformation of multi-label data into a multiclass data set. So e.g. for the data set above one might apply the following transformation:
[[1],
[2],
[3],
[4],
[4],
...]
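The label-powerset mapping sketched above can be written in a few lines of plain Python (the helper name `powerset_transform` is illustrative, not a library API):

```python
# Minimal label-powerset sketch: each distinct label combination
# is assigned its own integer class, starting from 1.
def powerset_transform(Y):
    mapping = {}
    y_mc = []
    for row in Y:
        key = tuple(row)
        if key not in mapping:
            mapping[key] = len(mapping) + 1
        y_mc.append(mapping[key])
    return y_mc, mapping

# The multi-label example from above.
Y = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0],
     [1, 0, 1],
     [1, 0, 1]]
y_mc, mapping = powerset_transform(Y)
print(y_mc)  # [1, 2, 3, 4, 4]
```

After resampling the multiclass target, the inverse of `mapping` recovers the original label rows.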
For all people searching for a quick-and-dirty solution, I appear to have had some success with the following:

from skmultilearn.problem_transform import LabelPowerset
from imblearn.over_sampling import RandomOverSampler

# Assume a dataset with features X and a multi-label indicator matrix y has been loaded.
lp = LabelPowerset()
ros = RandomOverSampler(random_state=42)

# Apply the above-stated multi-label (ML) to multi-class (MC) transformation.
yt = lp.transform(y)
X_resampled, y_resampled = ros.fit_resample(X, yt)  # fit_sample in older imblearn releases

# Invert the ML-MC transformation to recreate the ML set.
y_resampled = lp.inverse_transform(y_resampled)

(The skmultilearn package is used for convenience's sake, to avoid a custom transformation!)
imblearn accepts one-vs-all encoding by default from now on.
@MarcoNiemann your solution works well when the imbalance occurs across the rows of y (samples) rather than the columns (labels).
Expanding upon your example:
[
[0,1,1],
[0,1,1],
[1,1,1],
[1,1,1],
[1,1,1],
[1,1,1],
...
]
Can be considered imbalanced along rows but take the following example:
[
[0,0,1],
[1,0,0],
[1,0,0],
[1,1,0],
[1,1,0],
...
]
This is imbalanced in the sense that the third label column (y_i3) is mostly zero. Do you know of a way of addressing this type of imbalance problem using imbalanced-learn? @glemaitre
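For what it's worth, this column-wise imbalance is easy to quantify by looking at the per-label positive frequency (a minimal sketch using the second example above):

```python
import numpy as np

# The second example from above: the third label is mostly zero.
Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])

# Fraction of positive samples per label (column-wise imbalance).
label_freq = Y.mean(axis=0)
print(label_freq)  # [0.8 0.4 0.2]
```

A row-wise resampler like the powerset trick does not directly target these per-column frequencies, which is the gap discussed here.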
@glemaitre This seems an unsolved problem in the Python space. Support for this would be amazing.
@rjurney The issue is that the literature does not address this problem, so I am not really sure how we could go forward. It would be cool to have an overview of the full literature. It has been a while since I last looked at it.
from skmultilearn.problem_transform import LabelPowerset
@glemaitre, I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:
Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.
There is also an (open-source) java implementation on github: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java
Any update on this? Stuck on this one.
@daanvdn do you know if anyone has implemented this in Python?
Not that I know of..
@daanvdn, @glemaitre I read the article referenced by @daanvdn. The researchers claim MLSMOTE to be superior on highly imbalanced multi-label datasets compared to other popular algorithms like BR, RAkEL, and CLR. They also provide pseudocode for the algorithm. I am trying to implement it in my project; once I succeed, I will share the code with you.
It might be worth also considering ML-ROS and ML-RUS as multilabel random over- and undersampling methods respectively, which were introduced by the authors of the article referenced by @daanvdn in an article prior to MLSMOTE, see:
F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera, Addressing imbalance in multilabel classification: measures and random resampling algorithms, Neurocomputing 163(9) (2015) 3–16, http://dx.doi.org/10.1016/j.neucom.2014.08.091.
These algorithms might be a good choice if you do not want to, or cannot, use synthetic resampling methods. Implementations in Java are also available in the MULAN package:
https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MultiLabelRandomOverSampling.java
https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MutilLabelRandomUnderSampling.java
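To make the idea concrete, here is a heavily simplified sketch in the spirit of ML-ROS. The function name, the fixed growth budget, and the one-shot minority selection are my assumptions; the actual algorithm in Charte et al. (2015) clones samples in a loop while re-evaluating IRLbl and MeanIR:

```python
import numpy as np

rng = np.random.default_rng(42)

def ml_ros(X, Y, sample_ratio=0.25):
    """Simplified ML-ROS-style oversampling sketch (not the full algorithm):
    clone random instances that carry minority labels until the dataset
    has grown by sample_ratio."""
    X, Y = np.asarray(X), np.asarray(Y)
    counts = Y.sum(axis=0).astype(float)
    # IRLbl: imbalance ratio per label = most frequent count / this label's count.
    irlbl = counts.max() / np.maximum(counts, 1)
    mean_ir = irlbl.mean()
    # Minority labels are those rarer than average (and actually present).
    minority = np.flatnonzero((irlbl > mean_ir) & (counts > 0))
    n_new = int(len(Y) * sample_ratio)
    clones = []
    for _ in range(n_new):
        label = rng.choice(minority)             # pick a minority label...
        rows = np.flatnonzero(Y[:, label] == 1)  # ...and clone one of its samples
        clones.append(rng.choice(rows))
    return np.vstack([X, X[clones]]), np.vstack([Y, Y[clones]])

# Demo on the small column-imbalanced example from earlier in the thread.
X = np.arange(5).reshape(5, 1)
Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])
X_res, Y_res = ml_ros(X, Y)
print(len(X_res))  # 6: one clone of a sample carrying the rarest label
```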
I will try to implement these methods in Python.
That would be a great addition
I have tried to implement MLSMOTE in Python, but since I am not an experienced Python programmer, it consists of a lot of Stack Overflow solutions and ugly code. As far as the logic is concerned, it should be correct. https://gist.github.com/SimonErm/b06c236cafdeb79fdf7adb90aef04fec
@SimonErm I encourage you to add docstrings, write comments with your intention wherever you think it is appropriate, write some tests and open a PR in draft mode, so we could discuss your code in the PR.
@SimonErm I tried your code and it works, but it generates a random number of samples, i.e., I can't specify how many samples I need. Is there a way to do that? Also, it would be good if you could share the paper.
@Vishnux0pa That's because the number of generated samples is driven by the imbalance ratio of each label, which is also described in the paper. You can find a reference in the description of the PR. It's the same one mentioned by @daanvdn:
I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:
Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.
There is also an (open-source) java implementation on github: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java
@xelandar there is already a PR here, but it hasn't been reviewed yet, probably due to lack of time.
I have created a new PR that implements MLSMOTE: https://github.com/scikit-learn-contrib/imbalanced-learn/pull/927.
Hi, it would be great to have a version of classification_report_imbalanced for multilabel imbalanced data. Do you plan to implement it?
We should add support for multilabel when y can be converted back to multiclass. It means that the sum of each row should be one.
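That conversion is a one-liner when the row-sum condition holds (a minimal sketch; the `argmax` choice assumes exactly one positive entry per row):

```python
import numpy as np

# The indicator matrix from the very first example in the thread.
Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

# The conversion is only well defined when each row has exactly one label.
assert (Y.sum(axis=1) == 1).all()

y_multiclass = Y.argmax(axis=1)
print(y_multiclass)  # [2 0 1]
```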