scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Add support to multilabel #340

Closed glemaitre closed 6 years ago

glemaitre commented 7 years ago

We should add support for multilabel targets when y can be converted back to multiclass, i.e. when each row of the indicator matrix sums to one.

chkoar commented 7 years ago

Are we talking about multilabel or multioutput/multiclass?

glemaitre commented 7 years ago

Those are always confusing; an example will speak for itself (but it should be a multilabel case encoding a multiclass one):

[[0 0 1]
 [1 0 0]
 [0 1 0]]

is a multilabel-indicator type encoding the following:

[[2]
 [0]
 [1]]
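
Since each row sums to one, the round trip back to multiclass is just a one-hot decode. A minimal numpy sketch (illustrative only, not imblearn API):

```python
import numpy as np

# Multilabel-indicator matrix where each row sums to one,
# i.e. it is really a one-hot encoded multiclass target.
Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

# Safe to convert back to multiclass only if every row has exactly one 1.
assert (Y.sum(axis=1) == 1).all()

y = Y.argmax(axis=1)
print(y)  # [2 0 1]
```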
chkoar commented 7 years ago

I wouldn't call it multilabel. It is a binarized version of the target, right? I am -1 for adding that logic inside the algorithms. We could use the LabelBinarizer for that, no?

massich commented 7 years ago

@chkoar I think that @glemaitre is referring to providing the same support for y as scikit-learn does (see here).

MarcoNiemann commented 6 years ago

Well, shouldn't multi-label be:

[[0,1,1],
 [1,0,0],
 [0,1,0],
 [1,0,1],
 [1,0,1],
 ...]

Because the version mentioned by @glemaitre appears, as @chkoar stated, to be a binarized version of a multi-class problem. The difference between multi-class and multi-label is that multi-class allows only a single class per target instance, whereas multi-label allows an arbitrary number of class assignments.

For an implementation one might consider the label powerset transformation of multi-label data into a multi-class data set. E.g., for the data set above one might apply the following transformation:

[[1],
 [2],
 [3],
 [4],
 [4],
 ...]
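
The powerset mapping above can be sketched without any extra dependency (illustrative code; the variable names are made up and this is not part of imblearn or skmultilearn):

```python
import numpy as np

Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [1, 0, 1]])

# Label powerset: every distinct label combination becomes one class id.
combos = {}
y_powerset = np.array([combos.setdefault(tuple(row), len(combos) + 1)
                       for row in Y])
print(y_powerset)  # [1 2 3 4 4]

# Inverse: recover the original label rows from the class ids.
inverse = {v: np.array(k) for k, v in combos.items()}
Y_back = np.vstack([inverse[c] for c in y_powerset])
assert (Y_back == Y).all()
```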

For anyone searching for a quick and dirty solution: I have had some success with the following approach:

from skmultilearn.problem_transformation import LabelPowerset
from imblearn.over_sampling import RandomOverSampler

# Import a dataset with X and multi-label y

lp = LabelPowerset()
ros = RandomOverSampler(random_state=42)

# Applies the above stated multi-label (ML) to multi-class (MC) transformation.
yt = lp.transform(y)

X_resampled, y_resampled = ros.fit_sample(X, yt)

# Inverts the ML-MC transformation to recreate the ML set
y_resampled = lp.inverse_transform(y_resampled)

(The skmultilearn package is used for convenience's sake, to avoid a custom transformation!)

glemaitre commented 6 years ago

imblearn accepts one-vs-all encoding by default from now on.

j-greer commented 6 years ago

@MarcoNiemann your solution works well when the imbalance occurs across the ith dimension of y rather than the jth.

Expanding upon your example:

[[0,1,1],
 [0,1,1],
 [1,1,1],
 [1,1,1],
 [1,1,1],
 ...]

Can be considered imbalanced along rows but take the following example:

[[0,0,1],
 [1,0,0],
 [1,0,0],
 [1,1,0],
 [1,1,0],
 ...]

This is imbalanced in the sense that the third label column is mostly zero. Do you know of a way of addressing this type of imbalance problem using imbalanced-learn? @glemaitre
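
One simple way to quantify the column-wise (per-label) imbalance described here is the positive rate of each label column; a sketch (variable names are illustrative):

```python
import numpy as np

Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])

# Fraction of positive samples per label (column). A label with a very low
# positive rate is a minority label; powerset-based oversampling does not
# directly target this.
positive_rate = Y.mean(axis=0)
print(positive_rate)  # [0.8 0.4 0.2]
```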

rjurney commented 5 years ago

@glemaitre This seems to be an unsolved problem in the Python space. Support for this would be amazing.

glemaitre commented 5 years ago

@rjurney The issue is that the literature does not address this problem, so I am not really sure how we could go forward. It would be cool to have an overview of the full literature; it has been a while since I looked at it.

HabeebullahEbrahemi commented 5 years ago

Just correcting the import for my case (Python 3.7):

from skmultilearn.problem_transform import LabelPowerset

daanvdn commented 5 years ago

@glemaitre, I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:

Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.

There is also an (open-source) java implementation on github: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java

aamin21 commented 5 years ago

Any update on this? Stuck on this one.

woolr commented 4 years ago

@daanvdn do you know if anyone has implemented this in Python?

daanvdn commented 4 years ago

Not that I know of..


alfredsasko commented 4 years ago

@daanvdn, @glemaitre I read the article referenced by @daanvdn. The researchers claim MLSMOTE is superior on highly imbalanced multi-label datasets compared to other popular algorithms like BR, RAkEL, and CLR. They also provide pseudocode for the algorithm. I am trying to implement it in my project; once I succeed, I will share the code with you.

t-lini commented 4 years ago

It might be worth also considering ML-ROS and ML-RUS, multilabel random over- and undersampling methods respectively, which were introduced by the authors of the article referenced by @daanvdn in an earlier article:

F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera, "Addressing imbalance in multilabel classification: measures and random resampling algorithms," Neurocomputing 163(9) (2015) 3-16, http://dx.doi.org/10.1016/j.neucom.2014.08.091

These algorithms might be a good choice if you do not want to, or cannot, use synthetic resampling methods. Java implementations are also available in the MULAN package:

https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MultiLabelRandomOverSampling.java
https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MutilLabelRandomUnderSampling.java

I will try to implement these methods in Python.
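
The imbalance measures those papers build on, IRLbl and MeanIR from Charte et al., are straightforward to compute; a minimal sketch, assuming Y is a binary indicator matrix where every label occurs at least once (function names are illustrative):

```python
import numpy as np

def irlbl(Y):
    """Per-label imbalance ratio: majority label count / this label's count."""
    counts = Y.sum(axis=0)
    return counts.max() / counts

def mean_ir(Y):
    """Mean imbalance ratio across all labels."""
    return irlbl(Y).mean()

Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])

print(irlbl(Y))    # [1. 2. 4.]
print(mean_ir(Y))  # ~2.33
```

In the resampling algorithms from those papers, labels with IRLbl above MeanIR are, if I read them correctly, the ones treated as minority labels and targeted for resampling.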

chkoar commented 4 years ago

I will try to implement these methods in Python.

That would be a great addition

SimonErm commented 4 years ago

I have tried to implement MLSMOTE in Python, but since I am not an experienced Python programmer, it consists of a lot of Stack Overflow solutions and ugly code. As far as the logic is concerned, it should be correct: https://gist.github.com/SimonErm/b06c236cafdeb79fdf7adb90aef04fec

chkoar commented 4 years ago

@SimonErm I encourage you to add docstrings, write comments with your intention wherever you think it is appropriate, write some tests and open a PR in draft mode, so we could discuss your code in the PR.

Vishnux0pa commented 4 years ago

@SimonErm I tried your code and it works, but it generates a random number of samples, i.e. I can't specify how many samples I need. Is there a way to do that? Also, it would be good if you could share the paper.

SimonErm commented 4 years ago

@Vishnux0pa That's because the number of generated samples is driven by the imbalance ratio of each label, which is also described in the paper. You can find a reference in the description of the PR. It's the same one mentioned by @daanvdn:

I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:

Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.

There is also an (open-source) java implementation on github: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java

xelandar commented 4 years ago

As far as I can see, another implementation of MLSMOTE can be found here (via this Medium article). I haven't tested it yet, but I thought it would be good to share it in the relevant thread.

chkoar commented 4 years ago

@xelandar there is already a PR here, but it hasn't been reviewed yet, probably due to lack of time.

balvisio commented 2 years ago

I have created a new PR that implements MLSMOTE: https://github.com/scikit-learn-contrib/imbalanced-learn/pull/927.

imaspol commented 1 month ago

Hi, it would be great to have a version of classification_report_imbalanced for multilabel imbalanced data. Do you plan to implement it?