scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License
6.75k stars 1.27k forks

New methods #105

Open glemaitre opened 7 years ago

glemaitre commented 7 years ago

This is a non-exhaustive list of methods that could be added for the next release.

Oversampling:

Prototype Generation/Selection:

Ensemble:

Regression:

P. Branco, L. Torgo and R. Ribeiro (2016). A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49, 2, 31. DOI: http://dx.doi.org/10.1145/2907070

Branco, P., Torgo, L. and Ribeiro, R.P. (2017). "Pre-processing Approaches for Imbalanced Distributions in Regression". Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal (submitted).

glemaitre commented 7 years ago

@dvro @chkoar you can add anything there. We can make a PR to add these items to the todo list.

We should also discuss where these methods will be added (under-/over-sampling or new module)

chkoar commented 7 years ago

SGP should be placed in a new module/package, as in scikit-protopy. `generation` is a reasonable name for this kind of algorithm.

glemaitre commented 7 years ago

@chkoar What would be the reason to disassociate over-sampling and generation?

chkoar commented 7 years ago

Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.

dvro commented 7 years ago

@glemaitre actually, oversampling is different from prototype generation:

- Prototype Selection: given a set of samples S, a PS method selects a subset S', where S' ⊆ S and |S'| < |S|
- Prototype Generation: given a set of samples S, a PG method generates a new set S', where |S'| < |S|
- Oversampling: given a set of samples S, an OS method generates a new set S', where |S'| > |S| and S ⊆ S'
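The three definitions can be illustrated with a small NumPy sketch (illustrative only: the `S_sel`/`S_gen`/`S_os` names, the group-mean generation step, and the midpoint oversampling are ad hoc stand-ins, not any particular PS/PG/OS method):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 2))  # original set of samples

# Prototype Selection: S_sel is a subset of S (rows drawn from S itself).
S_sel = S[rng.choice(len(S), size=5, replace=False)]

# Prototype Generation: S_gen is a *new*, smaller set of points that need
# not belong to S (here: two ad hoc group means).
S_gen = np.stack([S[:10].mean(axis=0), S[10:].mean(axis=0)])

# Oversampling: S_os contains all of S plus synthetic points, |S_os| > |S|.
synthetic = (S[:5] + S[5:10]) / 2  # SMOTE-style midpoints
S_os = np.vstack([S, synthetic])

assert len(S_sel) < len(S) and len(S_gen) < len(S) and len(S_os) > len(S)
```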

chkoar commented 7 years ago

Thanks for the clarification @dvro. That could be placed in the wiki!

dabrze commented 7 years ago

Hi,

If by SPIDER you mean the algorithms from "Selective Pre-processing of Imbalanced Data for Improving Classification Performance" and "Learning from imbalanced data in presence of noisy and borderline examples", maybe I could be of some help. I know the authors, and maybe I could implement a Python version of this algorithm with their "supervision"? That might be "safer" than using only the pseudo-code from conference papers.

glemaitre commented 7 years ago

Yes, it is this article. We would be happy to have a PR on that. We are going to hold a sprint at some point to develop some of the above methods.

The only important thing is to follow the scikit-learn conventions regarding the estimator, but this is something we will also take care of during review.

chkoar commented 6 years ago

MetaCost could be a nice addition.

glemaitre commented 6 years ago

Yep. You can add it to the list above.

mwydmuch commented 6 years ago

Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (an under-bagging method) with an extension for multiclass problems (based on this article), written as an extension of the bagging class from sklearn a few months ago. I will gladly polish this implementation to match this package's conventions for bagging classifiers and make a pull request if you are interested in such a contribution.

chkoar commented 6 years ago

@mwydmuch PRs are always welcome. With the addition of #360 we will start the ensemble methods module, and I think that we'll deprecate the current ensemble-based samplers.

chkoar commented 6 years ago

@glemaitre do you think that we should have requirements, e.g. number of citations, before we merge an implementation into the package?

glemaitre commented 6 years ago
I would say no. This is something that scikit-learn does, but the contribs exist to give some freedom in that regard and to host bleeding-edge estimators. I would just require that the estimator show some advantage on a benchmark, so that we can explain to users when to use it.

chkoar commented 6 years ago

@glemaitre I was thinking of asking @mwydmuch to include a comparison with the BalancedBaggingClassifier (#360), but I thought that would be a nice addition after the implementation, not a requirement. I think that we are on the same side here. Apart from that, we do actually have requirements, like the dependencies, right?

glemaitre commented 6 years ago

yes, regarding the dependencies, we are limiting them to numpy/scipy/scikit-learn only. Then, we can see if we can vendor, but it should be avoided as much as possible.

Regarding the comparison, it is a bit my point when making a benchmark. I need to fix #360 in fact :)

mwydmuch commented 6 years ago

Thank you for comments, I will look into #360 then. And I can also prepare comparison between these methods :)

souravsingh commented 6 years ago

@glemaitre I would be interested in adding RUSBoost to the package. Would it be fine if we inherit the code from AdaBoost, since RUSBoost is similar to AdaBoost except for small changes in the training part?

glemaitre commented 6 years ago

@souravsingh it looks like it.
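The idea discussed above — standard AdaBoost weight updates, but with each round's weak learner trained on a randomly undersampled, class-balanced subset — could be sketched roughly like this (a toy binary-classification sketch, not imbalanced-learn's actual RUSBoostClassifier; the function names and the depth-1 stump are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rusboost_fit(X, y, n_rounds=10, seed=0):
    """Toy RUSBoost sketch for binary y in {0, 1}: AdaBoost-style weight
    updates, but each round's weak learner is fit on a randomly
    undersampled, class-balanced subset."""
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1.0 / len(y))
    minority = np.bincount(y).min()
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Random undersampling: draw `minority` samples from every class.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=minority, replace=False)
            for c in np.unique(y)
        ])
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X[idx], y[idx], sample_weight=w[idx])
        miss = stump.predict(X) != y
        err = w[miss].sum() / w.sum()
        if err >= 0.5:  # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(alpha * np.where(miss, 1.0, -1.0))  # up-weight mistakes
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def rusboost_predict(X, learners, alphas):
    # Weighted vote of the weak learners, with labels mapped to {-1, +1}.
    votes = sum(a * (2 * clf.predict(X) - 1) for clf, a in zip(learners, alphas))
    return (votes > 0).astype(int)
```

The real implementation would instead plug the undersampling step into the `_boost` machinery inherited from AdaBoost, as suggested in the comment above.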

chandreshiit commented 6 years ago

@glemaitre Hi, I've worked on class-imbalance problems in the past. Over-/under-sampling is too costly for big-data problems. In my case, I tried to train an oversampler on a dataset of size 0.2 million x 1.4K; it ran out of memory on a PC with 32 GB of RAM even though I was using a pandas sparse dataframe. Therefore, I would suggest categorising the algorithms into two categories: 1. algorithms for small datasets, 2. algorithms for big datasets. In the second category, we could implement methods such as cost-sensitive, distributed, and online class-imbalance learning. This would make the API general-purpose, suitable for small-scale as well as large-scale datasets.

glemaitre commented 6 years ago

@chandu8542 I don't especially agree with your categorization, which is based more on an engineering point of view than on a "method" point of view. For instance, it would not make sense to place the SVM classifier in a small-scale-problem category; the important thing is that the SVM classifier should live under an SVM module.

However, your comments are useful and should be used to improve the user guide/docstring of the different methods.

Regarding the cost-sensitive methods, we need to implement some and they would be useful for sure :)

chandreshiit commented 6 years ago

@glemaitre The "method" point of view is embedded within the engineering point of view if you take a bird's-eye view. This would ease the task of practitioners in choosing the right method for the problem at hand. That should be the sole purpose of open-source libraries: to make them usable for the research community as well as practitioners. Otherwise, people (researchers/practitioners) will find it difficult to choose the right method when several methods sit under one umbrella. The rest is up to you.

massich commented 6 years ago

I think that everyone is right in this discussion. However, I agree with @glemaitre that the main indexing should be by method type, not characteristic. But it would be worthwhile to apply @chandu8542's criteria in the benchmarking, to see how all algorithms perform in terms of memory, speed, etc. on datasets of different sizes. Of course, such a benchmark should come with narrative documentation to guide the user's choice of method. As always, PRs are welcome. We would gladly put our time into reviewing such a PR so that nobody ever again faces the same troubles.

chkoar commented 5 years ago

Cluster-Based Oversampling [1]

  1. Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.

chkoar commented 5 years ago

Random-SMOTE [1]

  1. Dong, Y., & Wang, X. (2011). A new over-sampling approach: Random-SMOTE for learning from imbalanced data sets. In International Conference on Knowledge Science, Engineering and Management.

chkoar commented 5 years ago

Supervised Over-Sampling [1]

  1. Hu, J., He, X., Yu, D. J., Yang, X. B., Yang, J. Y., & Shen, H. B. (2014). A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLoS ONE, 9(9), e107676.

lrq3000 commented 4 years ago

A new one:

Sharifirad, S., Nazari, A., & Ghatee, M. (2018). Modified smote using mutual information and different sorts of entropies. arXiv preprint arXiv:1803.11002.

Includes MIESMOTE, MAESMOTE, RESMOTE and TESMOTE.

Since SMOTE is mostly a meta-algorithm that interpolates new samples, with a strategy that changes depending on the author, would it be possible to implement a generic SMOTE model where the user can provide a custom function to create their own version of SMOTE? This might also ease the writing (and contribution) of new SMOTE variants.
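Such a generic SMOTE skeleton with a pluggable interpolation strategy might look like this (a hypothetical sketch, not an imbalanced-learn API; the `interpolate` callback signature, the brute-force neighbour search, and all names are assumptions for illustration):

```python
import numpy as np

def generic_smote(X_min, n_new, interpolate, k=5, seed=0):
    """SMOTE skeleton with a pluggable interpolation strategy.

    `interpolate(x, neighbor, rng)` returns one synthetic sample; swapping
    this callable in would give SMOTE variants that differ only in how new
    points are placed between a sample and its neighbour."""
    rng = np.random.default_rng(seed)
    # Brute-force k nearest neighbours inside the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(nn.shape[1])]
        out.append(interpolate(X_min[i], X_min[j], rng))
    return np.array(out)

# Classic SMOTE: uniform linear interpolation between the two points.
classic = lambda x, nb, rng: x + rng.random() * (nb - x)
```

A variant such as Random-SMOTE or an entropy-weighted scheme would then only need to supply a different `interpolate` callable.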

halimkas commented 4 years ago

Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (Under-Bagging method) with an extension for multiclass problems (based on this article) as an extension of bagging class from sklearn, made a few months ago. I will gladly polish this implementation to match this package conventions for bagging classifiers and made a pull request if you are interested in such contribution.

Hi Marek, kindly share with me the Python implementation of Roughly Balanced Bagging; I will be grateful for your help.

Thank you.

Haleem

Matgrb commented 4 years ago

Hello,

I am writing because in my current use case I am working on, we would love to have a certain oversampling feature, yet, it is not implemented anywhere. Therefore I would like to propose it here.

We are building an NLP model for binary classification, where one of the classes is strongly imbalanced. One of the approaches would be to oversample using data augmentation techniques for NLP, e.g. using the nlpaug library to replace some words with synonyms. Having a class in the library that lets us package the augmentation into an sklearn pipeline would be great! I can also see this being used in computer vision.

Let me know what you think. If this could become one of the features of this library, I would love to contribute. If it doesn't fit into this library, do you know any other open-source project where it would fit?

Cheers, Mateusz
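One hedged sketch of the proposal above: oversample the minority class by generating augmented copies of minority texts, where `augment` is a user-supplied stand-in for, e.g., synonym replacement from a library such as nlpaug (the function name and signature here are illustrative, not an existing API):

```python
import random

def augment_oversample(texts, labels, augment, minority_label, seed=0):
    """Oversample a minority text class with augmented copies until it
    matches the majority class size.  `augment` is any str -> str callable
    (e.g. synonym replacement); here it is a user-supplied stand-in."""
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    majority_n = sum(1 for y in labels if y != minority_label)
    new_texts, new_labels = list(texts), list(labels)
    # Generate augmented minority copies until the classes are balanced.
    for _ in range(max(majority_n - len(minority), 0)):
        new_texts.append(augment(rng.choice(minority)))
        new_labels.append(minority_label)
    return new_texts, new_labels
```

To compose with downstream estimators, a function like this could plausibly be wrapped in imbalanced-learn's `FunctionSampler` (with validation disabled, since the input is text rather than a numeric array), so it sits inside an imblearn `Pipeline`.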

beeb commented 3 years ago

Not sure if this is the right place, but for my work I implemented a custom version of SMOTE for Regression as described in this paper:

Torgo L., Ribeiro R.P., Pfahringer B., Branco P. (2013) SMOTE for Regression. In: Correia L., Reis L.P., Cascalho J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33

As mentioned in the original post, it would be nice to get SMOTE for Regression in imbalanced-learn.

Sandy4321 commented 3 years ago

I think that everyone is right in this discussion. However, I agree with @glemaitre that the main indexing should by method type, not characteristic. But it would be necessary to have @chandu8542 criteria on the benchmarking to see how all algorithms perform in terms of memory, speed, etc.. using some datasets at different set sizes. Of course, such benchmark should come with narrative documentation to guide the method's choice by the user. As always, PRs are welcome. We would gladly put our time into reviewing such PR so that nobody ever again faces the same troubles.

Great ideas. Do you have something implemented? For example, the benchmarking criteria to see how all algorithms perform in terms of memory, speed, etc. on datasets of different sizes.

Sandy4321 commented 3 years ago

Not sure if this is the right place, but for my work I implemented a custom version of SMOTE for Regression as described in this paper:

Torgo L., Ribeiro R.P., Pfahringer B., Branco P. (2013) SMOTE for Regression. In: Correia L., Reis L.P., Cascalho J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33

As mentioned in the original post, it would be nice to get SMOTE for Regression in imbalanced-learn.

Can you share a link to the code?

Sandy4321 commented 3 years ago

Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (Under-Bagging method) with an extension for multiclass problems (based on this article) as an extension of bagging class from sklearn, made a few months ago. I will gladly polish this implementation to match this package conventions for bagging classifiers and made a pull request if you are interested in such contribution.

Hi Marek, Kindly share with me python implementation of Roughly Balanced Bagging, i will be gratefull for you help.

Thank you.

Haleem

Is the code shared so far?

halimkas commented 3 years ago

Hi, I hope this is a good place to write about it: I have an implementation of Roughly Balanced Bagging (Under-Bagging method) with an extension for multiclass problems (based on this article) as an extension of bagging class from sklearn, made a few months ago. I will gladly polish this implementation to match this package conventions for bagging classifiers and made a pull request if you are interested in such contribution.

Hi Marek, Kindly share with me python implementation of Roughly Balanced Bagging, i will be gratefull for you help. Thank you. Haleem

Is code shared , so far?

Not yet!

thank you.

Haleem

chkoar commented 3 years ago

@beeb actually, they call it imbalanced regression, but in my view it is not. They call the whole approach utility-based learning, and the key thing is the utility function that is used, right? In any case, you can draft an implementation and we can talk about it.

beeb commented 3 years ago

Can you share link to code?

Here is the code from the original paper, which I also took as inspiration for my modified implementation: https://rdrr.io/cran/UBL/man/smoteRegress.html

beeb commented 3 years ago

@beeb actually they are call it imbalanced regression but to my view it is not. All the thing they call utility based learning and the key thing is around the utility function that it is used, right? In any case you can draft an implementations talk about it.

I'm not sure what you are saying. It's SMOTE, but they use a function to determine whether a data point is common or "rare" depending on how far from the mean of the distribution it falls (kind of: I used the extrema of the whiskers of a box plot as the inflection points for a CubicHermiteSpline that defines "rarity"; I think they also do this in the original code). Then they oversample the rare points by selecting a random nearest neighbour and computing the new sample in between (just like SMOTE); the difference is that the label value for the new point is a weighted average of the labels of the two parents.
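The approach described above can be sketched as follows (a simplification: the paper's smooth spline-based relevance function is replaced by a hard box-plot-whisker threshold, and parents are drawn as random pairs of rare points rather than true nearest neighbours; all names are illustrative):

```python
import numpy as np

def smoter_sketch(X, y, n_new=20, seed=0):
    """Rough sketch of the SMOTE-for-regression idea.

    A point is 'rare' when its target lies beyond the box-plot whiskers
    (Q1 - 1.5*IQR, Q3 + 1.5*IQR).  Rare points are oversampled by
    SMOTE-style interpolation, and the synthetic target is a
    distance-weighted average of the two parent targets."""
    rng = np.random.default_rng(seed)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    rare = np.flatnonzero((y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr))
    if len(rare) < 2:
        return X, y
    Xs, ys = [], []
    for _ in range(n_new):
        i, j = rng.choice(rare, size=2, replace=False)
        t = rng.random()
        x_new = X[i] + t * (X[j] - X[i])
        # Label: average weighted by the distance to each parent.
        d_i = np.linalg.norm(x_new - X[i])
        d_j = np.linalg.norm(x_new - X[j])
        w = d_j / (d_i + d_j) if d_i + d_j > 0 else 0.5
        Xs.append(x_new)
        ys.append(w * y[i] + (1 - w) * y[j])
    return np.vstack([X, Xs]), np.concatenate([y, ys])
```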

chkoar commented 3 years ago

@beeb yeap, I have read all their related work. Since they involve that utility function, to me it is not imbalanced regression but something like cost-sensitive regression. Apart from my personal opinion, I think this method still falls within the scope of the package, so I would love to see it implemented in imbalanced-learn. Please open a PR when you have time. It will be much appreciated.

zoj613 commented 3 years ago

Is there any interest in adding Localized Random Affine Shadowsampling (LoRAS) from the maintainers?

To quote from the paper's abstract:

We observed that LoRAS, on average generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested, improve the F1-Score with respect to SMOTE on an average, they compromise on the Balanced accuracy of a classification model. LoRAS on the contrary, improves both F1 Score and the Balanced accuracy thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.

If there is interest in inclusion to the library, then I can prepare a PR.

Reference: Bej, S., Davtyan, N., Wolfien, M. et al. LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 110, 279–301 (2021). https://doi.org/10.1007/s10994-020-05913-4
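For reference, the core LoRAS step — drawing noisy "shadowsamples" around a minority neighbourhood and combining them with random affine weights that sum to 1 — might be sketched like this (parameter names and defaults are illustrative, not the published implementation or its API):

```python
import numpy as np

def loras_sketch(X_min, n_new=50, k=5, n_shadow=3, sigma=0.05, seed=0):
    """Simplified LoRAS-style oversampling of a minority class."""
    rng = np.random.default_rng(seed)
    # Brute-force neighbourhoods inside the minority class (incl. the point).
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        hood = X_min[nn[rng.integers(len(X_min))]]
        # Shadowsamples: each neighbour jittered with Gaussian noise.
        shadows = np.repeat(hood, n_shadow, axis=0)
        shadows = shadows + rng.normal(scale=sigma, size=shadows.shape)
        # Random affine combination: Dirichlet weights sum to 1.
        w = rng.dirichlet(np.ones(len(shadows)))
        out.append(w @ shadows)
    return np.array(out)
```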

Sandy4321 commented 3 years ago

@zoj613

This version does not yet include the t-embedding parameter. Can you share the updated code, please?

https://link.springer.com/article/10.1007%2Fs10994-020-05913-4

Code availability: A preliminary implementation of the algorithm in Python (v3.7.4) and an example Jupyter notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/sbi-rostock/LoRAS. This version does not yet include the t-embedding parameter.

zoj613 commented 3 years ago

@zoj613

This version does not yet include the t-embedding parameter. can you share updated code pls

link.springer.com/article/10.1007%2Fs10994-020-05913-4

Code availability: A preliminary implementation of the algorithm in Python (v3.7.4) and an example Jupyter notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/sbi-rostock/LoRAS. This version does not yet include the t-embedding parameter.

I am not the author of the paper I posted here, so you will have to email the authors directly. However, I have an implementation I wrote last night that includes the t-SNE embedding. I have yet to upload it to git since it is not yet polished.

Sandy4321 commented 3 years ago

@zoj613

Great synchronization: you did it last night and I already asked about it. Only one question: where are you going to upload it, in which GitHub repo?

zoj613 commented 3 years ago

@zoj613

great synchronization you did it last night and I already ask about this only one question where you are going to upload it , on which github repo?

Here it is: https://github.com/zoj613/pyloras . I would have submitted a PR, but it appears as though the maintainers have no interest in this.

hayesall commented 3 years ago

Hey @zoj613 and @Sandy4321, please keep discussion focused, it creates a lot of noise otherwise.

@zoj613 I'm -1 on including it right now.

We loosely follow scikit-learn's rule of thumb to keep the maintenance burden down. Methods should be roughly 3 years old with 200+ citations.

zoj613 commented 3 years ago

Hey @zoj613 and @Sandy4321, please keep discussion focused, it creates a lot of noise otherwise.

@zoj613 I'm -1 on including it right now.

We loosely follow scikit-learn's rule of thumb to keep maintenance burden down. Methods should roughly be 3 years old and 200+ citations.

Fair enough. Keeping to the topic at hand, I submitted a PR at #789 implementing SMOTE-RSB from the checklist in the OP.

glemaitre commented 3 years ago

I think that we should prioritize the SMOTE variants that we want to include. We could reuse the benchmark proposed there: https://github.com/analyticalmindsltd/smote_variants/issues/14#issuecomment-552884893

Basically, we could propose to implement the following:

Currently, we have SVM/KMeans/KNN based SMOTE for historical reasons rather than performance reasons.

I think that we should also make an effort regarding the documentation. Currently, we show the differences in how the methods sample (this is already a good point). However, I think we should have a clearer guideline on which SMOTE variant works best for which applications. What I mean is that SMOTE, SMOTENC, and SMOTEN might already cover a good basis.

BradKML commented 1 year ago

@glemaitre are there any standard APIs to follow for the SMOTE variants?

glemaitre commented 1 year ago

Whenever possible it should inherit from SMOTE. You can check the current code hierarchy that we have for SMOTE.
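The inheritance pattern could look like the toy sketch below (a stand-in class, not imblearn's actual SMOTE, whose internal hook methods differ and vary across versions; the class and method names here are purely illustrative):

```python
import numpy as np

class SimpleSMOTE:
    """Minimal SMOTE stand-in (NOT imblearn's class), for the pattern only."""

    def __init__(self, k=5, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)

    def _new_sample(self, x, neighbor):
        # Classic SMOTE: uniform point on the segment [x, neighbor].
        return x + self.rng.random() * (neighbor - x)

    def resample_minority(self, X_min, n_new):
        # Brute-force k nearest neighbours within the minority class.
        d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
        nn = np.argsort(d, axis=1)[:, 1:self.k + 1]
        i = self.rng.integers(len(X_min), size=n_new)
        j = nn[i, self.rng.integers(nn.shape[1], size=n_new)]
        return np.array([self._new_sample(a, b)
                         for a, b in zip(X_min[i], X_min[j])])

class MidpointSMOTE(SimpleSMOTE):
    """A 'variant' overrides only the interpolation hook."""

    def _new_sample(self, x, neighbor):
        return (x + neighbor) / 2
```

A real variant would likewise inherit from imblearn's `SMOTE` and override only the step where its sampling strategy differs, keeping the shared neighbour search and validation logic in the base class.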