scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[ENH] Add E2SC to imbalanced-learn #1090

Open waashk opened 3 months ago

waashk commented 3 months ago

I'm opening this issue to propose the inclusion of the E2SC [1] method in the imbalanced-learn library. As the main author of this strategy, I believe that integrating E2SC will offer significant value to users dealing with imbalanced datasets.

Background:

The E2SC [1] method is currently considered state-of-the-art (SOTA) in the field of Instance Selection (IS). Although IS and undersampling have different primary objectives, they are related techniques: both aim to select a representative subset of a larger dataset. In extensive experiments, E2SC was shown to be particularly effective at addressing class imbalance by selecting the most informative instances, which in turn improved model performance.

As for the implementation, I have already implemented E2SC in a separate repository, released under the MIT license, and ensured that it is compatible with both scikit-learn and imbalanced-learn.
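
To make "compatible with imbalanced-learn" concrete: samplers in the library expose a `fit_resample(X, y)` method that returns the resampled features and labels. The sketch below only illustrates that contract. The class `FirstKUnderSampler` is hypothetical, its selection rule is a placeholder, and it is neither E2SC nor part of either library. It uses plain Python lists to stay dependency-free, whereas a real contribution would presumably work on arrays and build on the library's own base classes.

```python
from collections import Counter, defaultdict


class FirstKUnderSampler:
    """Hypothetical toy sampler (NOT E2SC).

    Keeps the first k rows of each class, where k is the size of the
    smallest class, so that all classes end up equally represented.
    """

    def fit_resample(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        # The smallest class sets the per-class budget.
        budget = min(Counter(y).values())
        kept = defaultdict(int)
        X_res, y_res = [], []
        for row, label in zip(X, y):
            # Keep a row only while its class is still under budget.
            if kept[label] < budget:
                kept[label] += 1
                X_res.append(row)
                y_res.append(label)
        return X_res, y_res


# Four samples of class 0 and two of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = FirstKUnderSampler().fit_resample(X, y)
print(Counter(y_res))
```

An instance selection method such as E2SC would replace the "first k rows" rule with its own criterion for deciding which instances to keep, while the call signature and return shape stay the same.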

Describe the solution you'd like

Inclusion of E2SC in the imbalanced-learn library. I would be happy to do this myself and will open a PR referencing this issue soon. Please let me know if there is anything else I should consider before proceeding.

Thank you for considering this enhancement. I look forward to the possibility of collaborating and contributing to the imbalanced-learn community.

[1] Cunha, Washington, et al. "An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification." Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023.

lucasgsfelix commented 4 days ago

It would be really interesting to see your solution in the library. I use imblearn constantly, and from reading the paper, it looks like it could be more efficient than some of the currently available methods.