Working with imbalanced text datasets

scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

https://imbalanced-learn.org

MIT License


Working with imbalanced text datasets #992

Closed (MattesR closed this issue 1 year ago)

MattesR commented 1 year ago

Hi, I'm working with an imbalanced text dataset that I want to classify using BERT embeddings. As I understand it, your library is not really suited for balancing text datasets (in combination with contextual word embeddings as features), because it works with numerical data only. Is that correct?

chkoar commented 1 year ago

AFAICR, the random samplers do not check dtypes.

In [1]:  from imblearn.under_sampling import RandomUnderSampler; RandomUnderSampler(random_state=0).fit_resample([["neg"], ["neg"], ["pos"]], [0,0,1])
Out[1]: ([['neg'], ['pos']], [0, 1])

MattesR commented 1 year ago

Thanks for the answer. However, I don't really need a library for randomly sampling my dataset ;) I stumbled upon the library while researching better/different sampling methods for imbalanced datasets, and I wanted to make sure that I didn't misunderstand some aspect of the algorithms.

chkoar commented 1 year ago

If you encode each of your documents as a vector of floats, then you can use any method of the library.