scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

generating samples when training set is sparse #582

Open laurallu opened 5 years ago

laurallu commented 5 years ago

Description

When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is generated only when X[row].nnz is not 0, so when the current sample has 0 for all of its features, no synthetic sample is generated from it. See lines 115-123 in smote.py. (The only exception is when the zeros of an all-zero sample are stored explicitly as elements of the sparse matrix.)
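To illustrate the `nnz` behaviour described above, here is a minimal sketch using only scipy.sparse. It does not reproduce the code in smote.py; the matrices and variable names are made up for the example. It shows that an all-zero row built from a dense array has no stored entries, while a row holding an explicitly stored zero does.

```python
import numpy as np
from scipy import sparse

# Built from a dense array: zeros are not stored, so row 1 has nnz == 0.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 0.0],
                                [0.0, 3.0, 0.0]]))
implicit_nnz = X[1].nnz

# Built from raw CSR arrays: row 1 holds one stored entry whose value is 0.
data = np.array([1.0, 2.0, 0.0, 3.0])
indices = np.array([0, 2, 0, 1])
indptr = np.array([0, 2, 3, 4])
X_explicit = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

# Number of stored entries per row, read directly from the CSR row pointer.
stored_per_row = np.diff(X_explicit.indptr)

print(implicit_nnz)
print(stored_per_row)
```

A check based on `nnz` therefore treats the same all-zero sample differently depending on how the matrix was constructed.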

The new samples should be generated using the rule: $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, $s_{new} = \epsilon \times s_{nn}$. Thus I think a sample should still be generated from its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
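The rule above can be checked on a small sparse example. This is only a sketch: `interpolate` is a hypothetical helper written for this comment, not a function from imbalanced-learn, and the neighbor is picked by hand rather than by a nearest-neighbor search.

```python
import numpy as np
from scipy import sparse


def interpolate(s_i, s_nn, eps):
    """Hypothetical helper: s_new = s_i - eps * (s_i - s_nn)."""
    return s_i - eps * (s_i - s_nn)


X = sparse.csr_matrix(np.array([[0.0, 0.0, 0.0],   # all-zero sample s_i
                                [2.0, 0.0, 4.0]]))  # assumed neighbor s_nn
s_i, s_nn = X[0], X[1]

# s_i has no stored entries, yet the rule still yields a valid sample.
s_new = interpolate(s_i, s_nn, eps=0.25)
print(s_i.nnz, s_new.toarray())
```

With an all-zero $s_i$ the result reduces to $\epsilon \times s_{nn}$, so the arithmetic itself does not require skipping such rows.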

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

glemaitre commented 5 years ago

I am not sure this is an issue because there is only a single sample with all zeros. I would not be surprised if such a sample made the algorithm crash, but I would need to run the test.

A PR is welcome if you can show that it works.