generating samples when training set is sparse

Description

When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is only generated when X[row].nnz is not 0, so when the current sample has 0 for all of its features, no synthetic sample is generated. See lines 115-123 in smote.py. (The only exception is when the zeros of an all-zero sample are stored explicitly as elements of the sparse matrix, in which case nnz is not 0.)
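To illustrate the nnz behavior described above, here is a minimal sketch using only scipy.sparse. It does not reproduce the code in smote.py; the toy matrices and variable names are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): the second row has 0 for every feature.
X_dense = np.array([[1.0, 0.0, 2.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0]])
X = sparse.csr_matrix(X_dense)

# Number of stored entries per row; the all-zero row has none,
# so a check like `X[row].nnz != 0` would skip it.
row_nnz = [X[row].nnz for row in range(X.shape[0])]

# The exception: zeros can be stored explicitly in the sparse structure.
# Building the matrix from (data, indices, indptr) keeps them as entries.
data = np.array([0.0, 0.0])
indices = np.array([0, 1])
indptr = np.array([0, 2])
X_explicit = sparse.csr_matrix((data, indices, indptr), shape=(1, 3))

stored_entries = X_explicit.nnz                # counts explicit zeros
true_nonzeros = X_explicit.count_nonzero()     # counts only non-zero values
```

In this sketch nnz counts stored entries rather than non-zero values, which is why an all-zero sample with explicitly stored zeros would still pass an nnz check.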

The new samples should be generated using the rule $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, this reduces to $s_{new} = \epsilon \times s_{nn}$. So I think a sample should still be generated from its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
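As a rough sketch of the rule above applied to sparse rows, the following shows that an all-zero $s_i$ still yields a meaningful synthetic sample. The `interpolate` helper, the toy vectors, and the fixed value of `eps` are assumptions for illustration, not the library's implementation.

```python
import numpy as np
from scipy import sparse


def interpolate(s_i, s_nn, eps):
    """Hypothetical helper: s_new = s_i - eps * (s_i - s_nn)."""
    return s_i - eps * (s_i - s_nn)


# Current sample with 0 for all features, and one of its nearest neighbors.
s_i = sparse.csr_matrix(np.array([[0.0, 0.0, 0.0]]))
s_nn = sparse.csr_matrix(np.array([[2.0, 0.0, 4.0]]))

# In SMOTE eps would be drawn at random from [0, 1]; fixed here for clarity.
eps = 0.25

s_new = interpolate(s_i, s_nn, eps)

# With s_i all zeros, the rule reduces to eps * s_nn.
reduced = eps * s_nn
```

Here `s_new` and `reduced` hold the same values, matching the reduction $s_{new} = \epsilon \times s_{nn}$ for an all-zero sample.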

scikit-learn-contrib / imbalanced-learn

generating samples when training set is sparse #582