Open laurallu opened 5 years ago
I am not sure this is an issue, because there is only a single sample with all zeros. I would not be surprised if such a sample made the algorithm crash, but I would need to run the test.
A PR is welcome if you can show that it works.
Description
When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is only generated when X[row].nnz is not 0. This means that when the current sample has 0 for all of its features, no synthetic sample is generated. See lines 115-123 in smote.py. (The only exception is when the zeros of an all-zero sample are stored explicitly as elements of the sparse matrix, since nnz counts stored entries.)
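To make the `nnz` behaviour concrete, here is a minimal sketch. It does not reproduce the code in smote.py; the matrices and variable names are made up for illustration. It only shows that an all-zero row of a CSR matrix reports `nnz == 0`, unless its zeros are stored explicitly.

```python
import numpy as np
from scipy import sparse

# Illustrative data (not from the issue): the middle row is all zeros.
X_dense = np.array([[1.0, 0.0, 2.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0]])
X = sparse.csr_matrix(X_dense)

# Stored entries per row; the all-zero row has none, so a check such as
# `X[row].nnz != 0` would skip it.
nnz_per_row = [int(X[row].nnz) for row in range(X.shape[0])]
print(nnz_per_row)

# Same values, but row 1 now carries one explicitly stored zero.
# Built from (data, indices, indptr); explicit zeros are kept as stored entries.
X_explicit = sparse.csr_matrix(
    (np.array([1.0, 2.0, 0.0, 3.0]),  # data
     np.array([0, 2, 0, 1]),          # column indices
     np.array([0, 2, 3, 4])),         # row pointers
    shape=(3, 3),
)
nnz_explicit = [int(X_explicit[row].nnz) for row in range(3)]
print(nnz_explicit)
```

In this sketch the second matrix is numerically identical to the first, yet its middle row would pass an `nnz` check, which is the exception mentioned above.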
The new samples should be generated using the rule $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, $s_{new} = \epsilon \times s_{nn}$. Thus I think a sample should still be generated using its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
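The special case can be checked numerically. The following is a sketch under assumptions: `interpolate`, the neighbor values, and `eps` are hypothetical and chosen only for illustration, and this is not the library's implementation. It applies the rule above to an all-zero sparse row and confirms the result equals $\epsilon \times s_{nn}$.

```python
import numpy as np
from scipy import sparse

def interpolate(s_i, s_nn, eps):
    """Hypothetical helper applying the rule s_new = s_i - eps * (s_i - s_nn)."""
    return s_i - eps * (s_i - s_nn)

# Current sample: a 1x3 sparse row with no stored entries (all zeros).
s_i = sparse.csr_matrix((1, 3))
# Illustrative nearest neighbor and interpolation factor.
s_nn = sparse.csr_matrix(np.array([[2.0, 0.0, 4.0]]))
eps = 0.25

s_new = interpolate(s_i, s_nn, eps)

# For an all-zero s_i, the rule reduces to eps * s_nn.
print(s_new.toarray())
print(np.allclose(s_new.toarray(), (eps * s_nn).toarray()))
```

The result is a valid, non-zero synthetic sample lying on the segment between the origin and the neighbor, which supports treating all-zero samples like any other sample.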
Steps/Code to Reproduce
Expected Results
Actual Results
Versions