Is this the intended behavior?
Yes, it is. Please note that the intention is not to balance the dataset but to improve the classification performance.
I am referencing the original paper, where the combinations SMOTE + Tomek links and SMOTE + ENN are proposed. Note that the quotation below mentions Tomek links, but in the following paragraph the authors explain why ENN is used instead of TL, so the ENN behavior is clearly derived from it:
Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.
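For reference, the default behaviour corresponds (roughly, as a sketch) to building SMOTEENN with an ENN that is allowed to clean both classes:

from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# Roughly equivalent to the default: the internal ENN is created with
# sampling_strategy="all", so samples from both classes may be removed.
smote_enn_default = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy="all"))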
Having said that, you can pass a custom heuristic or your own configured version of ENN:

from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

# Illustrative configuration: here ENN only cleans the majority class;
# adjust the parameters to whatever suits your data.
custom_enn = EditedNearestNeighbours(sampling_strategy="majority")
smote_enn = SMOTEENN(enn=custom_enn)
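Then use it as you would the default (assuming your own X and y):

X_res, y_res = smote_enn.fit_resample(X, y)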
OK, thanks for taking the time to answer!
Hi everyone,
I'm fairly new to the machine learning field, so my apologies if the question seems very simple. I'm trying to do some classification on several datasets, some of which are not well separated. I observed that SMOTEENN can sometimes output an even more unbalanced dataset than the input. A small example:
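(A minimal sketch of the kind of setup I mean; the dataset parameters are arbitrary, just to reproduce the behaviour.)

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Two poorly separated classes with a roughly 10:1 imbalance
X, y = make_classification(
    n_samples=1100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    class_sep=0.5,
    random_state=0,
)
print("original:      ", Counter(y))

X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print("after SMOTEENN:", Counter(y_res))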
You can already see with this dataset that SMOTEENN is not performing well: the resampled output ends up even more unbalanced than the original data.
When you run the two components separately, the issue becomes clear:
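(Again a sketch, using the same toy X and y as above.)

from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Step 1: over-sample the minority class
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))

# Step 2: clean with ENN configured as SMOTEENN does by default
# (sampling_strategy="all"), so samples from both classes may be removed
X_enn, y_enn = EditedNearestNeighbours(sampling_strategy="all").fit_resample(X_sm, y_sm)
print("after ENN:  ", Counter(y_enn))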
So clearly, the ENN step undersampled far more samples from one class than from the other, since it treats both classes equally after the SMOTE step. I assume the reason is that there is less variation within the over-sampled class than within the dominant one. In some real datasets, I have even observed the complete disappearance of one of the classes. While I think I understand the reason behind this, I wonder whether this is an issue some users might not be aware of, since the behavior is completely silent in most cases when used in a pipeline. Is this the intended behavior? Thanks!