gykovacs / smote_variants

A collection of 85 minority oversampling techniques

Proportion parameter #1

Open kuzey-edes opened 4 years ago

kuzey-edes commented 4 years ago

For each proportion setting, I still have the same number of objects. To me, the package seems to balance the dataset as if proportion = 1, regardless of which value I assign to proportion.

The code is given below.

from sklearn import datasets
import smote_variants as sv
from imblearn.datasets import make_imbalance

# resample for proportion = 0.5
oversampler1 = sv.MulticlassOversampling(sv.SMOTE_Cosine(proportion=0.5, random_state=42))

# resample for proportion = 3
oversampler2 = sv.MulticlassOversampling(sv.SMOTE_Cosine(proportion=3, random_state=42))

# resample for proportion = 1
oversampler3 = sv.MulticlassOversampling(sv.SMOTE_Cosine(proportion=1, random_state=42))

# iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# make iris dataset imbalanced
X, y = make_imbalance(X, y, sampling_strategy={0: 5, 1: 20, 2: 50}, random_state=42)

X1, y1 = oversampler1.sample(X, y)
X2, y2 = oversampler2.sample(X, y)
X3, y3 = oversampler3.sample(X, y)

# check the number of samples
print(len(X1))
print(len(X2))
print(len(X3))
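Since len() only reports the total, a per-class count makes it easier to see what each oversampler actually did. A minimal sketch, assuming a hypothetical helper named class_counts and a hand-built label list that mirrors the sampling_strategy above (it is not the oversampler output):

```python
from collections import Counter


def class_counts(y):
    """Return {label: count} for a sequence of class labels (hypothetical helper)."""
    return dict(Counter(y))


# Labels with the same class sizes as the imbalanced iris subset above.
y_imbalanced = [0] * 5 + [1] * 20 + [2] * 50
print(class_counts(y_imbalanced))
```

Applying the same helper to y1, y2 and y3 should show whether the three runs differ per class or only look identical in total.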

gykovacs commented 4 years ago

Well, yeah, this issue is a bit odd; I should have documented it properly.

So, the thing is that proportion makes total sense in binary classification problems; in multiclass problems, however, it is not interpretable, as each class might be oversampled with a different proportion. Put another way, in multiclass problems the question is: proportion of what to what? The concept of proportion is closely tied to binary classification. As the number of classes grows, the number of proportion-like parameters grows exponentially, and they also become hard to interpret. So, in multiclass settings, proportion is completely overridden and datasets are balanced by default.
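To make the binary meaning concrete, here is a minimal sketch. It assumes that proportion is the fraction of the gap between the majority and minority class sizes to fill with synthetic samples, so that 1.0 means fully balanced. The function name and the rounding are illustrative assumptions, not the package's actual code:

```python
def n_to_sample_binary(n_maj, n_min, proportion=1.0):
    """Synthetic minority samples to generate under the assumed binary reading.

    proportion = 1.0 fills the whole gap, so the classes end up balanced.
    Truncation via int() is an assumption made for this sketch.
    """
    return max(int(proportion * (n_maj - n_min)), 0)


# With 50 majority and 5 minority samples:
for p in (0.5, 1.0, 3):
    print(p, n_to_sample_binary(50, 5, p))
```

With only two classes there is a single gap to fill, so one number is enough. With three or more classes it is no longer obvious which gap the number refers to.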

I am really open to finding the right way of oversampling multiclass datasets with a proportion parameter, without turning it into a problem of setting the expected proportion of all classes vs. all classes.
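One possible reading, offered only as a sketch for discussion and not as something the package implements: apply a single proportion to every non-majority class, each measured against the largest class. The helper name per_class_targets is hypothetical:

```python
from collections import Counter


def per_class_targets(y, proportion=1.0):
    """Synthetic samples per class if each class is compared to the largest one.

    Hypothetical scheme: every smaller class fills the same fraction of its
    own gap to the majority class, so a single parameter still suffices.
    """
    counts = Counter(y)
    n_maj = max(counts.values())
    return {
        label: int(proportion * (n_maj - n))
        for label, n in counts.items()
        if n < n_maj
    }


# Same class sizes as the imbalanced iris example in this thread.
y = [0] * 5 + [1] * 20 + [2] * 50
print(per_class_targets(y, proportion=0.5))
print(per_class_targets(y, proportion=1.0))
```

This keeps one parameter, at the cost of treating the majority class as the fixed reference. Whether that is the right trade-off is exactly the open question.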

Any comments are welcome!