scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] SMOTE generating NaN values #1036

Closed pradeepdev-1995 closed 1 year ago

pradeepdev-1995 commented 1 year ago

I am getting NaN values in the result of SMOTE-based resampling.

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

# Build a small imbalanced dataset: 5 majority rows and 2 minority rows
majority_class = pd.DataFrame({'feature1': np.random.randn(5),
                               'feature2': np.random.randn(5),
                               'label': [0] * 5})
minority_class = pd.DataFrame({'feature1': np.random.randn(2),
                               'feature2': np.random.randn(2),
                               'label': [1] * 2})
imbalanced_dataset = pd.concat([majority_class, minority_class], ignore_index=True)
print(imbalanced_dataset)

# Oversample with SMOTE
x = imbalanced_dataset[['feature1', 'feature2']]
y = imbalanced_dataset[['label']]
smote = SMOTE(sampling_strategy='all', k_neighbors=1)
X_resampled_smote, y_resampled_smote = smote.fit_resample(x, y)

# Combine the resampled features and labels into one data frame
balanced_dataset = pd.concat([X_resampled_smote, y_resampled_smote], ignore_index=True)
print(balanced_dataset)

Output

imbalanced_dataset

   feature1  feature2  label
0  0.222079 -0.104746      0
1 -0.767977 -0.525123      0
2  0.142465  1.912771      0
3 -0.034652 -2.026720      0
4  1.134339  1.119424      0
5  0.779193  1.130228      1
6 -1.101098  0.373119      1

After balancing

    feature1  feature2  label
0   0.496714 -0.234137    NaN
1  -0.138264  1.579213    NaN
2   0.647689  0.767435    NaN
3   1.523030 -0.469474    NaN
4  -0.234153  0.542560    NaN
5  -0.463418  0.241962    NaN
6  -0.465730 -1.913280    NaN
7  -0.464516 -0.782264    NaN
8  -0.464342 -0.619835    NaN
9  -0.463526  0.141386    NaN
10       NaN       NaN    0.0
11       NaN       NaN    0.0
12       NaN       NaN    0.0
13       NaN       NaN    0.0
14       NaN       NaN    0.0
15       NaN       NaN    1.0
16       NaN       NaN    1.0
17       NaN       NaN    1.0
18       NaN       NaN    1.0
19       NaN       NaN    1.0

Why am I getting a new data frame with so many NaN values? I expected a resampled data frame without any NaN values.
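
The shape of the output (20 rows, with features missing in one half and labels missing in the other) is consistent with how `pd.concat` behaves by default rather than with anything SMOTE generates: `pd.concat` uses `axis=0`, which stacks frames vertically and fills columns a frame lacks with NaN. The sketch below reproduces that pattern with pandas alone. `X_res` and `y_res` are hypothetical stand-ins for the two frames returned by `fit_resample`, not real SMOTE output, and the random values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the frames returned by fit_resample:
# 10 rows of features and 10 rows of labels, both indexed 0..9.
rng = np.random.default_rng(0)
X_res = pd.DataFrame({"feature1": rng.normal(size=10),
                      "feature2": rng.normal(size=10)})
y_res = pd.DataFrame({"label": [0] * 5 + [1] * 5})

# Default axis=0 stacks the frames vertically: 20 rows, and each
# column that a frame does not have is filled with NaN.
stacked = pd.concat([X_res, y_res], ignore_index=True)

# axis=1 places the frames side by side, aligned on the shared index.
side_by_side = pd.concat([X_res, y_res], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))            # (20, 3) 30
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))  # (10, 3) 0
```

If the same holds for the real frames, passing `axis=1` to the final `pd.concat` call (and dropping `ignore_index=True`, which would otherwise renumber the column labels) should give a 10-row frame with no NaN values.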