scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Index is out of bounds for axis 0 #982

Open taimoorhussain1259 opened 1 year ago

taimoorhussain1259 commented 1 year ago

Hello everyone,

I used this library and it worked very well. Due to some conflicts in conda, I had to recreate my environment. I reinstalled imbalanced-learn==0.10, but I am now facing this issue. Please guide me. Thanks.

IndexError                                Traceback (most recent call last)
Cell In[8], line 72
     69 valid_index = train_valid_index[1]
     71 train_data_df = data_train_valid.iloc[train_index, :]
---> 72 train_data_df = generate_synthetic_samples(train_data_df)
     74 #train_data_df = [train_data_df, model_old.sample(num_rows=200), model_new.sample(num_rows=200)]
     75 #train_data_df = pd.concat(train_data_df)        
     77 X_train=train_data_df.iloc[:, :-1]

Cell In[7], line 11, in generate_synthetic_samples(data_df)
---> 11 X_smote, Y_smote = SMOTE(random_state=42).fit_resample(temp_df.iloc[:, :-1], temp_df.iloc[:, -1])
     12 X_border, Y_border = BorderlineSMOTE(random_state=42).fit_resample(temp_df.iloc[:, :-1], temp_df.iloc[:, -1])

File ~/anaconda3/envs/p39/lib/python3.9/site-packages/imblearn/base.py:203, in BaseSampler.fit_resample(self, X, y)
    182 """Resample the dataset.
    183 
    184 Parameters
   (...)
    200     The corresponding label of `X_resampled`.
    201 """
    202 self._validate_params()
--> 203 return super().fit_resample(X, y)

File ~/anaconda3/envs/p39/lib/python3.9/site-packages/imblearn/base.py:88, in SamplerMixin.fit_resample(self, X, y)
     82 X, y, binarize_y = self._check_X_y(X, y)
     84 self.sampling_strategy_ = check_sampling_strategy(
     85     self.sampling_strategy, y, self._sampling_type
     86 )
---> 88 output = self._fit_resample(X, y)
     90 y_ = (
     91     label_binarize(output[1], classes=np.unique(y)) if binarize_y else output[1]
     92 )
     94 X_, y_ = arrays_transformer.transform(output[0], y_)

File ~/anaconda3/envs/p39/lib/python3.9/site-packages/imblearn/over_sampling/_smote/base.py:356, in SMOTE._fit_resample(self, X, y)
    354 self.nn_k_.fit(X_class)
    355 nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
--> 356 X_new, y_new = self._make_samples(
    357     X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
    358 )
    359 X_resampled.append(X_new)
    360 y_resampled.append(y_new)

File ~/anaconda3/envs/p39/lib/python3.9/site-packages/imblearn/over_sampling/_smote/base.py:110, in BaseSMOTE._make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    107 rows = np.floor_divide(samples_indices, nn_num.shape[1])
    108 cols = np.mod(samples_indices, nn_num.shape[1])
--> 110 X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
    111 y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
    112 return X_new, y_new

File ~/anaconda3/envs/p39/lib/python3.9/site-packages/imblearn/over_sampling/_smote/base.py:154, in BaseSMOTE._generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
    114 def _generate_samples(self, X, nn_data, nn_num, rows, cols, steps):
    115     r"""Generate a synthetic sample.
    116 
    117     The rule for the generation is:
   (...)
    152         Synthetically generated samples.
    153     """
--> 154     diffs = nn_data[nn_num[rows, cols]] - X[rows]
    156     if sparse.issparse(X):
    157         sparse_func = type(X).__name__

IndexError: index 94092224477536 is out of bounds for axis 0 with size 348

glemaitre commented 1 year ago

We will need a minimal reproducer to be able to check whether this is a bug or a misuse.
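For anyone triaging in the meantime, the failing line in the traceback is `diffs = nn_data[nn_num[rows, cols]] - X[rows]`, with `rows` and `cols` derived from `samples_indices` via `np.floor_divide` and `np.mod`. The sketch below mimics that lookup in plain Python so the index arithmetic is easy to follow. It is only an illustration: `lookup_neighbours` is a hypothetical helper, not part of imbalanced-learn, and the assumption that the out-of-range value comes from an entry of `nn_num` (rather than from `rows`) has not been confirmed.

```python
def lookup_neighbours(nn_num, samples_indices, n_rows_nn_data):
    """Mimic the index arithmetic behind nn_data[nn_num[rows, cols]].

    nn_num is a list of rows, each holding k neighbour indices.
    Hypothetical helper for illustration only, not imbalanced-learn code.
    """
    k = len(nn_num[0])  # plays the role of nn_num.shape[1]
    picked = []
    for flat in samples_indices:
        # Equivalent to np.floor_divide(flat, k) and np.mod(flat, k).
        row, col = divmod(flat, k)
        neighbour = nn_num[row][col]
        # NumPy accepts indices in [-n, n); anything else raises IndexError.
        if not -n_rows_nn_data <= neighbour < n_rows_nn_data:
            raise IndexError(
                f"index {neighbour} is out of bounds for axis 0 "
                f"with size {n_rows_nn_data}"
            )
        picked.append((row, neighbour))
    return picked


# Valid neighbour table: 3 samples, 2 neighbours each.
print(lookup_neighbours([[1, 2], [0, 2], [0, 1]], [0, 3, 5], 3))

# A corrupted entry reproduces the shape of the reported error message.
try:
    lookup_neighbours([[1, 22117]], [1], 1000)
except IndexError as exc:
    print(exc)
```

If the assumption holds, inspecting the array returned by `self.nn_k_.kneighbors(...)` for values outside the valid range would be a reasonable first check.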

jkelin commented 1 year ago

Also seeing this in SMOTE, BorderlineSMOTE and ADASYN. 0.10 works fine, 0.11 breaks.

Repro:

from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10,
)

BorderlineSMOTE().fit_resample(
    X,
    y,
)

Throws IndexError: index 22117 is out of bounds for axis 0 with size 1000

Setting n_features=10 fixes the issue.

Python 3.11.4, Numpy 1.24.4, Linux
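The environment line above does not list the scikit-learn or imbalanced-learn versions, which may matter since the report is that 0.10 works and 0.11 breaks. A minimal sketch for collecting them with the standard library is below. The package list is an assumption about what is relevant here, and `collect_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def collect_versions(packages=("imbalanced-learn", "scikit-learn", "numpy", "scipy")):
    """Return a dict of installed versions for the given distribution names."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}: {version}")
```

Pasting this output from both a working and a failing environment should make it easier to see which package changed between them.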