scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] SegFault with oversampler #984

siaahmadi closed 1 year ago

siaahmadi commented 1 year ago

Running the following code leads to a segfault (Python 3.9.2):

import numpy as np
from imblearn.over_sampling import SMOTE

over = SMOTE(k_neighbors=3)

X = np.array([[35., 18.],
       [80.,  0.],
       [18., 40.],
       [58.,  0.],
       [73., 20.],
       [20., 26.],
       [53., 29.],
       [ 0., 20.],
       [ 2., 40.],
       [18., 35.],
       [ 0.,  0.],
       [22., 40.],
       [ 0., 33.],
       [37., 60.]])

y = np.array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1])
over.fit_resample(X,y)

Version info:

Python 3.9.2
# Name                    Version                   Build
imbalanced-learn          0.10.1           py39hecd8cb5_0
numpy                     1.23.5           py39he696674_0  
numpy-base                1.23.5           py39h9cd3388_0  
scipy                     1.10.0           py39h91c6ef4_1

monody1 commented 1 year ago

I got a similar issue, but with ADASYN and KMeansSMOTE:

OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
[1] 3072415 segmentation fault (core dumped)

esaracin commented 1 year ago

I've been getting this and other C-level issues with nearly every sampler I've tried in imblearn 0.10.1 (recently raised another issue about them here). It's been pretty disappointing tbh.

For me, the errors tend to arise with large data sizes. I was able to produce a segfault with SMOTE earlier today (with data of shape roughly (20 million, 100)), but your example is working fine for me.

glemaitre commented 1 year ago

This looks like a low-level OpenBLAS issue. It could be linked to the internal NearestNeighbors, since imbalanced-learn itself does not contain such low-level code. I would advise you to report it upstream.

glemaitre commented 1 year ago

One potential issue that I ran into when releasing: if scikit-learn is installed from the defaults channel, it is built against the LLVM/Clang OpenMP runtime, which is incompatible with the MKL OpenMP runtime that may also be used in the install.

We reported the bug upstream: https://github.com/ContinuumIO/anaconda-issues/issues/13221

glemaitre commented 1 year ago

Providing the output of

python -m threadpoolctl -i sklearn

would allow checking if there is a mix of libomp and libiomp.
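
The same check can be scripted. The following is a minimal sketch, assuming threadpoolctl is installed and that its `threadpool_info()` entries carry the `user_api`, `prefix`, and `filepath` keys; the helper names `find_openmp_runtimes`, `has_mixed_openmp`, and `report` are hypothetical and not part of any library. It groups the loaded OpenMP libraries by prefix and flags the case where more than one runtime (for example `libomp` and `libiomp`) is present in the same process.

```python
from collections import defaultdict


def find_openmp_runtimes(modules):
    """Group loaded OpenMP libraries by their library-name prefix.

    `modules` is a list of dicts shaped like the output of
    threadpoolctl.threadpool_info(). Typical OpenMP prefixes are
    "libomp" (LLVM/Clang), "libiomp" (Intel, used by MKL) and
    "libgomp" (GNU).
    """
    runtimes = defaultdict(list)
    for module in modules:
        if module.get("user_api") == "openmp":
            prefix = module.get("prefix", "unknown")
            runtimes[prefix].append(module.get("filepath"))
    return dict(runtimes)


def has_mixed_openmp(modules):
    """Return True when more than one distinct OpenMP runtime is loaded."""
    return len(find_openmp_runtimes(modules)) > 1


def report():
    """Print the OpenMP runtimes loaded once scikit-learn is imported."""
    # Imported lazily so the helpers above work without these packages.
    import sklearn  # noqa: F401  (loads the native libraries to inspect)
    from threadpoolctl import threadpool_info

    modules = threadpool_info()
    for prefix, paths in find_openmp_runtimes(modules).items():
        print(prefix, paths)
    if has_mixed_openmp(modules):
        print("More than one OpenMP runtime is loaded in this process.")


if __name__ == "__main__":
    report()
```

Running the script in the environment that segfaults and pasting its output here would give the same information as the command above. A mix of runtimes is only a hint at the cause, not proof.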