scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] SMOTEENN and SMOTETomek run for ages on larger datasets on the new update #817

Closed jruokolainen closed 2 years ago

jruokolainen commented 3 years ago

I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 min. I updated, and the new version ran for 1.5 h before I killed the process.

    balancer = SMOTETomek(random_state=2425, n_jobs=-1)
    df_resampled, target_resampled = balancer.fit_resample(dataframe, target)
    return df_resampled, target_resampled
hayesall commented 3 years ago

Something similar was reported in #784

Can you include a portion of your data, and environment details from this command:

python -c 'import imblearn; imblearn.show_versions(github=True)'
glemaitre commented 3 years ago

In #784, it was indeed not due to imbalanced-learn but to the new scikit-learn 0.24.

@jruokolainen Could you downgrade scikit-learn to 0.23? (It should still work even if we force the use of 0.24.)

Could you give the shape of the array and the data types so that we can try to reproduce the issue?

glemaitre commented 3 years ago

As well as the info asked for by @hayesall. It would be really useful.

glemaitre commented 3 years ago

I used the following code:

%%time

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1_000_000, n_features=10,
    n_classes=3, weights=[0.05, 0.1, 0.85],
    n_informative=4, random_state=0
)

from imblearn.combine import SMOTETomek

SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)

With imbalanced-learn 0.7.0 and master, and scikit-learn 0.23.X and master, I get a wall time of 6 minutes both times. We really need all the information regarding the numpy, scipy, and scikit-learn versions, the system, and the dimensionality of the problem.

jruokolainen commented 3 years ago

Yeah, I can't share the dataset, but overall it has approx. 20 million rows, a class imbalance of 99% negative / 1% positive labels, and around 50 feature columns. These were the versions that caused the bug: imbalanced-learn 0.8.0 with scikit-learn 0.24.1. It works like a charm with imbalanced-learn 0.7.0.

glemaitre commented 3 years ago

When you say that imbalanced-learn 0.7.0 works like a charm, was it with scikit-learn 0.23 or 0.24? Could you also provide the OS that you are working with?

jruokolainen commented 3 years ago

It's working on OS X Mojave with imbalanced-learn 0.7.0 and scikit-learn 0.24.1, but not with imblearn 0.8.

jruokolainen commented 3 years ago

This one doesn't complete; it occupies a lot of threads on my MacBook Pro 2019 but doesn't strain the CPU at all.

❯ pipenv run python -c 'import imblearn; imblearn.show_versions(github=True)'

System, Dependency Information

**System Information**
* python    : 3.8.7 (v3.8.7:6503f05dd5, Dec 21 2020, 12:45:15) [Clang 6.0 (clang-600.0.57)]
* executable: /Users/jokke/.local/share/virtualenvs/b2c-p2p-scorer-model-h2qOSVv_/bin/python
* machine   : macOS-10.16-x86_64-i386-64bit

**Python Dependencies**
* pip       : 21.0.1
* setuptools: 53.0.0
* imblearn  : 0.8.0
* sklearn   : 0.24.1
* numpy     : 1.19.5
* scipy     : 1.6.2
* Cython    : None
* pandas    : 1.2.3
* keras     : None
* tensorflow: 2.4.1
* joblib    : 1.0.1
jruokolainen commented 3 years ago

These run without a problem on GCP AI Platform.

jruokolainen commented 3 years ago

This would roughly simulate the dataset size I'm using:

%%time
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek
X, y = make_classification(
    n_samples=850862*3, n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0)
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
jruokolainen commented 3 years ago

> I used the following code:
>
> %%time
>
> from sklearn.datasets import make_classification
>
> X, y = make_classification(
>     n_samples=1_000_000, n_features=10,
>     n_classes=3, weights=[0.05, 0.1, 0.85],
>     n_informative=4, random_state=0
> )
>
> from imblearn.combine import SMOTETomek
>
> SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
>
> With imbalanced-learn 0.7.0 and master, and scikit-learn 0.23.X and master, I get a wall time of 6 minutes both times. We really need all the information regarding the numpy, scipy, and scikit-learn versions, the system, and the dimensionality of the problem.

This one runs in a notebook without a problem even with the newest version, but with my production dataset the CPU usage lingers at 20% and it never completes.

ogrisel commented 3 years ago

@jruokolainen can you please report the times you observe on your machine when running the reproducer you posted at https://github.com/scikit-learn-contrib/imbalanced-learn/issues/817#issuecomment-806719478 ?

both with imbalanced-learn 0.7.0 and with imbalanced-learn 0.8.0.

ogrisel commented 3 years ago

> I've been using SMOTETomek in production with success for a while.

@jruokolainen unrelated to the performance problem: out of curiosity, I would like to know more about practical applications of SMOTE in production: what kind of data are you working with? What kind of classifier do you use downstream in the pipeline? What is the class balancing ratio (5% vs 95% as in the reproducer)? What improvement in terms of balanced accuracy, F1, AUC or other metrics do you observe with SMOTETomek vs other balanced classification approaches (such as subsampling the majority class) or using BalancedRandomForest or LogisticRegression with class weights?

ogrisel commented 3 years ago

I can indeed observe a small perf regression on a smaller subset of the data when upgrading:

(imblearn-07) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 63.1 s and generated 28.6 MB
(imblearn-latest) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 76.2 s and generated 28.6 MB

I have tried with joblib 1.0 and 0.17 in both environments and it does not seem to matter.

Here is the reproducer I used:

from time import perf_counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(
    n_samples=int(3e4), n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0
)

print(f"Generated {X.nbytes / 1e6:.1f} MB of training data")
tic = perf_counter()
X_out, y_out = SMOTETomek(n_jobs=-1, random_state=0
    ).fit_resample(X, y)
toc = perf_counter()
print(f"SMOTETomek took {toc - tic:.1f} s and generated {X_out.nbytes / 1e6:.1f} MB")

You can try to increase the number of samples but the ratio of the runtimes seems to stay approximately constant.

It would be worth investigating what the bottleneck is with a profiler and reporting the regression upstream in scikit-learn if it can be reproduced with scikit-learn code only.

I suspect that using the Ball-Tree algorithm in the embedded nearest-neighbors search on 63-dimensional data might be suboptimal. It would be worth checking with the brute-force method and also with an approximate method such as https://github.com/lmcinnes/pynndescent, but that requires significant code changes.
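
For example, a minimal way to profile a run (a sketch reusing the X, y from the reproducer above; the cutoff of 20 entries is arbitrary):

import cProfile
import pstats

from imblearn.combine import SMOTETomek

# Profile a single fit_resample call on the reproducer data
profiler = cProfile.Profile()
profiler.enable()
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
profiler.disable()

# Print the 20 entries with the largest cumulative time to spot the bottleneck
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)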

ogrisel commented 3 years ago

It seems that both SMOTE and TomekLinks fit their own k-NN model internally. Wouldn't there be a way to make the SMOTE model also return the nearest-neighbor info for each resampled data point to avoid this?

ogrisel commented 3 years ago

The change in scikit-learn 0.24 that might explain the performance behavior for very large datasets with a large enough number of features is:

https://github.com/scikit-learn/scikit-learn/pull/17148

So it could make sense to give the users the ability to switch the underlying NN search strategy (brute vs ball-tree); maybe the heuristic used in 0.24 is not optimal...

chkoar commented 3 years ago

> So it could make sense to give the users the ability to switch the underlying NN search strategy

Correct. The plan is to create backends for nearest-neighbor searches, so we could leverage libraries like faiss or annoy without explicitly requiring them.

chkoar commented 3 years ago

> The plan is

Well, probably, my thought was...

jruokolainen commented 3 years ago

> I've been using SMOTETomek in production with success for a while.
>
> @jruokolainen unrelated to the performance problem: out of curiosity, I would like to know more about practical applications of SMOTE in production: what kind of data are you working with? What kind of classifier do you use downstream in the pipeline? What is the class balancing ratio (5% vs 95% as in the reproducer)? What improvement in terms of balanced accuracy, F1, AUC or other metrics do you observe with SMOTETomek vs other balanced classification approaches (such as subsampling the majority class) or using BalancedRandomForest or LogisticRegression with class weights?

I can share some information about this. The dataset is aggregated website hit-level interaction data. The class balance ratio shifts daily, from 5%/95% to 1.5%/98.5%. The downstream classifier is a LightGBM GOSS booster. I tested BalancedRandomForest, LogisticRegression with class weights, and LR with minority-class upsampling (using KNN), but the GOSS model generalizes far better on unseen data. The downstream model parameters were tuned with Ray Tune (BOHB). Overall improvements across the classification metrics were around 5-15% (ROC AUC improved 10% compared to LR with upsampling; the model is nearly perfect on the test dataset). Improvements in production were approx. 5% compared to LR with upsampling; there is a lot of seasonality and fast change in the real-world environment, so the model is trained daily. In production we use the SMOTETomek-balanced data with the GOSS model and an LR with upsampled minority-class data. We use the probabilities of the two highest deciles from both models for ad targeting. This yielded the best results based on our A/B-test results. The overall improvement was quite drastic when comparing GOSS against LR in A/B testing. I cannot go into more detail, unfortunately.

Mariamamb commented 3 years ago

I am having the same problem with SMOTETomek when using a large dataset. It ran for 4 h before I killed the process. I am using imbalanced-learn 0.7.0 and scikit-learn 0.23.1. The shape is (2264594, 78), but highly imbalanced.

andrdpedro commented 3 years ago

I am having the same problem with SMOTEENN; it ran for 10 hours until I killed it... Has someone found a solution?

lucamagnasco commented 2 years ago

Same issue for SMOTETomek on a dataset of shape (266k, 25). SMOTE alone runs in less than a minute; SMOTETomek takes approximately an hour to run.

nurrrrx commented 2 years ago

Hi, I have high-dimensional data, 2M rows x 520 features, and the imbalance is 35K vs 2M (positive vs negative). I tried a simple fit_sample, and it has been running for 6-8 hours now on a single-core 32 GB machine. Is it a bad idea? Should I kill it? Would going for an AWS 512 GB instance with more cores, or a GPU, help?

glemaitre commented 2 years ago

The version in master has support for passing a duck-typed NN and thus GPU-accelerated instances. Since we could not reproduce the original bug and we have implemented the duck-typing, I will close this issue.
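
As an illustration only (a sketch, not something tested here), any estimator exposing fit and kneighbors should be usable; for instance, assuming a RAPIDS/cuML installation and your own X, y, a GPU-backed search could be plugged in roughly like this:

# Sketch: assumes a RAPIDS/cuML install and the duck-typing described above
from cuml.neighbors import NearestNeighbors as CuMLNearestNeighbors
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

# GPU-backed neighbors search passed to SMOTE, then wired into SMOTETomek
gpu_nn = CuMLNearestNeighbors(n_neighbors=5)
smote = SMOTE(k_neighbors=gpu_nn, random_state=0)
X_res, y_res = SMOTETomek(smote=smote, random_state=0, n_jobs=-1).fit_resample(X, y)

The same pattern should apply to approximate-neighbors libraries such as pynndescent or faiss, as long as they are wrapped behind the kneighbors convention.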

Elsa-gif commented 2 years ago

Hello, I'm trying to apply SMOTETomek to a dataset of size 2500000x32, but it runs endlessly. What should I do?

Prashavu commented 2 years ago

I'm having the same problem. I wasted a lot of time trying to run it. What is the solution? What version should we downgrade sklearn to? For now I had to use ordinary SMOTE and take a hit on accuracy because this is not working; it keeps fitting k-NN continuously.

The scikit-learn version is 1.0.2.

ogrisel commented 2 years ago

@Elsa-gif @Prashavu have you tried to pass an alternative nearest neighbors implementation?

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="kd_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="ball_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="brute"))
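
And since the reports above are about SMOTETomek and SMOTEENN rather than plain SMOTE, here is a sketch (with your own X and y) of wiring such a pre-configured SMOTE into them via their smote= parameter:

from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTE with an explicit neighbors search (here brute force, parallelized)
smote = SMOTE(
    k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="brute", n_jobs=-1),
    random_state=0,
)

# Reuse the configured SMOTE inside the combined resamplers
X_res, y_res = SMOTETomek(smote=smote, random_state=0, n_jobs=-1).fit_resample(X, y)
# or
X_res, y_res = SMOTEENN(smote=smote, random_state=0, n_jobs=-1).fit_resample(X, y)

The tomek= and enn= parameters similarly accept pre-configured TomekLinks and EditedNearestNeighbours instances.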
Prashavu commented 2 years ago

@ogrisel No, I had not tried these. How different is the default SMOTE object from these?

ogrisel commented 2 years ago

They should all behave the same, only faster or slower depending on the dimensionality (number of features) of the dataset and the number of CPU cores. Tree-based neighbors computation should be faster than the brute-force method in low dimensions (e.g. fewer than 50 features).
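
If in doubt, a quick sketch to time the neighbors search alone on data of the same shape (reusing the synthetic 63-feature setup from the earlier reproducer) before launching a full resampling:

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(
    n_samples=int(3e4), n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0,
)

# Compare the runtime of each nearest-neighbors algorithm on the same data
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm, n_jobs=-1).fit(X)
    tic = perf_counter()
    nn.kneighbors(X)
    print(f"{algorithm}: {perf_counter() - tic:.1f} s")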