scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License
6.81k stars 1.28k forks

Using sample_weight in a pipeline #833

Closed FelixNeutatz closed 3 years ago

FelixNeutatz commented 3 years ago

Hello everybody,

I am a big fan of this library and I am really happy about the Pipeline API. But I have the following problem: I want to use the sample_weight parameter together with the pipeline:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split as tts
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

pca = PCA()
smt = SMOTE(random_state=42)
knn = RandomForestClassifier()
pipeline = Pipeline([('smt', smt), ('pca', pca), ('knn', knn)])
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)
pipeline.fit(X_train, y_train, sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))

Of course, this code does not work, because SMOTE resamples y_train inside the pipeline, so the precomputed weights no longer match the labels the classifier receives. The naive fix would be to generate the augmented dataset first and then fit the classifier, but that would give up the convenience of the Pipeline API. Does anybody have an idea how to keep both the pipeline and the sample_weight parameter?

Best regards, Felix

FelixNeutatz commented 3 years ago

Hi everybody,

I found an easy way around it. You can just implement a wrapper around the classifier class and adjust its fit() method.
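
A minimal sketch of what such a wrapper might look like, assuming the goal is to compute the weights inside fit() from whatever labels arrive there. The class name BalancedWeightClassifier and the attribute sample_weight_ are hypothetical and not part of imbalanced-learn or scikit-learn; this is one possible way to do it, not necessarily the exact code used to close the issue.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight


class BalancedWeightClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: derives sample_weight from y inside fit()."""

    def __init__(self, estimator, class_weight="balanced"):
        self.estimator = estimator
        self.class_weight = class_weight

    def fit(self, X, y):
        # y is whatever the previous pipeline steps produced (e.g. after
        # resampling), so the weights always have the right length.
        self.sample_weight_ = compute_sample_weight(
            class_weight=self.class_weight, y=y
        )
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, sample_weight=self.sample_weight_)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


# Small standalone check on imbalanced data (no sampler involved here).
X, y = make_classification(n_samples=200, weights=[0.1, 0.9], flip_y=0,
                           random_state=10)
clf = BalancedWeightClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
clf.fit(X, y)
preds = clf.predict(X)
```

In the pipeline from the question, the wrapper would take the place of the bare classifier, for example ('knn', BalancedWeightClassifier(RandomForestClassifier())), and pipeline.fit(X_train, y_train) is then called without any sample_weight argument. Note that with SMOTE's default settings the resampled classes end up the same size, so 'balanced' weights come out uniform; the wrapper mainly matters when the sampler leaves some imbalance or when a different weighting rule is used inside fit().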

Best regards, Felix