scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

Applying sampling method to sensitive features for fairness models #1085

Open haytham918 opened 5 months ago

haytham918 commented 5 months ago

I am currently trying to combine imblearn's sampling methods, such as SMOTE() and NearMiss(), with ThresholdOptimizer and AdversarialFairnessClassifier from fairlearn. When I run them all in an imblearn pipeline (sampling followed by the classifier), the sampling step fails; my guess is that it does not know what to do with the sensitive features we pass as metadata. Right now I am contorting the workflow to get around this, but I would like to know if there is a configuration or feature that solves it cleanly.

glemaitre commented 5 months ago

Could you provide a minimal example with toy data, along with the versions of the different models?

glemaitre commented 5 months ago

It is quite possible that we need to modify our Pipeline implementation to make it compatible with scikit-learn's metadata routing.

haytham918 commented 5 months ago
import pandas as pd
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from fairlearn.adversarial import AdversarialFairnessClassifier
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import GridSearchCV
import sklearn
sklearn.set_config(enable_metadata_routing=True)

data = {
    'race': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    'indicator': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
}

X = pd.DataFrame(data)

Y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Sensitive features
Z = X['race']

mitigator = AdversarialFairnessClassifier(
    backend="torch",
    predictor_model=[50, "relu"],
    adversary_model=[3, "relu"],
    batch_size=2**8,
    progress_updates=0.5,
    random_state=123,
).set_fit_request(sensitive_features=True)

pipe = ImbPipeline([
  ("scaling", Normalizer()), ("sampling", SMOTE()), ("classifier", mitigator)])

param_grid = {
    "classifier__batch_size": [2**6]
}

grid_s = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid_s.fit(X, Y, sensitive_features=Z)

Here is a piece of code that demonstrates the issue. That said, I suspect fairlearn also has some incompatibility on its side at the moment.