Add fairness - Githubissues

pplonski commented 1 year ago

Implement AutoML fairness based on https://arxiv.org/abs/2111.06495 by @qingyun-wu and @sonichi

Requirements:

fairness training will be triggered when sensitive attributes are set in AutoML fit method,
there should be several fairness metrics available with option to set custom metric,
fairness should be available in all four modes of MLJAR AutoML for all algorithms (including ensemble and stacked ensemble),
automatic documentation should have information about fairness metrics and include comparison of models,
provide unit tests to check new feature and performance tests,
update docs,
provide examples/tutorials.

Example code

X, y = load_training_data()

# init AutoML
automl = AutoML()

# case 1) training with sensitive attributes, use default fairness_metric
automl.fit(X, y, sensitive_features=["feature1", "feature2"])

# case 2) training with sensitive attributes and select fairness_metric
automl.fit(X, y, sensitive_features=["feature1", "feature2"], fairness_metric="equalized_odds")

# case 3) training with sensitive attributes and set custom fairness_metric
def custom fairness_metric(y_true, y_pred, sensitive_features, sample_weight=None):
    # implementation

automl.fit(X, y, sensitive_features=["feature1", "feature2"], fairness_metric=custom_fairness_metric)

codeboy5 commented 1 year ago

Hi @pplonski, I would be interested in contributing to this. Is there any way I can help ?

pplonski commented 1 year ago

Hi @codeboy5!

Thank you for your offer! I started to study FairAutoML more, I don't like the approach from the paper. It looks good in theory, but for real-life problems might be unusable. Just imagine mitigating unfairness when doing 10-fold Cross-Validation. I think that applying Exponentiated Gradient to each model in each fold might be very inefficient. I found that Exponentiated Griadent has trouble optimizing for more than 1 sensitive feature, for example, if you have 2 sensitive features (A and B), then a mitigated model might be fair for feature A but unfair for feature B...

I would like to have a method that will search for a sample weight that will provide fairness. Then I would like to reuse the same sample weight when doing hyperparameters search.

So I'm in the process of searching for a method for fair-optimal same weighting...

@codeboy5 do you have experience in fair ML or in optimization theory?

pplonski commented 1 year ago

I created a fairness module. It can compute fairness metrics and plots for binary classification tasks. It compute statistics for every sensitive feature separately.

Link to the module: https://github.com/mljar/mljar-supervised/tree/fairness/supervised/fairness

Example script that compute fairness metrics: https://github.com/mljar/mljar-supervised/blob/fairness/examples/scripts/binary_classifier_adult_fairness.py

The API:

automl = AutoML(algorithms=["Xgboost"])
automl.fit(X_train, y_train, sensitive_features=sensitive_features_train)

TODO:

The preprocessing for sensitive features should be improved. We need to remove rows with missing sensitive feature value and we should remove sensitive feature rows when target is missing.
Only split validation supports sensitive features right now. It should be extended to all supported validation strategies.
Better handle situations when senstive features and sample weights are provided.
Handle continuous sensitive features.

Example report with information about fairness:

fairness-metrics

pplonski commented 1 year ago

Improvements:

added fairness_metric, fariness_threshold and protected_groups in the API,
there are defaults set for fairness_metric and fairness_threshold,
changed selection rate plot to vertical, added background for fair area, fair area is computed based on fairness_threshold
changed the number format from % to 0.0000
added information about fairness in the leaderboard - fairness is reported separately for each sensitive feature

Questions:

maybe there should be protected_groups and unprotected_groups?

Demo: Peek 2023-05-01 13-34

pplonski commented 1 year ago

I think there should be at least 20 samples of same group to be considered in fairness mitigation. For example, we have a group defined by "Female", "Young<30", "Black", and there are only 5 samples for this group (0 samples with class 1). This group shouldn't be considered for computing fairness metrics, and shouldn't be considered for fairness mitigation.

pplonski commented 1 year ago

I've pushed the work in progress version for fairness mitigation. There are a lot of prints in the terminal - it is working version.

The algorithm is optimizing demographic parity ratio (it is hard coded). The output for mitigation for single feature (sex): single-feature-mitigation

The output for mitigation for two features (sex, is_young), is_young is categrocial feature craeted from age<50: two-features-mitigation

pplonski commented 1 year ago

TODO:

[x] add automatic threshold setup
[x] add privileged_groups and unprivileged_groups in the API
[x] add information about privileged and unprivileged values in the report
[x] automatically convert continuous sensitive features to binary
[x] add green shadow for fair values and red for unfair values, add text annotations
[x] add fairness scatter plots for leaderboard
[x] better model names
[x] weights must be positive values
[x] better step names
[x] detect automatically if we need to continue fairness optimization
[ ] detect not sufficient samples count for sensitive features
[x] select best algorithm based on fairness selection
[x] support all cross validation techniques (train/test split, cross-validation, and custom validation)
[ ] ~~support Optuna mode (plan is to first tune sample weights and then use Optuna to tune models)~~
[x] check Ensemble training
[x] check Stacked Ensemble training
[ ] support Equalized Odds optimization
[ ] add information about privileged and underprivileged groups provided by user in the optimization step
[ ] add preprocessing for sensitive features (rows drop when missing target)
[ ] add time constraint for bias mitigation step
[x] typo, it should be underprivileged_groups
[x] remove sentences about privileged groups for fair models
[x] use fair sample weights in not so random optimization step
[x] remove fairness threshold line for selection rate plot in the case of different metric than demographic_parity_ratio
[x] raise exception for default threshold value for regression when difference metric is used
[x] do not optimize for weights if they are the same as in previous step
[x] add information what can be further steps when AutoML cant construct fair model
[ ] add information to lower threshold

pplonski commented 1 year ago

If feature is not categorical it is automatically converted into binary feature based on equal samples number in each bin. We print the information in the terminal, example:

Sensitive features should be categorical
Apply automatic binarization for feature age
New values ['(37.0, 90.0]', '(16.999, 37.0]'] for feature age are applied

pplonski commented 1 year ago

The weights optimization stop condition is not yet implemented. This gives interesting behavior of algorithm. I was running the algorithm with privileged_groups and unprivileged_groups provided in API and DP Ratio goes above 1.0.

Please notice that below script is using two sensitive features sex and age. The privileged group is defined only for sex feature, and for this feature ratio is going above 1.0 (because there is no stop condition).

The age is passed as continuous features that is automatically converted into binary.

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from sklearn.datasets import fetch_openml

data = fetch_openml(data_id=1590, as_frame=True)
y = (data.target == ">50K") * 1

X = data.data
y = (data.target == ">50K") * 1

sensitive_features = X[["sex", "age"]] 

X_train, X_test, y_train, y_test, S_train, S_test = train_test_split(
    X, y, sensitive_features, stratify=y, test_size=0.5, random_state=42
)

automl = AutoML(algorithms=["Xgboost"],
                train_ensemble=False,
                fairness_metric="demographic_parity_ratio",  # 
                fairness_threshold=0.8,
                privileged_groups = [{"sex": "Male"}],
                unprivileged_groups = [{"sex": "Female"}]
            )

automl.fit(X_train, y_train, sensitive_features=S_train)

Output: fairness-above-1 0

pplonski commented 1 year ago

I've run the AutoML with several algorithms on Adults dataset with two sensitive features. Below is an output from example script: Peek 2023-05-05 19-38

pplonski commented 1 year ago

Preview of Fair Ensemble

Peek 2023-05-17 13-09

pplonski commented 1 year ago

Issues:

[ ] nan values for fairness metrics for binary classification on categorical target
[ ] nan values for equalized odds metrics

mosaikme commented 9 months ago

Can we disable the Fairnes Metric?

pplonski commented 9 months ago

Hi @mosaikme,

fairness is only used when you pass sensitive_features in fit() otherwise it is skipped.

mljar / mljar-supervised

Add fairness #612

Requirements:

Example code

Example report with information about fairness:

Preview of Fair Ensemble