EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
MIT License

MABEL: Attenuating Gender Bias using Textual Entailment Data

Authors: Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen

This repository contains the code for our EMNLP 2022 paper, "MABEL: Attenuating Gender Bias using Textual Entailment Data".

MABEL (a Method for Attenuating Bias using Entailment Labels) is a task-agnostic intermediate pre-training technique that leverages entailment pairs from NLI data to produce representations which are both semantically capable and fair. This approach exhibits a good fairness-performance tradeoff across intrinsic and extrinsic gender bias diagnostics, with minimal damage on natural language understanding tasks.

[Figure: Training schema]

Quick Start

With the transformers package installed, you can import the off-the-shelf model like so:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/mabel-bert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("princeton-nlp/mabel-bert-base-uncased")

Model List

| MABEL Models | ICAT ↑ |
|---|---|
| princeton-nlp/mabel-bert-base-uncased | 73.98 |
| princeton-nlp/mabel-bert-large-uncased | 73.45 |
| princeton-nlp/mabel-roberta-base | 69.68 |
| princeton-nlp/mabel-roberta-large | 69.49 |

Note: The ICAT score is a bias metric that consolidates a model's capacity for language modeling and stereotypical association into a single numerical indicator. More information can be found in the StereoSet (Nadeem et al., 2021) paper.
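
For reference, StereoSet defines ICAT as the language modeling score (LM) scaled by how close the stereotype score (SS) is to 50. The following is a minimal sketch of that formula; the function name icat_score is illustrative and is not part of this repository's code.

```python
def icat_score(lm: float, ss: float) -> float:
    """Idealized CAT score from StereoSet, with lm and ss given as percentages.

    An ideal model (lm = 100, ss = 50) scores 100; a fully biased model
    (ss = 0 or ss = 100) scores 0 regardless of lm.
    """
    return lm * min(ss, 100.0 - ss) / 50.0


# LM and SS values are taken from the StereoSet output reported in this README
# for princeton-nlp/mabel-bert-base-uncased.
print(round(icat_score(84.5453251710623, 56.248299466465376), 2))  # 73.98
```

Because the SS term is symmetric around 50, a model that prefers anti-stereotypes 56% of the time is penalized the same as one that prefers stereotypes 56% of the time.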

Training

Before training, make sure that the counterfactually-augmented NLI data, processed from SNLI and MNLI, is downloaded and stored under the training directory as entailment_data.csv.
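
As a rough illustration of what counterfactual augmentation means here, the sketch below swaps gendered terms in an entailment pair to produce its counterfactual counterpart. The word list, the function names, and the example sentences are illustrative assumptions; this is not the script used to build entailment_data.csv.

```python
import re
from typing import Tuple

# Illustrative pairs only (an assumption); a real list would be much larger
# and would need to handle ambiguous words such as "her".
PAIRS = [
    ("he", "she"), ("man", "woman"), ("men", "women"),
    ("boy", "girl"), ("father", "mother"), ("brother", "sister"),
]

# Build a lookup that maps each term to its counterpart in both directions.
SWAP = {}
for a, b in PAIRS:
    SWAP[a] = b
    SWAP[b] = a

# Match whole words only, so "the" or "human" are left untouched.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE)


def swap_gender(sentence: str) -> str:
    """Replace every listed gendered term with its counterpart, keeping capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(repl, sentence)


def augment_pair(premise: str, hypothesis: str) -> Tuple[str, str]:
    """Return the counterfactual (premise, hypothesis) for one entailment pair."""
    return swap_gender(premise), swap_gender(hypothesis)


print(augment_pair("A man plays guitar.", "A man makes music."))
```

Each original pair and its swapped counterpart would then sit side by side in the training data, which is what lets a model be trained to treat the two as equivalent.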

1. Install package dependencies

pip install -r requirements.txt

2. Run training script

cd training
chmod +x run.sh 
./run.sh

You can configure the hyper-parameters in run.sh accordingly. Models are saved to out/. The optimal set of hyper-parameters varies depending on the choice of backbone encoder, and the full training details can be found in the paper.

Evaluation

Intrinsic Metrics

If you use your own trained model instead of our provided HF checkpoint, you must first convert it before running intrinsic evaluation: python -m training.convert_to_hf --path /path/to/your/checkpoint --base_model bert. This converts the checkpoint to a standard BertForMaskedLM model; use --base_model roberta for RobertaForMaskedLM.

Also, please note that we use Meade et al.'s method of computation and datasets for both StereoSet and CrowS-Pairs; this is why the metrics for the pre-trained models are not directly comparable to those reported in the original benchmark papers.

1. StereoSet (Nadeem et al., 2021)

Command:

python -m benchmark.intrinsic.stereoset.predict --model_name_or_path princeton-nlp/mabel-bert-base-uncased && 
python -m benchmark.intrinsic.stereoset.eval

Output:

intrasentence
gender
Count: 2313.0
LM Score: 84.5453251710623
SS Score: 56.248299466465376
ICAT Score: 73.98003496789251

Collective Results:

| Models | LM ↑ | SS ◇ | ICAT ↑ |
|---|---|---|---|
| bert-base-uncased | 84.17 | 60.28 | 66.86 |
| princeton-nlp/mabel-bert-base-uncased | 84.54 | 56.25 | 73.98 |
| bert-large-uncased | 86.54 | 63.24 | 63.62 |
| princeton-nlp/mabel-bert-large-uncased | 84.93 | 56.76 | 73.45 |
| roberta-base | 88.93 | 66.32 | 59.90 |
| princeton-nlp/mabel-roberta-base | 87.44 | 60.14 | 69.68 |
| roberta-large | 88.81 | 66.82 | 58.92 |
| princeton-nlp/mabel-roberta-large | 89.72 | 61.28 | 69.49 |

◇: The closer to 50, the better.
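
To make the LM and SS columns concrete, here is a minimal sketch of how the two scores can be computed from per-example sentence scores, assuming each StereoSet example yields one score each for its stereotypical, anti-stereotypical, and unrelated candidates. The record layout and function name are hypothetical and do not mirror benchmark.intrinsic.stereoset.eval.

```python
from typing import Dict, List, Tuple


def lm_and_ss(examples: List[Dict[str, float]]) -> Tuple[float, float]:
    """Return (LM score, SS score) as percentages.

    LM: how often a meaningful candidate (stereotypical or anti-stereotypical)
        outscores the unrelated one; each example contributes two comparisons.
    SS: how often the stereotypical candidate outscores the anti-stereotypical one.
    """
    n = len(examples)
    related_wins = 0
    stereo_wins = 0
    for ex in examples:
        related_wins += ex["stereotype"] > ex["unrelated"]
        related_wins += ex["anti_stereotype"] > ex["unrelated"]
        stereo_wins += ex["stereotype"] > ex["anti_stereotype"]
    lm = 100.0 * related_wins / (2 * n)
    ss = 100.0 * stereo_wins / n
    return lm, ss


# Toy scores, invented purely for illustration.
toy = [
    {"stereotype": 0.6, "anti_stereotype": 0.3, "unrelated": 0.1},
    {"stereotype": 0.2, "anti_stereotype": 0.5, "unrelated": 0.1},
    {"stereotype": 0.4, "anti_stereotype": 0.3, "unrelated": 0.5},
    {"stereotype": 0.1, "anti_stereotype": 0.7, "unrelated": 0.3},
]
print(lm_and_ss(toy))  # (62.5, 50.0)
```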

2. CrowS-Pairs (Nangia et al., 2020)

Command:

python -m benchmark.intrinsic.crows.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased

Output:

====================================================================================================
Total examples: 262
Metric score: 50.76
Stereotype score: 51.57
Anti-stereotype score: 49.51
Num. neutral: 0.0
====================================================================================================

Collective Results:

| Models | Metric Score ◇ |
|---|---|
| bert-base-uncased | 57.25 |
| princeton-nlp/mabel-bert-base-uncased | 50.76 |
| bert-large-uncased | 55.73 |
| princeton-nlp/mabel-bert-large-uncased | 51.15 |
| roberta-base | 60.15 |
| princeton-nlp/mabel-roberta-base | 49.04 |
| roberta-large | 60.15 |
| princeton-nlp/mabel-roberta-large | 54.41 |

◇: The closer to 50, the better.
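
The sketch below shows how the three scores in the CrowS-Pairs output relate to one another, assuming each scored pair records its direction (stereo or antistereo) and whether the model scored the more stereotypical sentence higher. The dataclass and field names are hypothetical, and ties (reported as "Num. neutral") are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredPair:
    direction: str       # "stereo" or "antistereo"
    prefers_more: bool   # True if the more stereotypical sentence scored higher


def crows_scores(pairs: List[ScoredPair]) -> Dict[str, float]:
    """Compute the overall metric plus the per-direction scores, as percentages.

    Assumes at least one pair of each direction is present.
    """
    stereo = [p for p in pairs if p.direction == "stereo"]
    anti = [p for p in pairs if p.direction == "antistereo"]
    stereo_hits = sum(p.prefers_more for p in stereo)
    anti_hits = sum(p.prefers_more for p in anti)
    return {
        "metric": round(100.0 * (stereo_hits + anti_hits) / len(pairs), 2),
        "stereotype": round(100.0 * stereo_hits / len(stereo), 2),
        "anti_stereotype": round(100.0 * anti_hits / len(anti), 2),
    }


# Toy input, invented for illustration: 3 stereo pairs (2 hits), 2 antistereo pairs (1 hit).
toy_pairs = [
    ScoredPair("stereo", True), ScoredPair("stereo", True), ScoredPair("stereo", False),
    ScoredPair("antistereo", True), ScoredPair("antistereo", False),
]
print(crows_scores(toy_pairs))
```

The overall metric is therefore a count-weighted combination of the two per-direction scores, not their simple average.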

Extrinsic Metrics

1. Occupation Classification

See benchmark/extrinsic/occ_cls/README.md for full training instructions and results.

2. Natural Language Inference

See benchmark/extrinsic/nli/README.md for full training instructions and results.

3. Coreference Resolution

See benchmark/extrinsic/coref/README.md for full training instructions and results.

Language Understanding

1. GLUE (Wang et al., 2018)

We fine-tune on GLUE through the transformers library, following the default hyper-parameters.

A straightforward way is to clone the current transformers repository and install it from source:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .

Then set up the environment dependencies:

cd ./examples/pytorch/text-classification
pip install -r requirements.txt

Here is a sample script for one of the GLUE tasks, MRPC:

# task options: cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte 
export TASK_NAME=mrpc
export OUTPUT_DIR=out/

CUDA_VISIBLE_DEVICES=0 python run_glue.py \
  --model_name_or_path princeton-nlp/mabel-bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir $OUTPUT_DIR
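
To run every task listed in the comment above rather than a single one, a small driver along these lines could build the same command per task. This is a sketch, not part of the repository: the helper names and the per-task output directory layout are assumptions, and the flags simply mirror the sample script.

```python
import subprocess
from typing import List

# Task names copied from the "task options" comment in the sample script.
TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]


def glue_command(
    task: str,
    model: str = "princeton-nlp/mabel-bert-base-uncased",
    output_root: str = "out",
) -> List[str]:
    """Build the run_glue.py argument list for one task (same flags as the sample script)."""
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        # One sub-directory per task is an assumption, to keep runs from overwriting each other.
        "--output_dir", f"{output_root}/{task}",
    ]


def run_all(tasks: List[str] = TASKS) -> None:
    """Fine-tune on each task in turn; stops at the first failing run."""
    for task in tasks:
        subprocess.run(glue_command(task), check=True)
```

Calling run_all() from examples/pytorch/text-classification launches the runs sequentially; GPU selection can still be controlled by setting CUDA_VISIBLE_DEVICES in the environment, as in the sample script.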

2. SentEval Transfer Tasks (Conneau et al., 2018)

Preprocess:

Clone the SentEval repo and add its contents to this repository's transfer folder, then run ./get_transfer_data.bash in data/downstream to download the evaluation data.

Command:

python -m benchmark.transfer.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased --task_set transfer

Output:

+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 78.33 | 85.83 | 93.78 | 89.13 | 85.50 | 85.20 | 68.87 | 83.81 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Collective Results:

| Models | Transfer Avg. ↑ |
|---|---|
| bert-base-uncased | 83.73 |
| princeton-nlp/mabel-bert-base-uncased | 83.81 |
| bert-large-uncased | 86.54 |
| princeton-nlp/mabel-bert-large-uncased | 86.09 |

Code Acknowledgements

Citation

@inproceedings{he2022mabel,
   title={{MABEL}: Attenuating Gender Bias using Textual Entailment Data},
   author={He, Jacqueline and Xia, Mengzhou and Fellbaum, Christiane and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2022}
}