princeton-nlp / MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
MIT License

Asking for dataset help #2

Closed: yqw0710 closed this issue 1 year ago

yqw0710 commented 1 year ago

Hi, I read your article and found the experimental results very effective. I learned that your training data comes from MNLI and SNLI, but I couldn't find the specific preprocessing steps. Could you please provide the preprocessing code? Thank you very much!

aditya-anulekh commented 1 year ago

Hey! Did you manage to find the preprocessing steps for generating the dataset? If so, could you please point me to them? Thanks!

jacqueline-he commented 1 year ago

The dataset can be found via the link in the README.

I have since graduated and no longer have access to the original preprocessing script, but I'd imagine that recreating it should be pretty simple. You can download the SNLI or MNLI dataset from HuggingFace and filter for the entailment pairs (the premise and hypothesis become orig_sent0 and orig_sent1, respectively). Then apply CDA using the 10 or so gender word pairs listed in Appendix A of the paper; the gender-flipped premise and hypothesis become aug_sent0 and aug_sent1, respectively. Finally, there is a column that is 1 if both the premise and the hypothesis contain gendered words that get flipped, and 0 otherwise (this affects the computation of one of the losses).
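
For anyone reconstructing this, here is a minimal sketch of that recipe (not the original script). It assumes the SNLI column names and label encoding on the HuggingFace hub, uses only an illustrative subset of the Appendix A word pairs, a naive regex-based word swap, and a guessed name (`both`) for the indicator column.

```python
import re
from datasets import load_dataset

# Illustrative subset of the gender word pairs; the full list of ~10 pairs
# is in Appendix A of the paper.
GENDER_PAIRS = [
    ("he", "she"), ("man", "woman"), ("men", "women"),
    ("boy", "girl"), ("father", "mother"), ("son", "daughter"),
    ("brother", "sister"), ("husband", "wife"),
]
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a], SWAP[b] = b, a


def flip_gender(sentence):
    """Counterfactually augment a sentence by swapping gendered words.

    Returns the flipped sentence and whether any word was swapped.
    """
    changed = False

    def swap(match):
        nonlocal changed
        word = match.group(0)
        repl = SWAP.get(word.lower())
        if repl is None:
            return word
        changed = True
        # Preserve simple capitalization ("He" -> "She").
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, sentence), changed


# SNLI from the HuggingFace hub; label 0 is "entailment".
snli = load_dataset("snli", split="train")
entailment = snli.filter(lambda ex: ex["label"] == 0)

rows = []
for ex in entailment:
    aug0, changed0 = flip_gender(ex["premise"])
    aug1, changed1 = flip_gender(ex["hypothesis"])
    rows.append({
        "orig_sent0": ex["premise"],
        "orig_sent1": ex["hypothesis"],
        "aug_sent0": aug0,
        "aug_sent1": aug1,
        # 1 only if both sentences contained gendered words that were flipped
        # (the column name here is a guess).
        "both": int(changed0 and changed1),
    })
```

The MNLI case should be analogous, e.g. swapping in `load_dataset("multi_nli")`, which uses the same premise/hypothesis/label columns.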

Let me know if you have any other questions.