A curated list of awesome datasets with human label variation (un-aggregated labels) in Natural Language Processing and Computer Vision, including links to related initiatives and key references. The table below collects datasets that provide multiple annotations per instance, to enable learning with human label variation (disagreement). The starting point of Table 1 was the table in the appendix of our paper.
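Un-aggregated labels make it possible to derive soft labels or to quantify per-instance disagreement directly, rather than collapsing annotations via majority vote. A minimal sketch of both ideas, using hypothetical toy labels (the label names and annotator counts are illustrative only, not tied to any dataset listed below):

```python
from collections import Counter
from math import log

def soft_label(annotations, classes):
    """Turn the un-aggregated annotations for one instance into a
    probability distribution over the label set (a 'soft label')."""
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts.get(c, 0) / total for c in classes}

def disagreement(annotations):
    """Normalized Shannon entropy of the annotations for one instance:
    0.0 means all annotators agree, 1.0 means an even split."""
    counts = Counter(annotations)
    total = len(annotations)
    if len(counts) < 2:
        return 0.0
    probs = [n / total for n in counts.values()]
    entropy = -sum(p * log(p) for p in probs)
    return entropy / log(len(counts))

# Hypothetical instance annotated by five crowd workers:
labels = ["offensive", "offensive", "not_offensive", "offensive", "not_offensive"]
print(soft_label(labels, ["offensive", "not_offensive"]))
# {'offensive': 0.6, 'not_offensive': 0.4}
print(round(disagreement(labels), 3))  # 0.971
```

Soft labels like these can be used directly as training targets (e.g. with a cross-entropy loss against the distribution), one common way the datasets below are used for learning with label variation.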
If you know of resources, papers, or links that are not yet listed, please help grow this resource. You can contribute by creating a pull request as outlined in contributing.md.
Please cite our paper (Plank, EMNLP 2022) if you find this repository useful:
@inproceedings{plank-2022-emnlp,
title = "The ``Problem'' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation",
author = "Plank, Barbara",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
}
Icons refer to the following:
The list above contains selected key references. Please see our EMNLP 2022 theme paper (Plank, 2022) for further references related to annotator culture/backgrounds, different terms proposed in the literature, and more. If you know of relevant related work (not datasets), please open an Issue. For more datasets, please see contributing.md
Reference | Name or Description | URL | Used in/Listed on |
---|---|---|---|
Passonneau et al., 2010 | Word sense disambiguation (WSD) | https://anc.org/ | |
Plank et al., 2014 | Part-of-Speech (POS) tagging, 500 tweets from Lowlands and Gimpel POS | https://bitbucket.org/lowlands/costsensitive-data/ or https://zenodo.org/record/5130737 | :mag:, :shrug: |
Derczynski et al., 2016 | Broad Named Entity Recognition (NER) Twitter dataset | https://github.com/GateNLP/broad_twitter_corpus | :pie: |
Rodrigues et al., 2018 | NER dataset, re-annotated sample of CoNLL 2003 | http://fprodrigues.com//publications/deep-crowds/ | |
Martinez Alonso et al., 2016 | Supersense tagging | https://github.com/coastalcph/semdax | |
Berzak et al., 2016 | Dependency Parsing, WSJ-23, 4 annotators | https://people.csail.mit.edu/berzak/agreement/ | |
Peng et al., 2022 | GCDT, Mandarin Chinese discourse treebank, small subsection with double annotations | https://github.com/logan-siyao-peng/GCDT/tree/main/data | |
Bryant and Ng, 2015 | Grammatical error correction | http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip | |
Poesio et al. 2019 | PD (Phrase Detectives dataset): Anaphora and Information Status Classification | https://github.com/dali-ambiguity/Phrase-Detectives-Corpus-2.1.4 | :mag:, :shrug: |
Dumitrache et al. 2018 | Medical Relation Extraction (MRE) | https://github.com/CrowdTruth/Open-Domain-Relation-Extraction | :mag: |
Bassignana and Plank, 2022 | CrossRE, relation extraction, small doubly-annotated subset | https://github.com/mainlp/CrossRE | |
Dumitrache et al. 2018 | Frame Disambiguation | https://github.com/CrowdTruth/FrameDisambiguation | |
Snow et al. 2008 | RTE (recognizing textual entailment; 800 hypothesis-premise pairs) collected by Dagan et al. 2005, re-annotated; includes further datasets on temporal ordering, WSD, word similarity and affective text | https://sites.google.com/site/nlpannotations/ | :mag: |
Pavlick and Kwiatkowski 2019 | NLI (natural language inference) inherent disagreement dataset, approx. 500 RTE instances from Dagan et al. 2005 re-annotated by 50 annotators | https://github.com/epavlick/NLI-variation-data | |
Nie et al., 2020 | ChaosNLI, large NLI dataset re-annotated by 100 annotators | https://github.com/easonnie/ChaosNLI | |
Demszky et al., 2020 | GoEmotions: reddit comments annotated for 27 emotion categories or neutral | https://github.com/google-research/google-research/tree/master/goemotions | :eyeglasses: |
Ferracane et al., 2021 | Subjective discourse: conversation acts and intents | https://github.com/elisaF/subjective_discourse | |
Damgaard et al., 2021 | Understanding indirect answers to polar questions | https://github.com/friendsQIA/Friends_QIA | |
de Marneffe et al., 2019 | CommitmentBank: 8 annotations indicating the extent to which the speakers are committed to the truth of the embedded clause | https://github.com/mcdm/CommitmentBank | |
Kennedy et al., 2020 | Hate speech detection | https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech | :pie:, :eyeglasses: |
Dinu et al., 2021 | Pejorative words dataset | https://nlp.unibuc.ro/resources or http://pdai.info/ | :pie: |
Leonardelli et al., 2021 | MultiDomain Agreement, Offensive language detection on Twitter, 5 offensive/non-offensive labels; also part of Le-Wi-Di SemEval23 | https://github.com/dhfbk/annotators-agreement-dataset/ | :thumbsup: :thumbsdown:, :pie: |
Cercas Curry et al., 2021 | ConvAbuse, abusive language towards three conversational AI systems; also part of Le-Wi-Di SemEval23 | https://github.com/amandacurry/convabuse | :thumbsup: :thumbsdown:, :pie: |
Liu et al., 2019 | Work and Well-being Job-related Tweets, 5 annotators | https://github.com/Homan-Lab/pldl_data | :pie: |
Simpson et al., 2019 | Humour: pairwise funniness judgements | https://zenodo.org/record/5130737 | :shrug: |
Akhtar et al., 2019 | HS-brexit; abusive language on Brexit annotated for hate speech (HS), aggressiveness and offensiveness, 6 annotators; extended, with new parts included in Le-Wi-Di SemEval23 | https://le-wi-di.github.io/ | :thumbsup: :thumbsdown: |
Almanea and Poesio 2022 | ArMIS; New Le-Wi-Di SemEval23 dataset on Arabic tweets annotated for misogyny detection | https://le-wi-di.github.io/ | :thumbsup: :thumbsdown: |
Sap et al., 2022 | Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection | http://maartensap.com/racial-bias-hatespeech/ | |
Kumar et al., 2021 | Designing Toxic Content Classification for a Diversity of Perspectives | https://data.esrg.stanford.edu/study/toxicity-perspectives (contact author for password) | |
Nguyen et al., 2017 | Biomedical Information Retrieval, each doc is annotated by roughly 5 Amazon Mechanical Turk workers | https://github.com/yinfeiy/PICO-data | |
Zhang et al., 2022 | Chinese Sentiment Words Identification, each sentence is annotated by 3-5 workers | https://github.com/izhx/crowd-OEI | |
Grubenmann et al., 2018 | Sentiment annotations for Swiss German sentences | https://github.com/spinningbytes/SB-CH | |
Ji et al., 2022 | KiloGram tangram dataset, 10 annotations per tangram (EMNLP 2022 best long paper award) | https://github.com/lil-lab/kilogram | |
Kennedy et al., 2020 | The Gab Hate Corpus: a collection of 27k posts annotated for hate speech. [#Labels: 2, #Unique Raters: 18, at least 3 annotations per instance] | https://osf.io/edua3/ | |
Haber et al., 2023 | SOA: Singapore online attacks, multilingual toxic data annotated with 3 annotators. | https://github.com/rewire-online/singapore-online-attacks/tree/main | |
Liu et al., 2022 | Word Associations with 19K explanations and 725 relation labels from 5 annotators | https://github.com/ChunhuaLiu596/WAX/ | |
Frermann et al., 2023 | Multi-label frame annotations of 428 news articles, each labeled by 2-3 annotators | https://github.com/phenixace/narrative-framing/tree/main/data | |
Sap et al., 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language (3 annotators) | https://maartensap.com/social-bias-frames/ | :small_orange_diamond: |
Fleisig et al., 2023 | FairPrism: Evaluating Fairness-Related Harms in Generated Text (3 annotators) | https://github.com/microsoft/FairPrism | |
Forbes et al., 2020 | Social Chemistry 101: Learning to Reason about Social and Moral Norms (up to 5 crowd annotations) | https://github.com/mbforbes/social-chemistry-101 | :small_orange_diamond: |
Lourie et al., 2021 | Scruples-dilemmas: A Corpus of Community Ethical Judgments (with 5 crowd annotations per instance) | https://github.com/allenai/scruples | :small_orange_diamond: |
Potts et al., 2021 | Dyna-Sentiment (5 crowd annotations) | https://github.com/cgpotts/dynasent | :small_orange_diamond: |
Danescu-Niculescu-Mizil et al. 2013 | Wikipedia Politeness (with up to 5 crowd annotations) | https://convokit.cornell.edu/documentation/wiki_politeness.html or https://github.com/minnesotanlp/Quantifying-Annotation-Disagreement | :small_orange_diamond: |
Madeddu et al., 2023 | DisaggregHateIt: A Disaggregated Italian Dataset of Hate Speech (1.1k tweets annotated for hate, irony, stance; between 1 and 13 annotations per instance) | https://github.com/madeddumarco/DisaggregHateIt | |
Reference | Name or Description | URL | Used in/Listed on |
---|---|---|---|
Rodrigues et al. 2018 | LabelMe: Image classification dataset with 8 categories, re-annotated | http://fprodrigues.com//publications/deep-crowds/ | :mag:, :shrug: |
Peterson et al., 2019 | Cifar10H: Image classification with 10 categories, re-annotated | http://github.com/jcpeterson/cifar-10h | :mag:, :shrug: |
Cheplygina et al. 2018 | Medical lesion classification challenge, 6 annotators each | https://figshare.com/s/5cbbce14647b66286544 | |
Wei, Zhu et al., 2022 | CIFAR-100N | http://noisylabels.com/ | |
Nguyen et al., 2020 | VinDr-CXR: Object detection dataset on chest X-ray images, each training image labeled by 3 annotators | https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/ or https://vindr.ai/datasets/cxr | |
Tschirschwitz et al., 2022 | TexBiG: Instance segmentation dataset on historical layout analysis, each training image labeled by 2-4 annotators | https://zenodo.org/record/8347059 or https://www.kaggle.com/datasets/davidtschirschwitz/texbig-v2-0-train-val | |