===============================
Datasets for Entity Recognition
===============================

This repository contains datasets from several domains
annotated with a variety of entity types, useful for entity recognition and
named entity recognition (NER) tasks.
NOTE: I am no longer actively adding datasets to this list -- there are likely more NER datasets that have appeared since 2020. However, I am happy to add more datasets via issues or pull requests.

Datasets for NER in English
===========================

.. |check| unicode:: 0x2714

The following table lists datasets for English-language entity recognition.
The ``data`` directory explains where to obtain the datasets that could not be
shared because of licensing restrictions, and contains code to convert them
(if necessary) to the CoNLL 2003 format. NER corpora in other languages are
listed further below.

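
As a rough illustration of the target format, the sketch below reads and writes
CoNLL-style text. It is not the conversion code shipped in this repository: the
function names ``read_conll`` and ``write_conll`` are hypothetical, the sample
sentences and tags are only illustrative, and it assumes whitespace-separated
columns with the token first and the entity tag last, blank lines between
sentences, and ``-DOCSTART-`` lines marking document boundaries.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, entity tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the entity tag is the
    last column; any columns in between (POS, chunk) are ignored.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def write_conll(sentences: Iterable[Sentence]) -> str:
    """Serialize sentences as two-column 'token tag' lines."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


# Illustrative four-column sample (token, POS, chunk, entity tag).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

parsed = read_conll(sample.splitlines())
print(parsed[1])
print(write_conll(parsed))
```

Keeping only the first and last columns means the same reader works on both the
four-column files and simpler two-column ``token tag`` files.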

============== =============== ======================= =============================== ==================================
Dataset        Domain          License                 Reference                       Availability
============== =============== ======================= =============================== ==================================
CoNLL 2003     News            DUA                     Sang and Meulder, 2003          `Easy <https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003/>`_ `to <https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003>`_ `find <https://github.com/glample/tagger/tree/master/dataset>`_
NIST-IEER      News            None                    NIST 1999 IE-ER                 `NLTK data <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/ieer.zip>`_
MUC-6          News            LDC                     Grishman and Sundheim, 1996     `LDC 2003T13 <https://catalog.ldc.upenn.edu/LDC2003T13>`_
OntoNotes 5    Various         LDC                     Weischedel et al., 2013         `LDC 2013T19 <https://catalog.ldc.upenn.edu/LDC2013T19>`_
BBN            Various         LDC                     Weischedel and Brunstein, 2005  `LDC 2005T33 <https://catalog.ldc.upenn.edu/LDC2005T33>`_
GMB-1.0.0      Various         None                    Bos et al., 2017                `http://gmb.let.rug.nl/data.php <http://gmb.let.rug.nl/releases/gmb-1.0.0.zip>`_
GUM-3.1.0      Wiki            Several (2)             Zeldes, 2017                    |check| Included here
wikigold       Wikipedia       CC-BY 4.0               Balasuriya et al., 2009         |check| Included here
Ritter         Twitter         None                    Ritter et al., 2011             `No split <https://github.com/aritter/twitter_nlp/blob/master/data/annotated/ner.txt>`_, `Train/test/dev split <https://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16/data>`_
BTC            Twitter         CC-BY 4.0               Derczynski et al., 2016         |check| Included here
WNUT17         Social media    CC-BY 4.0               Derczynski et al., 2017         |check| Included here
i2b2-2006      Medical         DUA                     Uzuner et al., 2007             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
i2b2-2014      Medical         DUA                     Stubbs et al., 2015             `http://www.i2b2.org <https://www.i2b2.org/NLP/DataSets/Main.php>`_
CADEC          Medical         CSIRO                   Karimi et al., 2015             http://data.csiro.au/
AnEM           Anatomical      CC-BY-SA 3.0            Ohta et al., 2012               |check| Included here
MITRestaurant  Queries         None                    Liu et al., 2013a               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/restaurant/>`_
MITMovie       Queries         None                    Liu et al., 2013b               `http://groups.csail.mit.edu/sls/ <https://groups.csail.mit.edu/sls/downloads/movie/>`_
MalwareTextDB  Malware         None                    Lim et al., 2017                `http://www.statnlp.org/ <http://www.statnlp.org/research/re/MalwareTextDB-1.0.zip>`_
re3d           Defense         Several (1)             DSTL, 2017                      |check| Included here
SEC-filings    Finance         CC-BY 3.0               Alvarado et al., 2015           |check| Included here
Assembly       Robotics        X                       Costa et al., 2017              X
WikiNEuRal     Wikipedia       CC BY-SA-NC 4.0         Tedeschi et al., 2021           https://github.com/Babelscape/wikineural
MultiNERD      Wikipedia       CC BY-SA-NC 4.0         Tedeschi et al., 2022           https://github.com/Babelscape/multinerd
HIPE-2022      Historical      CC BY-SA-NC 4.0         Ehrmann et al., 2022            https://github.com/hipe-eval/HIPE-2022-data
Music-NER      Music           MIT                     Epure and Hennequin, 2023       https://github.com/deezer/music-ner-eacl2023
WIESP2022-NER  Astrophysics    CC BY-SA-NC 4.0         Grezes et al., 2022             https://huggingface.co/datasets/adsabs/WIESP2022-NER
NNE            News            CC 4.0 / LDC            Ringland et al., 2019           https://github.com/nickyringland/nested_named_entities
WorldWide      News            CC BY-SA-NC 4.0         Shan et al., 2023               https://github.com/stanfordnlp/en-worldwide-newswire https://arxiv.org/abs/2404.13465
============== =============== ======================= =============================== ==================================

Licenses
--------

Notes on licenses:

(1) re3d ("Relationship and Entity Extraction Evaluation Dataset") contains
    several datasets, with different licenses. These are:

    - CC-BY-SA 3.0 (Wikipedia dataset)
    - CC BY-NC 3.0 (BBC_Online dataset)
    - CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)
    - public domain (US_State_Department dataset, CENTCOM dataset)
    - UK Open Government Licence v3.0 (UK_Government dataset)
    - Delegation_of_the_European_Union_to_Syria: see
      https://eeas.europa.eu/delegations/syria/8157/legal-notice_en

(2) GUM 3.1.0 comprises three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and
    CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.

More detailed license information for each dataset can be found in
the corresponding subdirectory.

Later ...

Datasets for NER in other languages
===================================

Lexical Named Entity resources
------------------------------

Code-Switching
--------------

German
------

- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data
- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html
- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- German EUROPARL transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml
- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DFKI SmartData Corpus (geo-entities): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events. Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, Leonhard Hennig. Proceedings of LREC, 2018)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Elena Leitner, Georg Rehm, Julián Moreno-Schneider, A Dataset of German Legal Documents for Named Entity Recognition, LREC 2020: http://georg-re.hm/pdf/LREC-2020-Leitner-et-al-preprint.pdf ; Data: https://github.com/elenanereiss/Legal-Entity-Recognition
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data

Dutch
-----

- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Dutch parliamentary documents 2015-2016, from 1848.nl (Jonkers, Named Entity Recognition on Dutch Parliamentary Documents using Frog, thesis, University of Amsterdam, 2016): https://github.com/Poezedoez/NER/blob/master/Code/data/lobby/golden_standard
- SONAR 1 - Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (hierarchy of classes)
- Corpus-SONAR books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85 ; http://portal.clarin.nl/node/1940

Afrikaans
---------

Spanish
-------

- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/
- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en
- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/LDC2018T01
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es
- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/LDC2014T18
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in "Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level")
- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2
- DrugSemantics Gold Standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) - named entity recognition of a critical type of concept related to cancer, namely tumor morphology in Spanish medical texts: https://temu.bsc.es/cantemist/

Catalan
-------

Galician
--------

Basque
------

Portuguese
----------

French
------

- ESTER: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0241/
- ESTER 2: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0338/
- ETAPE: http://catalogue.elra.info/en-us/repository/browse/ELRA-E0046/
- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access
- QUAERO French Medical Corpus: https://quaerofrenchmed.limsi.fr/
- Quaero Broadcast News Extended Named Entity Corpus: http://catalog.elra.info/en-us/repository/browse/ELRA-S0349/
- Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/ELRA-W0073/
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNER-fr-gold https://arxiv.org/abs/2411.00030 https://huggingface.co/datasets/danrun/WikiNER-fr-gold
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- CAp 2017 - (Twitter data), Lopez et al., CAp 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html
- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data

Italian
-------

- KIND: https://github.com/dhfbk/KIND
- Evalita: http://www.evalita.it/2009/tasks/entity
- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/
- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it
- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it
- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- WikiNEuRal: https://github.com/Babelscape/wikineural
- MultiNERD: https://github.com/Babelscape/multinerd
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
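
Romanian
--------

Greek
-----

Hungarian
---------

Czech
-----

Polish
------

Croatian
--------

Slovak
------

Slovene
-------

Ukrainian
---------

Serbian
-------

Bulgarian
---------

Icelandic
---------
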
- MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson. "Towards High Accuracy Named Entity Recognition for Icelandic." Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019): http://www.malfong.is/index.php?pg=mim_gold_ner

Danish
------

Norwegian
---------

Swedish
-------

Finnish
-------

Estonian
--------

Latvian and Lithuanian
----------------------

Turkish
-------

Kazakh
------

Uyghur
------

- Uyghur Named Entity Relation corpus: https://github.com/kaharjan/UyNeRel (Abiderexiti et al., Annotation Schemes for Constructing Uyghur Named Entity Relation Corpus. IALP 2016)

Armenian
--------

Coptic
------

Amharic
-------

Arabic
------

- AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/ArabicNER/
- NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/ELRA-W0078/
- REFLEX Entity Translation (Parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2009T11
- ANERCorp: http://users.dsic.upv.es/~ybenajiba/downloads.html (See also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)
- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2004T09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2005T09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2006T06
- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/LDC2014T18
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2013T19
- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation
- Wojood - 2022 Nested Arabic Named Entity Corpus. https://dlnlp.ai/st/wojood/ https://aclanthology.org/2022.lrec-1.387.pdf https://codalab.lisn.upsaclay.fr/competitions/11740
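
Persian
-------

Sindhi
------

Urdu
----

Indic
-----

Hindi
-----

Bengali
-------

Telugu
------

Maithili
--------

Nepali
------

Marathi
-------

Punjabi
-------

Tamil
-----

Malayalam
---------

Oriya/Odia
----------

Sinhala/Sinhalese
-----------------

Thai
----

Indonesian
----------

Vietnamese
----------

Japanese
--------
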
- IREX: https://nlp.cs.nyu.edu/irex/Package/
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)
- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
- Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/
- Wikipedia NER Corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset
- WikiANN: https://elisa-ie.github.io/wikiann/
- GSD: Conversion of the UD GSD dataset to named entities by Megagon Labs https://github.com/megagonlabs/UD_Japanese-GSD
- KWDLC: Kyoto University Web Document Leads Corpus https://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KWDLC https://github.com/ku-nlp/KWDLC https://nagisa.readthedocs.io/en/latest/tutorial_ner.html

Korean
------

Chinese
-------

- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2004T09
- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2005T09
- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2006T06
- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2013T19
- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/
- REFLEX Entity Translation (Parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2009T11
- NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/ELRA-W0079/
- Original Short-Message Data Collation I in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/ELRA-W0045_04/
- Original Short-Message Data Collation II in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/ELRA-W0045_08/
- ERE DEFT Corpora (Parallel corpus: English, Chinese): Mott et al., Parallel Chinese-English Entities, Relations and Events Corpora, 2016 (LDC2015E78 , LDC2014E114)
- Chinese Weibo: DEFT ERE style annotations for named and nominal mentions on Chinese social media (Weibo): https://github.com/hltcoe/golden-horse
- Chinese EduNER: 2023 dataset in the Education domain: https://link.springer.com/article/10.1007/s00521-023-08635-5 https://github.com/anonymous-xl/eduner
- Chinese Aerospace NER: https://www.nature.com/articles/s41598-023-50705-0 https://github.com/Coder-XIAOKAI/Aerospace_NERdatasets
- SciCN: A Chinese Dataset and Benchmark for Scientific Information Extraction https://file.techscience.com/files/cmc/2024/TSP_CMC-78-3/TSP_CMC_35594/TSP_CMC_35594.pdf https://github.com/yangjingla/SciCN
- ENP NER: Historical Chinese https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER
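
Tagalog
-------

Russian
-------

Yoruba
------

Swahili
-------

Igbo
----

isiNdebele
----------

Xhosa
-----

Zulu
----

Sepedi
------

Sesotho
-------

Setswana
--------

Siswati
-------

Venda
-----

Xitsonga
--------

Latin
-----
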
A long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html

References
==========

[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor,
and Timothy Baldwin. Domain adaption of named entity recognition to support
credit risk assessment. In Proceedings of the Australasian Language Technology
Association Workshop 2015, pp. 84-90. 2015.
Accessed: August 2018.
[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman,
Tara Murphy, and James R. Curran. Named entity recognition in wikipedia. In
Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively
Constructed Semantic Resources, pp. 10-18. Association for Computational
Linguistics, 2009.
[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang,
Noortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank.
In Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.
[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts.
Broad twitter corpus: A diverse named entity recognition resource. In
Proceedings of COLING 2016, the 26th International Conference on Computational
Linguistics: Technical Papers, pp. 1169-1179. 2016.
Available at: https://github.com/GateNLP/broad_twitter_corpus
Accessed: August 2018.
[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp,
Nut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and
Emerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy,
User-generated Text.
Available at: https://noisy-text.github.io/2017/emerging-rare-entities.html
[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and
Entity Extraction Evaluation Dataset. https://github.com/dstl/re3d.
Accessed: January 2018.
[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996.
Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1:
The 16th International Conference on Computational Linguistics.
[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp,
and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations.
Journal of biomedical informatics, 55:73-81. Available at https://data.csiro.au
Accessed: November 2017.
[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and
Chen Hui Ong. MalwareTextDB: A database for annotated malware articles.
In Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.
[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and
Jim Glass. 2013. Asgard: A portable architecture for multilingual dialogue
systems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE
International Conference on, pages 8386-8390. IEEE.
Available at https://groups.csail.mit.edu/sls/downloads/restaurant/
Accessed: January 2018
[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers,
and Jim Glass. 2013. Query understanding enhanced by hierarchical parsing
structures. In Automatic Speech Recognition and Understanding (ASRU),
2013 IEEE Workshop on, pages 72-77. IEEE.
Available at https://groups.csail.mit.edu/sls/downloads/movie/
We used the trivia10k13 portion. Accessed: January 2018
[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition
Evaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm.
The newswire development test data only (included in the NLTK package).
[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia
Ananiadou. 2012. Open-domain Anatomical Entity Mention Detection. In
Proceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse
(DSSD), pp. 27-36.
Available at: http://www.nactem.ac.uk/anatomy/ and
https://github.com/openbiocorpora/anem Accessed: November 2017.
[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011.
Named entity recognition in tweets: An experimental study. In Proceedings of
the 2011 Conference on Empirical Methods in Natural Language Processing,
pages 1524-1534, Edinburgh, Scotland, UK, July. Association for Computational
Linguistics.
Accessed January 2018.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003.
Introduction to the CoNLL-2003 shared task: Language-independent named entity
recognition. In Proceedings of the Seventh Conference on Natural Language
Learning at HLT-NAACL 2003.
[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating
longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth
corpus. Journal of biomedical informatics, 58:S20-S29. Available at
https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007.
Evaluating the state-of-the-art in automatic de-identification. Journal of the
American Medical Informatics Association, 14(5):550-563. Available at
https://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.
[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005.
BBN pronoun coreference and entity type corpus. Linguistic Data Consortium,
Philadelphia.
[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus,
Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes
release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).
[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer
resources in the classroom. Language Resources and Evaluation, 51(3):581-612.
Available at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/
Accessed: November 2017.