juand-r / entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
MIT License

GUM version #11

Open amir-zeldes opened 4 years ago

amir-zeldes commented 4 years ago

I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but the copy there seems to be a rather old version. We now have much more data, including four more genres, which brings the total to about 130,000 tokens annotated for nested entities, both named and non-named. Would you like to update the data to include the latest version?

juand-r commented 4 years ago

Thanks for the suggestion.

I can add the newest version of GUM to the table in the README, as well as a copy in the data folder (but in a different format than the older GUM, since I used the CoNLL 2003 format before).

Also, I was thinking I could leave the old version up... in case people need to compare results with work done using that version.

amir-zeldes commented 4 years ago

OK - I had a quick look at the data to see the format you're using, and I noticed a few issues with the data that might cause problems:

teams    B-organization
of       I-organization
video    I-organization
gamers   I-organization
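
For readers unfamiliar with the tagging scheme in the excerpt above, here is a minimal sketch of how BIO tags collapse into entity spans. The function name `bio_to_spans` and the span tuple layout are illustrative assumptions, not part of this repository or of GUM's tooling.

```python
def bio_to_spans(rows):
    """Collapse (token, tag) pairs in BIO encoding into
    (label, start, end, text) tuples, with `end` exclusive.

    Hypothetical helper for illustration only.
    """
    spans = []
    start, label = None, None
    for i, (token, tag) in enumerate(rows):
        # An I- tag with the same label as the open span continues it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the currently open span, if there is one.
        if label is not None:
            text = " ".join(t for t, _ in rows[start:i])
            spans.append((label, start, i, text))
            start, label = None, None
        # B- opens a new span; a stray I- is also treated as a span start.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the input.
    if label is not None:
        text = " ".join(t for t, _ in rows[start:])
        spans.append((label, start, len(rows), text))
    return spans


lines = """teams\tB-organization
of\tI-organization
video\tI-organization
gamers\tI-organization"""

# Each line holds a token and its tag separated by whitespace.
rows = [tuple(line.split()) for line in lines.splitlines()]
spans = bio_to_spans(rows)
print(spans)
```

Note that a flat BIO column like this can hold only one label per token, so it cannot represent an entity nested inside another one.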

GUM has established file splits, which you can find here: https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists

These are the same splits used in the CoNLL shared task on UD parsing, so I'd recommend using the same splits for NER too.
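
A rough sketch of applying established splits at the document level is below. It assumes each split is available as a list of document names; the actual layout of the file lists in the linked directory is not shown here, and the function name `assign_splits` and the document names are hypothetical.

```python
def assign_splits(doc_names, split_lists):
    """Group documents by split using predefined name lists.

    split_lists maps a split name (e.g. "train") to an iterable of
    document names. Returns (docs_by_split, unassigned).
    Hypothetical helper for illustration only.
    """
    doc_to_split = {}
    for split, names in split_lists.items():
        for name in names:
            # A document listed in two splits would leak test data.
            if name in doc_to_split:
                raise ValueError(f"{name} is listed in more than one split")
            doc_to_split[name] = split

    docs_by_split = {split: [] for split in split_lists}
    unassigned = []
    for doc in doc_names:
        split = doc_to_split.get(doc)
        if split is None:
            unassigned.append(doc)  # not covered by any list
        else:
            docs_by_split[split].append(doc)
    return docs_by_split, unassigned


# Placeholder document names, not real GUM file names.
split_lists = {
    "train": ["doc_a", "doc_b"],
    "dev": ["doc_c"],
    "test": ["doc_d"],
}
docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
docs_by_split, unassigned = assign_splits(docs, split_lists)
print(docs_by_split, unassigned)
```

Splitting by whole documents rather than by shuffled sentences keeps every sentence of a document in the same partition, which matters when models use surrounding context.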

juand-r commented 4 years ago

Thanks for the LitBank reference. I agree on the benefits of both nested NER annotation and using the context of surrounding sentences. When I started this I was only training at the individual sentence level with BIO annotations, but I am glad people are moving beyond that.

I was not aware that GUM had train/test/dev splits -- thanks for pointing that out. I'll use the established file splits.

I was also thinking of structuring this list a bit better, indicating which datasets have nested entity encoding, as well as other relevant details.

amir-zeldes commented 4 years ago

OK, thanks - let me know if you need any input or help figuring out the GUM documentation. The Coptic dataset also has nested (N)NER, in the same conllu tabs+brackets format.