Open amir-zeldes opened 4 years ago
Thanks for the suggestion.
I can add the newest version of GUM to the table in the README, as well as a copy in the data folder (but in a different format than the older GUM, since I used the CoNLL 2003 format before).
Also, I was thinking I could leave the old version up... in case people need to compare results with work done using that version.
OK - I had a quick look at the data to see the format you're using, and I noticed a few issues with the data that might cause problems:
person within 'teams of video gamers' (which are organization):
teams B-organization
of I-organization
video I-organization
gamers I-organization
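To illustrate why a single BIO column loses nested mentions like the one above, here is a minimal sketch. The span offsets, the choice of 'video gamers' as the nested person mention, and the helper name `spans_to_flat_bio` are assumptions made for illustration; they are not taken from the GUM data or from this repository's conversion code.

```python
from typing import List, Tuple

# (start, end_exclusive, label) over token indices; hypothetical representation
Span = Tuple[int, int, str]

def spans_to_flat_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flatten possibly nested spans into one BIO tag per token.

    Longest spans are written first; any span overlapping an already
    tagged token is dropped, which is how nested mentions get lost.
    """
    tags = ["O"] * n_tokens
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if any(tag != "O" for tag in tags[start:end]):
            continue  # nested or overlapping span cannot be represented
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["teams", "of", "video", "gamers"]
# Assumption: the nested person mention covers "video gamers".
spans = [(0, 4, "organization"), (2, 4, "person")]
flat = spans_to_flat_bio(len(tokens), spans)
for token, tag in zip(tokens, flat):
    print(token, tag)
```

Only the outer organization span survives, so the person annotation is silently discarded in a flat BIO file.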
GUM has established file splits, which you can find here: https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists
These are the same splits used in the CoNLL shared task on UD parsing, so I'd recommend using the same splits for NER too.
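A rough sketch of applying document-level splits like these: every sentence of a document goes to the split that lists the document. The document names and the function `build_split_index` are hypothetical, and the sketch assumes the three lists have already been read into Python lists; it makes no claim about the actual layout of the files at the link above.

```python
from typing import Dict, List

def build_split_index(train: List[str], dev: List[str], test: List[str]) -> Dict[str, str]:
    """Map each document name to its split, rejecting documents listed twice."""
    index: Dict[str, str] = {}
    for split_name, docs in (("train", train), ("dev", dev), ("test", test)):
        for doc in docs:
            if doc in index:
                raise ValueError(f"{doc} appears in both {index[doc]} and {split_name}")
            index[doc] = split_name
    return index

# Hypothetical document names, for illustration only.
split_index = build_split_index(
    train=["doc_a", "doc_b"],
    dev=["doc_c"],
    test=["doc_d"],
)
print(split_index["doc_c"])
```

Splitting by document rather than by sentence keeps all sentences of a text in the same partition, which matters when models use surrounding context.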
Thanks for the LitBank reference. I agree on the benefits of both nested NER annotation and using the surrounding context of sentences (I was only training at the individual sentence level and using BIO annotations when I started this, but am glad people are moving beyond that).
I was not aware that GUM had train/test/dev splits -- thanks for pointing that out. I'll use the established file splits.
I was also thinking of structuring this a bit better, indicating which datasets have nested entity encoding, as well as other relevant details.
OK, thanks - let me know if you need any input or help figuring out the GUM documentation. The Coptic dataset also has nested (N)NER, in the same conllu tabs+brackets format.
I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but it seems to be a rather old version. We now have much more data, including four more genres, bringing the total to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?