GUM version - Githubissues

amir-zeldes commented 4 years ago

I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but it seems to be a rather old version. We now have much more data, including four more genres and bringing up the total word count to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?

juand-r commented 4 years ago

Thanks for the suggestion.

I can add the newest version of GUM to the table in the README, as well as a copy in the data folder (but in a different format than the older GUM, since I used the CoNLL 2003 format before).

Also, I was thinking I could leave the old version up... in case people need to compare results with work done using that version.

amir-zeldes commented 4 years ago

OK - I had a quick look at the data to see the format you're using, and I noticed a few issues with the data that might cause problems:

The CoNLL 2003 format has just one level of 'flat' BIO encoding, but GUM has nested (N)NER, so the nested entities are lost in the converted data. For example, 'video gamers' should be labeled as a person within 'teams of video gamers' (which is an organization), but the current file only has the outer entity:

teams	B-organization
of	I-organization
video	I-organization
gamers	I-organization

GUM's native formats do encode the nesting, so you could just use the original files, but if you want to represent this using BIO encoding and just one set of tags (i.e. no B-lv1-organization, B-lv2-...), you could consider using the format used in LitBank, with multiple BIO columns: https://github.com/dbamman/litbank/blob/master/entities/tsv/105_persuasion_brat.tsv
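
To make the multi-column idea concrete, here is a minimal sketch of turning nested spans into several BIO columns. It is not GUM's or LitBank's actual tooling: the function name `nested_to_bio_columns`, the `(start, end, label)` span format, and the choice to put outer entities in the leftmost column are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, label).
Span = Tuple[int, int, str]


def nested_to_bio_columns(tokens: List[str], spans: List[Span]) -> List[List[str]]:
    """Return one row per token: [token, column_1_tag, column_2_tag, ...]."""
    # Sort by start, then longest first, so an outer span is placed
    # before any span nested inside it.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    columns: List[List[str]] = []
    for start, end, label in ordered:
        # Reuse the first column in which all of the span's tokens are free.
        for col in columns:
            if all(tag == "O" for tag in col[start:end]):
                break
        else:
            # Every existing column is occupied here: open a new one.
            col = ["O"] * len(tokens)
            columns.append(col)
        col[start] = f"B-{label}"
        for i in range(start + 1, end):
            col[i] = f"I-{label}"
    if not columns:
        columns.append(["O"] * len(tokens))
    return [[tok] + [col[i] for col in columns] for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    tokens = ["teams", "of", "video", "gamers"]
    spans = [(0, 4, "organization"), (2, 4, "person")]
    for row in nested_to_bio_columns(tokens, spans):
        print("\t".join(row))
```

With the example above, 'video' and 'gamers' keep their I-organization tags in the first column and additionally get B-person and I-person in a second column, so the nested person entity is no longer dropped.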
A separate problem is the splits and sentence order:
- Sentences seem to be shuffled, so systems can't use information from the previous/next sentence, even though that may be desirable (e.g. for document-level BERT models). This is especially important since GUM includes entity types for pronouns too, which often can't be resolved with just the current sentence.
- Sentences from the same documents are in both train and test. This means a model can appear to work really well because it knows "Vava'u" is a place in test, but this relatively rare place name is only recognized correctly because train happens to contain "Vava'u" too. The resulting scores are probably unrealistically good compared to performance on unseen data.

GUM has established file splits, which you can find here: https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists

These are the same splits used in the CoNLL shared task on UD parsing, so I'd recommend using the same splits for NER too.
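
For what it's worth, the document-level splitting described above could look roughly like the sketch below. It assumes each sentence is stored together with the ID of the document it came from and that the file lists have already been read into sets of document IDs; the function name `split_by_document` and the IDs `doc_a`/`doc_b` are hypothetical, not anything from the GUM repository.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record: (document ID, sentence), in original corpus order.
Sentence = Tuple[str, str]


def split_by_document(
    sentences: Iterable[Sentence],
    split_docs: Dict[str, Set[str]],
) -> Dict[str, List[Sentence]]:
    """Assign whole documents to splits, keeping sentences in input order."""
    # Build a document -> split lookup and refuse overlapping file lists,
    # since a document in two splits is exactly the leak to avoid.
    doc_to_split: Dict[str, str] = {}
    for split, docs in split_docs.items():
        for doc in docs:
            if doc in doc_to_split:
                raise ValueError(
                    f"{doc} is listed in both {doc_to_split[doc]} and {split}"
                )
            doc_to_split[doc] = split

    out: Dict[str, List[Sentence]] = {split: [] for split in split_docs}
    for doc_id, sentence in sentences:  # no shuffling: order is preserved
        if doc_id not in doc_to_split:
            raise ValueError(f"{doc_id} is not in any file list")
        out[doc_to_split[doc_id]].append((doc_id, sentence))
    return out


if __name__ == "__main__":
    corpus = [("doc_a", "s1"), ("doc_a", "s2"), ("doc_b", "s3")]
    lists = {"train": {"doc_a"}, "test": {"doc_b"}}
    print(split_by_document(corpus, lists))
```

Because sentences are never shuffled and a document can only land in one split, a model evaluated on test has not seen any sentence from the same document during training, and neighbouring sentences stay available as context.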

juand-r commented 4 years ago

Thanks for the LitBank reference. I agree on the benefits of both nested NER annotation and using the surrounding context of sentences (I was only training at the individual sentence level and using BIO annotations when I started this, but am glad people are moving beyond that).

I was not aware that GUM had train/test/dev splits -- thanks for pointing that out. I'll use the established file splits.

I was also thinking of structuring this a bit better, indicating which datasets have nested entity encoding, as well as other relevant details.

amir-zeldes commented 4 years ago

OK, thanks - let me know if you need any input or help figuring out the GUM documentation. The Coptic dataset also has nested (N)NER, in the same conllu tabs+brackets format.

juand-r / entity-recognition-datasets

GUM version #11