Date-time module flips tokens in rare case in the XML format

gucorpling / amalgum

English web corpus with 4M tokens and several annotation types

Date-time module flips tokens in rare case in the XML format #14

Closed amir-zeldes closed 3 years ago

amir-zeldes commented 3 years ago

@nitinvwaran it looks like in the latest Amalgum run there was one document in which XML token order got mixed up at the date-time step compared to the original data and all other formats. The position is here:

https://github.com/gucorpling/amalgum/blob/dev/amalgum/bio/xml/AMALGUM_bio_editions.xml#L926-L932

Do you know what might be making the system do this? This seems to be the only such case in the entire corpus where tokens don't match across formats, so I think it must be something peculiar to this combination of date expressions.
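A cross-format token check along these lines could confirm that this is the only mismatch. This is a minimal sketch, not part of the AMALGUM pipeline: it assumes the XML format has one token per line with the word form in the first tab-separated column and tags on their own lines, and that the CoNLL-U file follows the standard ten-column layout. The function names `xml_tokens`, `conllu_tokens` and `first_mismatch` are hypothetical.

```python
from html import unescape


def xml_tokens(xml_text):
    """Collect word forms from a one-token-per-line XML file (assumed layout)."""
    tokens = []
    for line in xml_text.splitlines():
        line = line.strip()
        # Skip blank lines and lines that consist of a single XML tag.
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        # The word form is assumed to be the first tab-separated column.
        tokens.append(unescape(line.split("\t")[0]))
    return tokens


def conllu_tokens(conllu_text):
    """Collect word forms (FORM column) from CoNLL-U text."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Plain integer IDs only: skips multiword ranges (1-2) and empty nodes (1.1).
        if cols[0].isdigit():
            tokens.append(cols[1])
    return tokens


def first_mismatch(a, b):
    """Return the index of the first differing token, or None if the lists agree."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Running `first_mismatch(xml_tokens(...), conllu_tokens(...))` per document would report the position where the two formats diverge, if any.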

nitinvwaran commented 3 years ago

It's difficult to investigate this without debugging the code, but it looks like AMALGUM has changed a bit, and I'm getting a stanza/stanfordnlp error when trying to run the pipeline (the error I get is: Pretrained file exists but cannot be loaded from pos-dependencies/en_gum.pretrain.pt, due to the following exception: 'PretrainedWordVocab' object has no attribute 'items').

Would you know if the .conllu and .xml files from the 04_DepParser folder are available for this document? If so, I might be able to use those to debug directly. I think the file in the out folder is called autogum_bio_doc250.xml.

amir-zeldes commented 3 years ago

Hmm, that looks like either a version incompatibility in the dependencies themselves (stanza in this case) or stale models from an older version of the pipeline. But if you just need the files, I do have the previous steps locally. I can put them here:

https://corpling.uis.georgetown.edu/amir/download/amalgum/04_DepParser/

nitinvwaran commented 3 years ago

This was a bug: the sentence begins with the date "In 1754", whose text overlaps with the phrase "January 1754" towards the end of the sentence (which gets its own TIMEX3 date tag with the value 1754-01). The date tag that was added in reversed order actually belongs to the first date, which is why the order seemed reversed.
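The overlap described above can be illustrated with a small sketch. It is not the module's actual code: the example sentence is invented, `locate_spans` is a hypothetical helper, and it assumes the date tagger returns expressions in the order they occur in the sentence. The idea is to search for each expression only after the end of the previous match, so that the standalone "1754" and the "1754" inside "January 1754" cannot be confused.

```python
def locate_spans(tokens, expressions):
    """Map date expressions to token spans (start inclusive, end exclusive).

    `expressions` is a list of (surface text, TIMEX value) pairs, assumed to be
    listed in the order they occur in the sentence.
    """
    spans = []
    cursor = 0  # first token index that is still available for matching
    for text, value in expressions:
        words = text.split()
        n = len(words)
        for start in range(cursor, len(tokens) - n + 1):
            if tokens[start:start + n] == words:
                spans.append((start, start + n, value))
                cursor = start + n  # later expressions must start after this one
                break
        else:
            raise ValueError(f"expression not found after token {cursor}: {text!r}")
    return spans


# Invented example sentence with the same kind of overlap.
tokens = "In 1754 he travelled and returned in January 1754 .".split()

# The surface string "1754" alone is ambiguous: it occurs at two positions.
ambiguous = [i for i, tok in enumerate(tokens) if tok == "1754"]

spans = locate_spans(tokens, [("1754", "1754"), ("January 1754", "1754-01")])
print(ambiguous)  # [1, 8]
print(spans)      # [(1, 2, '1754'), (7, 9, '1754-01')]
```

Tracking a cursor keeps the spans in sentence order and non-overlapping, at the cost of relying on the tagger's output order; matching on character offsets, if the tagger provides them, would avoid that assumption.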

I put in a fix and ran a regression on the file; the newly generated file is attached. Apart from the expected changes, no other dates are affected. The initial date 1754 is now marked. The date "January 1754" is not marked, which is an existing limitation of the module: it is hard to build the XML in the correct order when there are multiple dates in an element inside an [s] (here two dates, Oct 1753 and Jan 1754, in the [hi] element), and this case was left out during the initial development.

AMALGUM_bio_editions_v2.zip

amir-zeldes commented 3 years ago

Great, thanks! I wouldn't worry about this too much: this case was very rare, and the upcoming merged .conllu version of the corpus will only contain the fixed data in the conll format.