Evaluation script that unpacks lextag into remaining STREUSLE columns

nschneid commented 5 years ago

Re: #40, we need a script that takes lextags (full tags, one per token) output by a system and parses them to extract MWE groupings.

Lextags are the 19th and final column in the .conllulex format. Columns 1-10 are UD. Columns 11-18 can be filled in based on UD+lextags.

nschneid commented 5 years ago

Input: .conllulex format except columns 11-18 are blank (not underscores; completely blank)
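A minimal sketch of reading such input, assuming the file is tab-separated like CoNLL-U, that the blank columns are still present as empty fields, and that comment lines start with `#`. The function name `read_lextag_rows` and the returned dict keys are hypothetical, not taken from the repository.

```python
def read_lextag_rows(lines):
    """Group token rows into sentences, keeping UD columns 1-10 and the lextag."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")  # keep tabs: blank columns are empty fields
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment/metadata line
        cols = line.split("\t")
        if len(cols) != 19:
            raise ValueError(f"expected 19 columns, got {len(cols)}: {line!r}")
        # Columns 11-18 (indices 10-17) are expected to be blank in this input.
        current.append({"ud": cols[:10], "lextag": cols[18]})
    if current:
        sentences.append(current)
    return sentences
```

Called on an open file object (or any iterable of lines), this returns one list of token dicts per sentence.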

I think the easiest way to implement this will be to adapt streuseval.py so that instead of VERIFYING that lextags are consistent with columns 11-18, it parses lextags and then populates columns 11-18 in JSON.

Specifically, it needs to:

1. parse each lextag into mwetag + lexcat + supersenses
2. parse mwetag sequences into links
3. form strong and weak groups (token sets) out of links
4. number the groups (first strong, then weak) and the tokens within the groups
5. look up lemmas for the tokens in each group
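The steps above can be sketched roughly as follows. This is an illustration, not the code on any branch. It assumes lextags are hyphen-separated (`mwetag-lexcat-supersenses`, with supersenses joined by `|`), that mwetags come from `O o B b I_ i_ I~ i~` with lowercase tags sitting inside a gap, and that weak groups continue the numbering after strong groups. All function names, the 0-based token indices, and the example sentence are hypothetical.

```python
def parse_lextag(lextag):
    """Split e.g. 'B-V.VPC.full-v.motion' into (mwetag, lexcat, supersenses)."""
    parts = lextag.split("-", 2)  # assumes no hyphen inside mwetag or lexcat
    mwetag = parts[0]
    lexcat = parts[1] if len(parts) > 1 else None
    supersenses = parts[2].split("|") if len(parts) > 2 else []
    return mwetag, lexcat, supersenses


def mwetags_to_links(mwetags):
    """Return (strong_links, weak_links) as lists of (previous, current) indices."""
    strong, weak = [], []
    last = {False: None, True: None}  # last MWE token outside / inside a gap
    for i, tag in enumerate(mwetags):
        in_gap = tag[0].islower()
        if tag[0] in "Ii":
            prev = last[in_gap]
            if prev is None or tag[1:] not in ("_", "~"):
                raise ValueError(f"token {i}: invalid tag {tag!r} in this context")
            (strong if tag[1:] == "_" else weak).append((prev, i))
        last[in_gap] = None if tag[0] in "Oo" else i
        if not in_gap:
            last[True] = None  # leaving a gap ends any expression inside it
    return strong, weak


def links_to_groups(links):
    """Chain links into groups; each group is an ascending list of token indices."""
    group_of, groups = {}, []
    for a, b in sorted(links, key=lambda link: link[1]):
        if a not in group_of:
            group_of[a] = [a]
            groups.append(group_of[a])
        group_of[a].append(b)
        group_of[b] = group_of[a]
    return sorted(groups)  # order by first token (an arbitrary but stable choice)


def number_groups(strong_links, weak_links):
    """Number strong groups first, then weak groups, starting from 1."""
    strong = links_to_groups(strong_links)
    merged = links_to_groups(strong_links + weak_links)
    weak = [g for g in merged if g not in strong]  # groups that use a weak link
    smwes = {n: g for n, g in enumerate(strong, start=1)}
    wmwes = {n: g for n, g in enumerate(weak, start=len(strong) + 1)}
    return smwes, wmwes


def group_lemma(group, lemmas):
    """Join the lemmas of a group's tokens with spaces."""
    return " ".join(lemmas[i] for i in group)


# Made-up example: "My dad picked me up ."
lextags = ["O-PRON.POSS-p.SocialRel|p.Gestalt", "O-N-n.PERSON",
           "B-V.VPC.full-v.motion", "o-PRON", "I_", "O-PUNCT"]
lemmas = ["my", "dad", "pick", "I", "up", "."]
mwetags = [parse_lextag(t)[0] for t in lextags]
smwes, wmwes = number_groups(*mwetags_to_links(mwetags))
print(smwes, wmwes)                  # {1: [2, 4]} {}
print(group_lemma(smwes[1], lemmas))  # pick up
```

Because every `I`/`i` tag attaches to exactly one earlier token, groups can be built by appending to the predecessor's group without a general union-find.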

If we want the output as .conllulex, converting JSON to .conllulex could be a separate script.

nschneid commented 5 years ago

@danielhers I believe I have this working on the lextag-unpack branch. When reconstructing from the gold lextags I can't 100% match the original data file due to an arbitrary numbering issue (#42), but the streuseval score of the original vs. reconstructed is 100%, so there should not be any errors in the reconstruction. Hopefully this means the script is bug-free.
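One way to check a reconstruction while ignoring group numbers is to compare the groupings as sets of token sets. This is a sketch only: it assumes the mismatch is limited to which integer each group receives, and the function name and the dict-of-lists representation are hypothetical.

```python
def same_groupings(groups_a, groups_b):
    """True if two {number: [token indices]} dicts hold the same token sets."""
    as_sets_a = {frozenset(tokens) for tokens in groups_a.values()}
    as_sets_b = {frozenset(tokens) for tokens in groups_b.values()}
    return as_sets_a == as_sets_b


# Same two groups, numbered in a different order.
original = {1: [2, 4], 2: [7, 8]}
reconstructed = {1: [7, 8], 2: [2, 4]}
print(same_groupings(original, reconstructed))
```

Strong and weak groupings would be compared separately, so that a strong group is never matched against a weak one.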

nert-nlp / streusle

Evaluation script that unpacks lextag into remaining STREUSLE columns #41