umr4nlp / umr-guidelines


File format specification? #21

Open dan-zeman opened 11 months ago

dan-zeman commented 11 months ago

I am wondering whether there is a formal specification that each UMR file is supposed to follow. The guidelines in this repository give some idea (even if incomplete) about the sentence-level and document-level graphs. But they do not say that there are four annotation blocks for each sentence (tokens, sentence graph, alignment, document graph), that each block is followed by an empty line, that the last one is followed by two empty lines, and so on.

I am writing a validation script for UMR, and it would probably be easier to follow a specification (if one exists) than to guess from the data files what is allowed and what is not.
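To make the question concrete, here is a minimal sketch of the kind of check such a script could do, assuming the layout described above really is the rule: four blocks per sentence, blocks separated by one empty line, sentences separated by two. The function names and the `EXPECTED_BLOCKS` tuple are hypothetical and not taken from any existing validator, and the release data may well deviate from this layout.

```python
import re

# Assumed block order per sentence, taken from the description in this issue.
EXPECTED_BLOCKS = ("tokens", "sentence graph", "alignment", "document graph")


def split_sentences(text):
    """Split file text into sentences, each a list of blocks (lists of lines).

    Assumption: sentences are separated by two or more empty lines and
    blocks by exactly one. Lines holding only spaces are NOT treated as empty.
    """
    text = text.replace("\r\n", "\n").strip("\n")
    units = [u for u in re.split(r"\n{3,}", text) if u.strip()]
    return [[block.split("\n") for block in unit.split("\n\n")] for unit in units]


def check_block_counts(text):
    """Return one message for every sentence that does not have four blocks."""
    problems = []
    for number, blocks in enumerate(split_sentences(text), start=1):
        if len(blocks) != len(EXPECTED_BLOCKS):
            problems.append(
                f"sentence {number}: expected {len(EXPECTED_BLOCKS)} blocks, "
                f"found {len(blocks)}"
            )
    return problems
```

Without a specification, it is unclear whether a sentence with a missing block (for example, no document graph) should be reported as an error or accepted, which is exactly the kind of decision the sketch has to guess at.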

BTW, the data in UMR release 1.0 seem to follow different conventions in different languages, which also differ from what the guidelines say, and occasionally they have issues that are clearly bugs regardless of any specification (such as non-matching brackets).
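Non-matching brackets are the one problem that can be detected without any specification. A minimal sketch follows; `check_brackets` is a hypothetical helper, and it assumes that parentheses inside double-quoted strings should be ignored, as in PENMAN-style notation. Whether UMR files allow such strings is itself something a specification would have to settle.

```python
def check_brackets(graph_text):
    """Return None if parentheses balance, otherwise a short error message.

    Assumption: text between double quotes is a string literal, so any
    parentheses inside it are not counted.
    """
    depth = 0
    in_string = False
    for pos, ch in enumerate(graph_text):
        if ch == '"':
            in_string = not in_string
        elif in_string:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"unexpected ')' at offset {pos}"
    if depth > 0:
        return f"{depth} unclosed '('"
    return None
```

The check would be run on the sentence graph and document graph blocks separately, so that an unbalanced bracket in one block does not hide or cause errors in the next.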

dan-zeman commented 11 months ago

Now I am not sure whether this issue belongs here or in the umr-annotation-tool repo. @jinzhao3611

Nevertheless, I would prefer to see the file specification as part of the UMR guidelines (thus belonging here) rather than as a format supported by one particular tool.