Open dan-zeman opened 11 months ago
I am not sure whether this issue belongs here or in the umr-annotation-tool repo. @jinzhao3611
Nevertheless, I would prefer to see the file specification as part of the UMR guidelines (thus belonging here) rather than as a format supported by one particular tool.
I am wondering whether there is a formal specification that each UMR file is supposed to follow. The guidelines in this repository give some idea (even if incomplete) of the sentence-level graphs and document-level graphs. But they do not say that there are four annotation blocks for each sentence (tokens, sentence graph, alignment, document graph), that each block is followed by an empty line, that the last one is followed by two empty lines, and so on.
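To make the layout I am describing concrete, here is a minimal Python sketch of the check I have in mind. It assumes only what is stated above: four blocks per sentence, one empty line after each block, and two empty lines after the last one. The function name `check_block_layout` and the placeholder block contents are made up for illustration; they are not taken from any existing tool or from the real data.

```python
def check_block_layout(text, expected_blocks=4):
    """Return (sentence_index, block_count) for every sentence whose
    number of blocks differs from expected_blocks.

    Assumption: sentences are separated by two empty lines and blocks
    within a sentence by one empty line, as observed in the data files.
    """
    problems = []
    # Two consecutive empty lines show up as three newlines in a row.
    sentences = [s for s in text.split("\n\n\n") if s.strip()]
    for index, sentence in enumerate(sentences, start=1):
        # A single empty line (two newlines) separates blocks.
        blocks = [b for b in sentence.split("\n\n") if b.strip()]
        if len(blocks) != expected_blocks:
            problems.append((index, len(blocks)))
    return problems


# Placeholder content only; real blocks are multi-line annotations.
sample = (
    "tokens\n\n(s1 / x)\n\nalign\n\n(d / doc)\n\n\n"
    "tokens\n\n(s2 / y)\n\nalign\n\n\n"
)
print(check_block_layout(sample))
```

The second sentence in the sample lacks a document graph block, so it is the one reported. Without a specification, I cannot tell whether such a case is an error or an allowed variant.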
I am writing a validation script for UMR, and it would probably be easier to follow a specification (if one exists) than to guess from the data files what is allowed and what is not.
BTW, the data in UMR release 1.0 seem to follow different conventions in different languages, which also differ from what the guidelines say, and occasionally they have issues that are clear bugs regardless of any specification (such as non-matching brackets).
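The bracket problem, at least, can be detected without any specification. A rough sketch, assuming the graphs use plain round parentheses and ignoring the possibility of parentheses inside quoted string values (the function name `find_unbalanced_parens` is hypothetical):

```python
def find_unbalanced_parens(graph_text):
    """Return the sorted character offsets of parentheses without a partner.

    Assumption: every "(" and ")" in the text is structural; parentheses
    inside quoted strings are not treated specially.
    """
    open_positions = []   # offsets of "(" still waiting for a ")"
    unmatched_close = []  # offsets of ")" with no preceding "("
    for pos, ch in enumerate(graph_text):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if open_positions:
                open_positions.pop()
            else:
                unmatched_close.append(pos)
    return sorted(unmatched_close + open_positions)


print(find_unbalanced_parens("(a / b (c / d)"))
```

An empty list means the brackets in that block are balanced; any other result points at the offending offsets.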