Closed dan-zeman closed 3 years ago
My first reaction was: yes, this is on purpose. The CoNLL-U specification says:
"tools compatible with the CoNLL-U format should carry these lines over into their output (unless specifically designed to process them in some way)".
There is currently no requirement on preserving the order nor any mention of global.columns. Moreover, Udapi is specifically designed to process newdoc, newpar, sent_id, text and json_ and normalize the ordering of these standardized attributes according to the order provided in the examples in the specification (except for json_, which is not part of the CoNLL-U specification).
Udapi does not support CoNLL-U Plus yet.
My second reaction is: OK, let's fix this, but I see two ways:
1) Udapi will load global.columns into a special attribute and will make sure it is always printed as the first comment. This can be the first step towards supporting CoNLL-U Plus (perhaps with a special reader because read.Conllu is heavily optimized for speed). The next step will be storing source_sent_id in a special attribute and printing it after newpar and before sent_id.
2) Once the CoNLL-U specification is changed to clarify the requirements on the ordering of comment lines and whether tools must/should always preserve the original ordering or are allowed/required to normalize the ordering of the standardized columns, I can update the Udapi implementation. I would perhaps keep sent_id and the other attributes within tree.comments, but I would update the stored sent_id in write.Conllu if tree.sent_id was changed.
Which way do you prefer, @dan-zeman?
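To make option 2 more concrete, here is a minimal sketch of rewriting sent_id in place so that the original comment order survives. It works on plain lists of comment lines and does not use Udapi; the function name update_sent_id and the choice to append a missing sent_id at the end are assumptions for illustration, not the actual write.Conllu behavior.

```python
def update_sent_id(comment_lines, new_sent_id):
    """Return a copy of comment_lines with the sent_id value replaced in place.

    Hypothetical helper: comment_lines are raw lines such as "# sent_id = s1".
    If no sent_id comment exists, one is appended at the end (an assumption).
    """
    updated = []
    found = False
    for line in comment_lines:
        body = line.lstrip("#").strip()
        key, sep, _ = body.partition("=")
        if sep and key.strip() == "sent_id":
            # Keep the position of the original line, change only the value.
            updated.append(f"# sent_id = {new_sent_id}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"# sent_id = {new_sent_id}")
    return updated


comments = ["# global.columns = ID FORM", "# text = Hello", "# sent_id = s1", "# note"]
print(update_sent_id(comments, "s2"))
```

With this approach global.columns stays first and text stays before sent_id, because no line ever moves.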
There is currently no requirement on preserving the order nor any mention of global.columns.
This is true and actually it seems like a good idea to mention global.columns there even though it is not part of the CoNLL-U specification. Ordering of the comments is a different matter. The CoNLL-U specification does not specify it and I'm not sure it should; but not changing their order when processing a CoNLL-U file is useful because it eliminates spurious differences when one checks the output.
My second reaction is: OK, let's fix this, but I see two ways
My preference lies closer to your 2. I would see keeping the input order of the comments as the default. Then I can imagine a method that would be called on demand (or it could be a block that one optionally inserts in their scenario) that would normalize the order of the comments. Either with a list of comment keywords provided by the caller, or with a default order (where the special comment types I'm currently aware of would be in the following order and everything else would come in the original order after them: global.columns, newdoc, newpar, sent_id, text).
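The on-demand normalization could look roughly like the sketch below. It is plain Python, not a Udapi block; the names DEFAULT_ORDER, comment_key and normalize_comment_order are made up for illustration, and the keyword detection is deliberately simplistic (it assumes comments of the form "# key = value" or a bare "# newpar").

```python
DEFAULT_ORDER = ("global.columns", "newdoc", "newpar", "sent_id", "text")


def comment_key(line):
    """Return the keyword of a comment line, or "" if none is recognizable."""
    body = line.lstrip("#").strip()
    head, sep, _ = body.partition("=")
    tokens = head.split()
    if not tokens:
        return ""
    if sep:
        # "sent_id = 1" -> "sent_id"; "newdoc id = d1" -> "newdoc"
        return tokens[0]
    # Without "=", only a bare keyword such as "newpar" counts.
    return tokens[0] if len(tokens) == 1 else ""


def normalize_comment_order(lines, order=DEFAULT_ORDER):
    """Put comments listed in `order` first; keep all others in input order."""
    rank = {key: i for i, key in enumerate(order)}
    known = [line for line in lines if comment_key(line) in rank]
    other = [line for line in lines if comment_key(line) not in rank]
    known.sort(key=lambda line: rank[comment_key(line)])  # sort is stable
    return known + other


lines = ["# sent_id = 1", "# note = x", "# text = Hi", "# newpar",
         "# global.columns = ID FORM", "# custom"]
print(normalize_comment_order(lines))
```

Passing a different order tuple gives the caller-provided variant; not calling the function at all gives the proposed default of keeping the input order.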
Udapi sometimes does not preserve the original order of the sentence-level comments. Specifically, it makes sent_id the first, text the second, and any unrecognized comments go after those two. (I actually assume that newdoc and newpar, if present, will precede sent_id, but I have not tested it.)
While this reordering somehow makes sense in the general case (although I'd prefer an option to preserve the original order of all columns), there are comments that pertain to larger segments than sentences, and they should not be shifted after sent_id. Besides the document and paragraph boundaries mentioned above, this also involves the list of columns defined in the CoNLL-U Plus format (the global.columns line). This must be (stay) the first line of a CoNLL-U Plus file. A valid CoNLL-U file becomes a valid CoNLL-U Plus file by making sure that it starts with this line; as a matter of fact, some existing UD treebanks have the line (e.g., UD_French-Spoken). Unfortunately, when such a file is modified by Udapi, the header line is shifted and the file is no longer valid CoNLL-U Plus.
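A quick way to spot the problem in an output file is to check whether its first line is still the global.columns comment. The sketch below is a plain-Python check, not part of Udapi or of any validator; the function name and the short column list in the sample data are illustrative assumptions.

```python
def has_global_columns_header(conllu_text):
    """Return True if the first line is a '# global.columns = ...' comment."""
    first_line = conllu_text.split("\n", 1)[0]
    if not first_line.startswith("#"):
        return False
    body = first_line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    return bool(sep) and key.strip() == "global.columns" and bool(value.strip())


# Illustrative data: the same sentence before and after the header was shifted.
original = "# global.columns = ID FORM LEMMA\n# sent_id = 1\n1\tHi\thi\n"
shifted = "# sent_id = 1\n# global.columns = ID FORM LEMMA\n1\tHi\thi\n"

print(has_global_columns_header(original))
print(has_global_columns_header(shifted))
```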