udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

The global.columns comment should not be moved after other comments #80

Closed dan-zeman closed 3 years ago

dan-zeman commented 3 years ago

Udapi sometimes does not preserve the original order of the sentence-level comments. Specifically, it makes sent_id the first, text the second, and any unrecognized comments go after those two. (I actually assume that newdoc and newpar, if present, will precede sent_id, but I have not tested it.)

While this reordering makes some sense in the general case (although I'd prefer an option to preserve the original order of all comments), there are comments that pertain to larger segments than sentences, and they should not be shifted after sent_id. Besides the document and paragraph boundaries mentioned above, this also involves the list of columns defined in the CoNLL-U Plus format:

# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC

This must be (and stay) the first line of a CoNLL-U Plus file. A valid CoNLL-U file becomes a valid CoNLL-U Plus file simply by starting with this line; in fact, some existing UD treebanks have it (e.g., UD_French-Spoken). Unfortunately, when such a file is modified by Udapi, the header line is shifted after other comments and the file is no longer valid CoNLL-U Plus.
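A minimal sketch of the check implied here, in plain Python and independent of Udapi. The function name `starts_with_global_columns` is hypothetical, and the exact spacing of the prefix is an assumption based on the example line above; the `reordered` string only mimics the reordering described in this report.

```python
# Assumed exact form of the header prefix, copied from the example line above.
GLOBAL_COLUMNS_PREFIX = "# global.columns ="


def starts_with_global_columns(conllu_text):
    """Return True if the very first line declares global.columns."""
    first_line = conllu_text.split("\n", 1)[0]
    return first_line.startswith(GLOBAL_COLUMNS_PREFIX)


# Header on the first line, as CoNLL-U Plus requires.
original = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC\n"
    "# sent_id = 1\n"
    "# text = Hi\n"
    "1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

# Same sentence after sent_id and text have been moved to the front.
reordered = (
    "# sent_id = 1\n"
    "# text = Hi\n"
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC\n"
    "1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
```

Calling `starts_with_global_columns(original)` returns `True`, while the same call on `reordered` returns `False`, which is the loss of validity described above.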

martinpopel commented 3 years ago

> I actually assume that newdoc and newpar, if present, will precede sent_id, but I have not tested it.

Yes

> Udapi sometimes does not preserve the original order of the sentence-level comments.

My first reaction was: yes, this is on purpose. The CoNLL-U specification says: "tools compatible with the CoNLL-U format should carry these lines over into their output (unless specifically designed to process them in some way)". There is currently no requirement on preserving the order nor any mention of global.columns. Moreover, Udapi is specifically designed to process newdoc, newpar, sent_id, text and json_ and normalize the ordering of these standardized attributes according to the order provided in the examples in the specification (except for json_ which is not part of the CoNLL-U specification). Udapi does not support CoNLL-U Plus yet.

My second reaction is: OK, let's fix this, but I see two ways:

1) Udapi will load global.columns to a special attribute and will make sure it is always printed as the first comment. This can be the first step towards supporting CoNLL-U Plus (perhaps with a special reader because read.Conllu is heavily optimized for speed). The next step will be storing source_sent_id in a special attribute and printing it after newpar and before sent_id.

2) Once the CoNLL-U specification is changed to clarify the requirements on the ordering of comment lines and whether tools must/should always preserve the original ordering or are allowed/required to normalize the ordering of the standardized comments, I can update the Udapi implementation. I would perhaps keep sent_id and the other attributes within tree.comments, but I would update it in write.Conllu if tree.sent_id was changed.

Which way do you prefer, @dan-zeman?
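For illustration only, the first option could look roughly like the plain-Python sketch below. It does not use Udapi's actual reader or writer; `extract_global_columns` and `write_document` are hypothetical helpers, and a document is assumed to be a list of `(comment_lines, token_lines)` pairs.

```python
import re

# Matches a "# global.columns = ..." comment line (spacing is flexible here).
GLOBAL_COLUMNS_RE = re.compile(r"#\s*global\.columns\s*=")


def extract_global_columns(sentences):
    """Move the global.columns comment out of the per-sentence comments.

    Returns (header_or_None, sentences_without_header). Only the first
    global.columns line is kept; any later duplicates are dropped.
    """
    header = None
    cleaned = []
    for comments, tokens in sentences:
        kept = []
        for line in comments:
            if GLOBAL_COLUMNS_RE.match(line):
                if header is None:
                    header = line
            else:
                kept.append(line)
        cleaned.append((kept, tokens))
    return header, cleaned


def write_document(header, sentences):
    """Serialize the document with the header (if any) as the first line."""
    out = []
    if header is not None:
        out.append(header)
    for comments, tokens in sentences:
        out.extend(comments)
        out.extend(tokens)
        out.append("")  # a blank line terminates each sentence
    return "\n".join(out) + "\n"
```

Keeping the header in its own variable, rather than among the sentence comments, means that no later reordering of sentence-level comments can displace it.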

dan-zeman commented 3 years ago

> There is currently no requirement on preserving the order nor any mention of global.columns.

This is true and actually it seems like a good idea to mention global.columns there even though it is not part of the CoNLL-U specification. Ordering of the comments is a different matter. The CoNLL-U specification does not specify it and I'm not sure it should; but not changing their order when processing a CoNLL-U file is useful because it eliminates spurious differences when one checks the output.

> My second reaction is: OK, let's fix this, but I see two ways

My preference lies closer to your option 2. I would keep the input order of the comments as the default. I can then imagine a method called on demand (or a block that one optionally inserts into their scenario) that normalizes the order of the comments, following either a list of comment keywords provided by the caller or a default order. In the default order, the special comment types I'm currently aware of would come first, in this order: global.columns, newdoc, newpar, sent_id, text; everything else would follow in its original order.
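Such an on-demand normalization could be sketched as follows. This is plain Python over raw comment lines, not Udapi's API; `comment_key`, `normalize_comment_order` and `DEFAULT_ORDER` are hypothetical names, and the key of a comment is assumed to be its first word before any `=`.

```python
# Default order of the special comment types listed above.
DEFAULT_ORDER = ("global.columns", "newdoc", "newpar", "sent_id", "text")


def comment_key(line):
    """Return the keyword of a comment line, e.g. 'sent_id' for '# sent_id = 1'."""
    body = line.lstrip("#").strip()
    words = body.split("=", 1)[0].split()
    return words[0] if words else ""


def normalize_comment_order(comments, order=DEFAULT_ORDER):
    """Put recognized comments first, in the given order.

    Unrecognized comments all share the lowest priority, and sorted() is
    stable, so they follow the recognized ones in their original order.
    """
    rank = {key: i for i, key in enumerate(order)}
    return sorted(comments, key=lambda line: rank.get(comment_key(line), len(order)))


comments = [
    "# note = x",
    "# text = Hi",
    "# sent_id = 1",
    "# global.columns = ID FORM",
    "# other",
]
normalized = normalize_comment_order(comments)
```

Because nothing is reordered unless `normalize_comment_order` is called, the input order remains the default, and a caller can pass a custom `order` tuple instead of `DEFAULT_ORDER`.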