ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

udpipe --tag produces invalid CoNLL-U #15

Closed ptakopysk closed 7 years ago

ptakopysk commented 7 years ago

When udpipe is run without --parse, it sets the HEAD fields to _, which does not conform to the CoNLL-U format specification. IMHO they should be set to 0.

foxik commented 7 years ago

No, underscore is allowed for HEAD and has different semantics than 0 -- while underscore denotes unspecified value, 0 denotes a ROOT as the HEAD.
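
To illustrate the distinction, here is a minimal Python sketch of how a reader of CoNLL-U might interpret the HEAD column. The function name `parse_head` is hypothetical and not part of UDPipe; it only assumes that HEAD is either an underscore or a non-negative integer.

```python
from typing import Optional


def parse_head(field: str) -> Optional[int]:
    """Interpret a HEAD field (hypothetical helper, not a UDPipe API).

    Returns None when the value is unspecified ("_"), 0 when the token
    attaches to ROOT, and the head token's index otherwise.
    """
    if field == "_":
        return None  # unspecified, e.g. the parser was not run
    head = int(field)  # raises ValueError for non-numeric content
    if head < 0:
        raise ValueError(f"HEAD must not be negative: {field!r}")
    return head
```

A consumer can then tell "no parse available" (None) apart from "this token is the root" (0), which writing 0 everywhere would not allow.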

Quoting from the linked format specification:

The fields must additionally meet the following constraints:

  • Fields must not be empty.
  • Fields other than FORM and LEMMA must not contain space characters.
  • Underscore (_) is used to denote unspecified values in all fields except ID. Note that no format-level distinction is made for the rare cases where the FORM or LEMMA is the literal underscore – processing in such cases is application-dependent. Further, in UD treebanks the UPOSTAG, HEAD, and DEPREL columns are not allowed to be left unspecified.

martinpopel commented 7 years ago

The specification says "Underscore (_) is used to denote unspecified values in all fields except ID." so I think underscore in the HEAD column is ok.
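
The quoted constraints can be sketched as a small check on one token line. This is only an illustration, not the official validator: `check_token_line` and `strict_ud` are hypothetical names, and the ten column names are assumed from the CoNLL-U format as it was at the time of this discussion.

```python
from typing import List

# Column names assumed from the CoNLL-U format (ten tab-separated fields).
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def check_token_line(line: str, strict_ud: bool = False) -> List[str]:
    """Return a list of problems found in one token line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} fields, got {len(fields)}"]
    problems = []
    for name, value in zip(COLUMNS, fields):
        if value == "":
            problems.append(f"{name} is empty")
        elif " " in value and name not in ("FORM", "LEMMA"):
            problems.append(f"{name} contains a space")
        elif value == "_" and name == "ID":
            problems.append("ID must not be unspecified")
        elif value == "_" and strict_ud and name in ("UPOSTAG", "HEAD", "DEPREL"):
            # Extra requirement that applies to UD treebanks only.
            problems.append(f"{name} must be specified in UD treebanks")
    return problems
```

With `strict_ud=False`, a tagged-only line whose HEAD is an underscore passes; with `strict_ud=True`, the same line is reported, which matches the difference between the general format and the UD treebank requirement.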

ptakopysk commented 7 years ago

OK, you're right, I hadn't seen that part. Thanks for the clarification.

foxik commented 7 years ago

Actually, Martin Popel and I are responsible for putting that part there :-)