jonorthwash / ud-annotatrix

GNU General Public License v3.0

storing the CoNLL-U data that can be lost in conversion #72

Open maryszmary opened 7 years ago

maryszmary commented 7 years ago

My thoughts from #40.

I just worry some about data loss when converting back and forth is so easy.

There are 3 things that can be lost when converting from CoNLL-U to CG3:

  1. XPOSTAG (especially if there is a UPOSTAG). I don't know how to solve this problem well.
  2. Attribute names in FEATS. Right now the conversion doesn't remove them, but, as @ftyers said, "If the features in conllu are feat=val pairs, then only the val should be shown in CG mode." But then the attribute names will be lost.
  3. There is no place for MISC in the CG3 format. I don't know how these data can be represented in CG3.

This led me to the conclusion that when the corpus's native format is CoNLL-U, the interface should, when the sentence is viewed in CG3, store a copy of the CoNLL-U sentence with all these data. The same should apply when the sentence is viewed as plain text.
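
To make the idea concrete, here is a minimal sketch in Python of keeping the dropped data next to the CG3 view. It is not the annotatrix or notatrix implementation: the function names (`parse_token`, `to_cg3_view`, `restore_feats`) and the "sidecar" dict are hypothetical, and the CG3 rendering is only a rough approximation of the format.

```python
# Hypothetical sketch: none of these names come from ud-annotatrix or notatrix.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_token(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    return dict(zip(CONLLU_FIELDS, line.rstrip("\n").split("\t")))


def to_cg3_view(token):
    """Return a lossy CG3-style rendering plus a sidecar of what it drops."""
    feats = [] if token["feats"] == "_" else token["feats"].split("|")
    names = [f.partition("=")[0] for f in feats]
    values = [f.partition("=")[2] for f in feats]

    # Only the feature values are shown in the CG3-style view.
    tags = [token["upos"]] + values
    tags.append("@" + token["deprel"])
    tags.append("#{}->{}".format(token["id"], token["head"]))
    view = '"<{}>"\n\t"{}" {}'.format(token["form"], token["lemma"], " ".join(tags))

    # Everything the view has no slot for is kept here instead.
    sidecar = {"xpos": token["xpos"], "feat_names": names, "misc": token["misc"]}
    return view, sidecar


def restore_feats(values, feat_names):
    """Re-attach the stored attribute names to the values shown in the view."""
    pairs = ["{}={}".format(n, v) for n, v in zip(feat_names, values)]
    return "|".join(pairs) or "_"


line = "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t_\tSpaceAfter=No"
token = parse_token(line)
view, sidecar = to_cg3_view(token)
print(view)
print(sidecar)
print(restore_feats(["Nom", "Plur"], sidecar["feat_names"]))
```

The sidecar only works as long as the number and order of feature values is unchanged in the CG3 view; if the user adds or removes a tag there, the names can no longer be matched up by position, which is one reason to keep the whole original CoNLL-U sentence instead.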

ftyers commented 6 years ago

Here is an example:

[screenshot: peek 2018-07-17 20-23]

keggsmurph21 commented 6 years ago

The notatrix backend keeps track of this stuff, but it needs a bugfix for the above issues, and needs testing for other stuff.

keggsmurph21 commented 6 years ago

#90 refers to the issues mentioned in the screenshot, the ones that need a bugfix.