jonorthwash / ud-annotatrix

GNU General Public License v3.0

Annotatrix does not clean conllu properly #323

Open ftyers opened 6 years ago

ftyers commented 6 years ago

When I try to paste in this CoNLL-U,

# sent_id = AnoshkinV_ValskenjGudok_1936:772
# text = ― А тонь мень тев киян мон?..
# text[eng] = ― And yours is what business who am I?..
1       ―       ―       PUNCT   PUNCT   _       3       punct   _       _
2       А       а       CCONJ   CC      _       3       cc      _       _
3-4     тонь    _       _       _       _       _       _       _       _
3       тон     тон     PRON    Pron|Pers|Sg2|Gen       Case=Gen|Number=Sing|Person=2|PronType=Prs      0       root    _       _       
4       ь       ь       AUX     Clitic=Cop|Prs|ScSg3    Number=Sing|Person=3|Tense=Pres|VerbType=Cop    3       cop     _       _       
5       мень    мень    ADJ     A|Interr|Der/GenAttr|A  Derivation=GenAttr|PronType=Int 6       amod    _       _
6       тев     тев     NOUN    N|Sg|Nom|Indef  Case=Nom|Definite=Ind|Number=Sing       3       nsubj   _       _
7-8     киян    _       _       _       _       _       _       _       _
7       ки      кие     PRON    Pron|Interr|Hum Animacy=Hum|PronType=Int        3       csubj   _       _       
8       ян      ь       AUX     Clitic=Cop|Prs|ScSg1    Number=Sing|Person=1|Tense=Pres|VerbType=Cop    7       cop     _       _       
9       мон     мон     PRON    Pron|Pers|Sg1|Nom       Case=Nom|Number=Sing|Person=1|PronType=Prs      7       nsubj   _       SpaceAfter=No
10      ?..     ?..     PUNCT   CLB     _       3       punct   _       _

It doesn't give any output. This worked before, try here.

jonorthwash commented 6 years ago

It gives the mysterious error «cannot locate token with serial index "amod"», but I believe this isn't parsing because it uses spaces instead of tabs. It should be fairly easy to have it support spaces instead of tabs, but for now a simple search-and-replace should fix it?

ftyers commented 6 years ago

@jonorthwash it's not quite so easy, because you need to be careful with tokens that contain spaces. Yes, a search and replace is always possible, but I implemented this feature, so it's a bit frustrating that it got lost. A big use case of Annotatrix is copy/pasting between things (email, GitHub issues, files, etc.), where tabs can easily get lost.
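
To make the trade-off concrete, here is a minimal sketch of one way to recover the tabs while being careful about tokens that contain spaces. It is not the original implementation and is not taken from ud-annotatrix or notatrix; `normalizeConlluLine`, `normalizeConllu`, and `SEPARATORS` are hypothetical names. The assumption is that a line can be trusted only if some separator splits it into exactly the 10 CoNLL-U columns: tabs are tried first, then runs of two or more spaces (which leaves single spaces inside a token alone), then any whitespace.

```javascript
// Hypothetical helper, not part of ud-annotatrix or notatrix.
// Separators are tried from most to least conservative.
const SEPARATORS = [/\t/, / {2,}/, /\s+/];

function normalizeConlluLine(line) {
  const trimmed = line.trim();
  // Leave blank lines and comment lines (# sent_id, # text, ...) untouched.
  if (trimmed === "" || trimmed.startsWith("#")) {
    return line;
  }
  for (const separator of SEPARATORS) {
    const fields = trimmed.split(separator);
    // Accept the first split that yields exactly the 10 CoNLL-U columns.
    if (fields.length === 10) {
      return fields.join("\t");
    }
  }
  // Ambiguous (for example, a token with a space where every column gap
  // collapsed to one space): return the line unchanged rather than guess.
  return line;
}

function normalizeConllu(text) {
  return text.split("\n").map(normalizeConlluLine).join("\n");
}
```

In the sentence pasted above, most column gaps are several spaces wide, but a few (such as the one before the head index on token 5) are a single space, so those lines only resolve on the final any-whitespace pass. The heuristic can still be fooled: a line whose token contains a space and whose column gaps partly collapsed could split into 10 fields by accident, so a real fix would probably also want to validate the ID and HEAD columns.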

jonorthwash commented 6 years ago

Yeah, I was wondering about tokens with spaces. How did you originally implement parsing (converting?) of these lines? It's probably fairly easy to reimplement it.

jonorthwash commented 6 years ago

This issue might be best filed against notatrix. @keggsmurph21, what do you think?

jonorthwash commented 6 years ago

The parser is in parser.js somewhere; it should be fairly easy to work out where this goes.