jonorthwash / ud-annotatrix

GNU General Public License v3.0

CG character ": " not parsed correctly #321

Open rueter opened 6 years ago

rueter commented 6 years ago

Annotatrix eats input:

"<Синь>"
        "синь" Pron Pers Pl3 Gen
        "синь" Pron Pers Pl3 Nom
: 
"<эсост>"
        "эса" Pron Ine PxPl3
        "эсост" Adv Ine PxPl3
: 
"<нинге>"
        "ни" Adv Temp Foc
        "нинге" Adv
: 
"<точкат>"
        "точка" N Pl Nom Indef
"<->"
        "-" PUNCT
"<кутт>"
        "куд" N Pl Nom Indef
"<,>"
        "," CLB
: 
"<Сай>"
        "самс" V IV Ind Prs ScSg3
        "самс" V IV V Act PrsPrc
        "самс" V IV V Der/NomAg N Sg Nom Indef
: 
"<пинге>"
        "пинге" N Sg Nom Indef
: 
"<—>"
        "—" PUNCT
: 
"<касыхть>"
        "касомс" V IV Ind Prs ScPl3
        "касы" N Pl Nom Indef
: 
"<ошсон>"
        "ош" N SP Ine PxSg1
"<.>"
        "." CLB

ftyers commented 6 years ago

@keggsmurph21

<spectie> https://github.com/jonorthwash/ud-annotatrix/issues/321
<spectie> why are there 
<spectie> :
<spectie> in the GT CG output ?
<TinoDidriksen> Those are literal spaces from hfst-tokenise --giella-cg mode.
<TinoDidriksen> It's ": " not just :
<Unhammer> «superblanks»
<Unhammer> or just blanks
<Unhammer> I don't understand the issue
<Unhammer> what did it eat?
<spectie> hmm
<spectie> well, whatever annotatrix is using can't cope with it 
<Unhammer> grep -v ^: ? or even sed 's/^:.*//'
<Flammie[m]> isn't basically any line not containing " a comment in CG
<Unhammer> weeel
<Unhammer> it should start with " or whitespace
<Unhammer> does it need a lemma?
<Unhammer> "<>"
<Unhammer>  tag
<Flammie[m]> might work
<Unhammer> also there's stuff like <setvariable> or something, but I'm guessing that's also not handled by annotatrix
<TinoDidriksen> Any CG parser should treat non-matching stuff as text.
<TinoDidriksen> Those : are perfectly valid.
<spectie> ok!
<spectie> thanks
<spectie> i'll pass it on
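
The line types discussed in the log above can be sketched as a small classifier. This is only an illustration in Python, not code from annotatrix or CG-3; the function name `classify_cg_line` is hypothetical, and it assumes that a reading is an indented line beginning with a quoted lemma (the log leaves open whether a lemma is strictly required). Stream commands such as the `<setvariable>`-style lines mentioned above are not handled.

```python
def classify_cg_line(line: str) -> str:
    """Classify one line of CG stream input (hypothetical helper).

    Returns 'wordform', 'reading', or 'text'. Anything that does not
    look like CG is reported as text rather than dropped.
    """
    # A cohort starts with a wordform line such as "<нинге>"
    if line.startswith('"<') and line.rstrip().endswith('>"'):
        return "wordform"
    # Assumption: a reading is indented and starts with a quoted lemma
    if line[:1] in (" ", "\t") and line.lstrip().startswith('"'):
        return "reading"
    # Everything else, including the ": " lines, is plain text
    return "text"
```

Unlike the `grep -v` workaround, this keeps the `: ` lines, so a parser can still pass them through.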

jonorthwash commented 6 years ago

Relevant points:

jonorthwash commented 6 years ago

This issue might be best filed against notatrix. @keggsmurph21, what do you think?

jonorthwash commented 6 years ago

The parser is in parser.js somewhere. This functionality should be fairly straightforward to add?

keggsmurph21 commented 6 years ago

what does "treated as text" mean in this context?

jonorthwash commented 5 years ago

> what does "treated as text" mean in this context?

@TinoDidriksen, could you clarify what you meant by this statement?

> <TinoDidriksen> Any CG parser should treat non-matching stuff as text.

TinoDidriksen commented 5 years ago

The way CG-3 does it is that any non-CG input gets bundled up in a buffer attached to the immediately preceding cohort, and is then spit out again completely untouched once processing is finished.

If the cohort is moved, the attached text goes with it. If the cohort is deleted, the text is still output where the cohort would have been.

This lets CG-3 transparently pass along all sorts of markup.

jonorthwash commented 5 years ago

So in this case, each occurrence of "\n: " would be part of the cohort on the previous line, right?

TinoDidriksen commented 5 years ago

Yes.

With the important caveat that I really mean cohort, not reading. Cohorts own the non-CG parts. Readings do not, because readings have messy lives. So non-CG interspersed between readings will get bundled up as one lump sum, output after the owning cohort.
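
A minimal sketch of the ownership rule described here, written in Python purely for illustration. It is not CG-3's or notatrix's implementation; the names `Cohort`, `parse_cg`, and `serialize_cg` are hypothetical. It assumes that a reading is an indented line starting with a quoted lemma, and that text appearing before the first cohort is kept in a separate leading buffer (the thread does not say how CG-3 handles that case).

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    wordform: str                                       # the "<...>" line, verbatim
    readings: list = field(default_factory=list)        # reading lines, verbatim
    trailing_text: list = field(default_factory=list)   # non-CG lines owned by this cohort


def parse_cg(stream: str):
    """Split a CG stream into cohorts, attaching non-CG lines to the previous cohort."""
    leading_text = []  # assumption: text before the first cohort is kept here
    cohorts = []
    for line in stream.split("\n"):
        if line.startswith('"<'):
            cohorts.append(Cohort(wordform=line))
        elif cohorts and line[:1] in (" ", "\t") and line.lstrip().startswith('"'):
            cohorts[-1].readings.append(line)
        elif cohorts:
            # The cohort, not the reading, owns the text
            cohorts[-1].trailing_text.append(line)
        else:
            leading_text.append(line)
    return leading_text, cohorts


def serialize_cg(leading_text, cohorts) -> str:
    """Emit each cohort, then its attached text untouched."""
    lines = list(leading_text)
    for cohort in cohorts:
        lines.append(cohort.wordform)
        lines.extend(cohort.readings)
        lines.extend(cohort.trailing_text)
    return "\n".join(lines)
```

With input like the example in this issue, each `: ` line ends up in the `trailing_text` of the cohort before it, and serializing reproduces the input. If a non-CG line sits between two readings, both readings stay with the cohort and the text is emitted after them, which matches the "one lump sum" behavior described above.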

jonorthwash commented 5 years ago

Ah, right. I think I get how the parsing is supposed to work then. @keggsmurph21, does this make sense to you?

jonorthwash commented 5 years ago

A newline followed by ": " (a colon and a space) should be interpreted as part of the previous cohort when parsing CG format.