giellalt / bugzilla-dummy


Empty symbol analysed as "´" (02BC MODIFYER LETTER APOSTROPHE) (Bugzilla Bug 2674) #946

Closed albbas closed 3 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2674

Date: 2020-09-03T11:05:09+02:00
From: Trond Trosterud
To: Sjur Nørstebø Moshagen
CC: borre.gaup, chiara.argese, jeremy.bradley, rueter.jack, trond.trosterud, unhammer+apertium

Last updated: 2021-10-27T22:32:40+02:00

albbas commented 4 years ago

Comment 13974

Date: 2020-09-03 11:05:09 +0200 From: Trond Trosterud

Synopsis: The problem is that an empty symbol (actually: every SPACE character) is analysed as MODIFIER LETTER APOSTROPHE.

Input is: Йомак ¶ Туш то ¶

Command for analysis is: ccat -l mhr ~/rusbound/converted/mhr/ficti/|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

Output is:

"<Йомак>"
    "йомак" N Attr
    "йомак" N Sg Nom
:
"<¶>"
    "¶" CLB
:\n
"<>"
    "ʼ" N Symbol
"<Туш>"
    "ту" Hom2 N Sg Ill
    "ту" Hom3 N Sg Ill
    "туш" Adv
    "туш" Hom2 N Attr
    "туш" Hom2 N Sg Nom
    "туш" Hom3 N Attr
    "туш" Hom3 N Sg Nom
    "туш" Pron Pron Dem
:
"<то>"
    "то" CC CC"+WORK"
    "то" Pron Pron Ind
:
"<¶>"
    "¶" CLB
:\n
"<>"
    "ʼ" N Symbol

albbas commented 4 years ago

Comment 13975

Date: 2020-09-03 11:08:42 +0200 From: Trond Trosterud

Correction: It does not happen for spaces between words. Here I get them before and after :\n, i.e. at the end of the sentence:

echo "тиде книга." | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

"<тиде>"
    "тидаш" V ConNeg
    "тидаш" V Imprt Sg2
    "тиде" Pron Dem Sg Nom
"<книга>"
    "книга" A
    "книга" A Der/MWN N Attr
    "книга" A Der/MWN N Sg Nom
    "книга" N Attr
    "книга" N Sg Nom
"<.>"
    "." CLB
"<>"
    "ʼ" N Symbol
:\n
"<>"
    "ʼ" N Symbol

It happens only for mhr.

albbas commented 3 years ago

Comment 14222

Date: 2021-10-27 22:32:40 +0200 From: Sjur Nørstebø Moshagen

The problem seems to have been fixed:

echo "тиде книга." | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

"<тиде>"
    "тидаш" V ConNeg
    "тидаш" V Imprt Sg2
    "тиде" Pron Dem Sg Nom
"<книга>"
    "книга" A
    "книга" A Der/MWN N Attr
    "книга" A Der/MWN N Sg Nom
    "книга" N Attr
    "книга" N Sg Nom
"<.>"
    "." CLB
:\n
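To guard against a regression, the check for empty cohorts could be automated. The following is only a minimal sketch: it assumes the tokeniser prints each cohort (the "<...>" wordform line) on its own line, as is usual for CG-format output, and `count_empty_cohorts` is a hypothetical helper name, not part of HFST or the GiellaLT tooling.

```shell
#!/bin/sh
# Sketch: count cohorts whose wordform is empty ("<>") in CG-format text on stdin.
# Assumption: one cohort per line. count_empty_cohorts is a hypothetical helper.
count_empty_cohorts() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status at success.
    grep -c '^"<>"' || true
}
```

It would be used by appending it to the pipeline from this report, e.g. `echo "тиде книга." | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | count_empty_cohorts`. A count of 0 would suggest that no empty symbol is being analysed.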