giellalt / lang-vro

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Võro language
https://giellalt.uit.no
GNU Lesser General Public License v3.0

The vro tokeniser-disamb-gt-desc.pmhfst has a problem with the UTF-8 combination t AND U+0301 in lemma readout (Bugzilla Bug 2647) #1

albbas closed this issue 2 months ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2647

Date: 2020-02-21T16:31:08+01:00
From: Jack Rueter
To: Sjur Nørstebø Moshagen
CC: trond.trosterud

Last updated: 2020-02-21T16:31:08+01:00

albbas commented 4 years ago

Comment 13845

Date: 2020-02-21 16:31:08 +0100
From: Jack Rueter

Created attachment 229 png of tokeniser output for vro text with lemma containing U+0301

cd main/langs/vro

head config.log
$ ./configure --with-hfst --without-xfst --enable-tokenisers --enable-reversed-intersect --enable-spellers --enable-alignment --enable-apertium --enable-dicts --enable-morpher --with-giella-shared=/Users/rueter/main/giella-shared --with-giella-core=/Users/rueter/main/giella-core GIELLA_CORE=/Users/rueter/main/giella-core/dir GTCORE=/Users/rueter/./main/giella-core GIELLA_SHARED=/Users/rueter/main/giella-shared/dir

echo 'mitte' | hfst-tokenise --giella-cg -W $GTHOME/langs/vro/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |less

"<mitte>"
    "mit"t́ N Pl Gen
    "mit"t́ N Pl Ill
    "mit"t́ N Pl Par
    "mi"t́"mä" V Act Ind Prt Sg3
:\n

In lemma-final position, the combination t + U+0301 is left outside the quoted lemma; see "mit"t́

In non-final position, the lemma is split at that combination and the lemma material that follows it is quoted separately; see "mi"t́"mä"
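For context, U+0301 is COMBINING ACUTE ACCENT, and Unicode has no precomposed "t with acute" character, so the lemma necessarily contains a base letter followed by a separate combining mark. The sketch below only makes that byte sequence visible; `hex_bytes` is a hypothetical helper name, not part of HFST or the GiellaLT tools.

```shell
# hex_bytes is a hypothetical helper (not an HFST or GiellaLT command).
# It prints whatever arrives on stdin as one continuous string of hex byte values.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# "mitt" followed by U+0301, written with octal escapes so any POSIX printf accepts it.
printf 'mitt\314\201' | hex_bytes
# prints 6d697474cc81: the final "t" is 74, and U+0301 is the two bytes cc 81
```

Running the same helper on a lemma copied from the tokeniser output may help confirm whether the combining mark ended up inside or outside the closing quote.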

Attached file: vro-tokeniser-problem-2020-02-22.png (image/png, 149077 bytes)
Description: png of tokeniser output for vro text with lemma containing U+0301

flammie commented 2 months ago

works today:

$ echo 'mitte' | hfst-tokenise --giella-cg -W tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<mitte>"
    "mitt́" N Pl Gen
    "mitt́" N Pl Ill
    "mitt́" N Pl Par
    "mit́mä" V Act Ind Prt Sg3
:\n
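A rough way to check output like this mechanically is to flag any reading line whose lemma is not a single quoted string followed by unquoted tags. This is a minimal sketch, assuming simple readings of the shape shown in this thread (indented line, quoted lemma, space-separated tags); `malformed_readings` is a hypothetical helper name, and real output with sub-readings or quoted tags would need a looser pattern.

```shell
# malformed_readings is a hypothetical helper (not an HFST or GiellaLT command).
# It keeps only indented reading lines, then prints those that do NOT have the
# shape: whitespace, one "quoted lemma", then zero or more tags without quotes.
malformed_readings() {
  grep -E '^[[:space:]]+"' | grep -Ev '^[[:space:]]+"[^"]+"( [^" ]+)*$'
}

# Sample input: one reading from the original report, one from the fixed output.
printf '%s\n' '"<mitte>"' '    "mit"t́ N Pl Gen' '    "mitt́" N Pl Gen' ':\n' | malformed_readings
# prints only the line with the broken lemma:     "mit"t́ N Pl Gen

# Assumed usage with the command from this thread (needs the built tokeniser):
# echo 'mitte' | hfst-tokenise --giella-cg -W tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | malformed_readings
```

Empty output means no reading line was flagged; note that `grep` then exits with status 1, which matters if the check runs under `set -e`.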