giellalt / bugzilla-dummy


Pound symbol `#` is deleted by hfst-tokenise (Bugzilla Bug 2627) #935

Closed albbas closed 4 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2627

Date: 2019-10-23T19:19:34+02:00
From: Robert Reynolds
To: Sjur Nørstebø Moshagen
CC: borre.gaup, lene.antonsen, linda.wiechetek, trond.trosterud, unhammer+apertium

Last updated: 2019-12-17T09:50:06+01:00

albbas commented 4 years ago

Comment 13769

Date: 2019-10-23 19:19:34 +0200
From: Robert Reynolds

In the following example, the token # is deleted by the tokeniser.

$ echo "# – это не слово." | hfst-tokenize $GTHOME/langs/rus/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
– это не слово .

albbas commented 4 years ago

Comment 13824

Date: 2019-12-17 09:50:06 +0100
From: Sjur Nørstebø Moshagen

Fixed in svn revs 186224 and 186225:

$ echo "# – это не слово." | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<#>"
	"#" N Symbol
"<–>"
	"–" PUNCT
:
"<это>"
	"этот" Det Dem Neu AnIn Sg Acc
	"этот" Det Dem Neu AnIn Sg Nom
	"этот" Pron Dem Neu AnIn Sg Acc
	"этот" Pron Dem Neu AnIn Sg Nom
	"это" Pcle
	"это" Pron Dem Neu Inan Sg Acc
	"это" Pron Dem Neu Inan Sg Nom
:
"<не>"
	"не" Pcle
:
"<слово>"
	"слово" N Neu Inan Sg Acc
	"слово" N Neu Inan Sg Nom
"<.>"
	"." CLB
:\n

And without the -g option:

$ echo "# – это не слово." | hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
# – это не слово .
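
A check along the following lines could help catch a regression of this kind. It is only a sketch: `strip_ws` and `tokens_preserve_input` are hypothetical helper names, not part of HFST or the giellalt infrastructure, and the comparison assumes the plain output mode (without -g), where the tokeniser is expected to change nothing but whitespace.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HFST): detect tokens dropped by a tokeniser.

# strip_ws TEXT: print TEXT with all spaces, tabs and newlines removed.
strip_ws() {
    printf '%s' "$1" | tr -d ' \t\n'
}

# tokens_preserve_input INPUT OUTPUT: succeed only if OUTPUT contains exactly
# the same non-whitespace characters as INPUT, in the same order.
tokens_preserve_input() {
    [ "$(strip_ws "$1")" = "$(strip_ws "$2")" ]
}

# The two outputs quoted in this issue, copied as literal strings.
input='# – это не слово.'
before_fix='– это не слово .'
after_fix='# – это не слово .'

if tokens_preserve_input "$input" "$before_fix"; then
    echo "before fix: all tokens preserved"
else
    echo "before fix: token(s) lost"
fi

if tokens_preserve_input "$input" "$after_fix"; then
    echo "after fix: all tokens preserved"
else
    echo "after fix: token(s) lost"
fi
```

In practice the second argument would be the captured output of the hfst-tokenise pipeline shown above rather than a literal string.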