giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

Soft hyphen doesn't get analysis #26

Open carges opened 3 years ago

carges commented 3 years ago

I can see in the terminal that the symbol doesn't get any analysis, but if I copy the text from the terminal, the symbol is not visible:

echo "­luid" | hfst-tokenise --print-all --giella-cg --no-weights --unique tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 --grammar tools/tokenisers/mwe-dis.bin | cg-mwesplit | vislcg3 --grammar src/cg3/disambiguator.bin | vislcg3 --grammar src/cg3/functions.cg3 | vislcg3 --grammar src/cg3/dependency.bin
:­
"<luid>"
    "luid" ? @X #1->0
:\n

If I look up the Unicode character, I find that it is called a soft hyphen. Here is a screenshot of the terminal:

[Screenshot 2021-07-20 at 14 50 06]

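Since the soft hyphen is invisible when copied, one way to confirm it is present is to dump the input as bytes. This is only a sketch: `show_bytes` is a hypothetical helper name, and the `printf` octal escapes spell out the UTF-8 encoding of U+00AD (bytes C2 AD) so that the example does not depend on an invisible character surviving copy and paste.

```shell
# Print stdin as hex bytes so that invisible characters become visible.
# show_bytes is a hypothetical helper, not part of the giellalt tools.
show_bytes() {
    od -An -tx1
}

# \302\255 is the UTF-8 byte sequence C2 AD, i.e. U+00AD SOFT HYPHEN.
printf '\302\255luid\n' | show_bytes
```

The dump starts with the bytes `c2 ad`, followed by the bytes for `luid` and the newline.
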
Trondtr commented 3 years ago

The sign is defined:

grep '­' tools/tokenisers/*.pmscript
tools/tokenisers/tokeniser-disamb-gt-desc.pmscript:Define incondform      Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{­}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•}|{} ;
tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript:Define incondform      Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{­}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•} ;

but as incondform:

! Characters which have analyses in the lexicon, but can appear without spaces
! before/after, that is, with no context conditions, and adjacent to words:
! The symbol following {•} is U+FEFF.
Define incondform      Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{-}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•}|{ } ;

The definition "token" does not contain incondform:

Define token [ morphoword | unknownwordEmpty | incondword | Ins(urlword) ] EndTag(token);

where "incondword" is

Define incondword       morphology & [ any* incondform:[?*] nonprintable* ] ; ! Ends in punctuation – no context condition

So it seems this is carefully designed not to allow SHY (U+00AD) to be recognised when occurring alone. Although SHY usually does not occur alone, the setup is not robust enough to deal with the SHY that ran away from home. @snomos: We should thus consider making the system more robust by allowing stray SHY (after considering possible drawbacks).

A quick fix seems to be to delete all SHY before preprocessing.
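
A minimal sketch of that quick fix, assuming the input is UTF-8 and that it is acceptable to drop every SHY, including those inside words. `strip_shy` is a hypothetical helper name, not an existing giellalt tool. The soft hyphen is built with `printf` octal escapes rather than typed literally, so the script stays readable.

```shell
# Remove every soft hyphen (U+00AD, UTF-8 bytes C2 AD) from stdin.
# strip_shy is a hypothetical helper, not part of the giellalt tools.
strip_shy() {
    shy=$(printf '\302\255')
    sed "s/${shy}//g"
}

# Stray SHY at the start of a word and SHY inside a word are both removed.
printf '\302\255luid\n' | strip_shy
printf 'lu\302\255id\n' | strip_shy

# Assumed usage in front of the pipeline from the report:
#   echo "$text" | strip_shy | hfst-tokenise --print-all --giella-cg ... 
```

Both `printf` lines print `luid`. The drawback is that the analyser never sees where the SHY was, so this only fits if nothing later in the pipeline needs that information.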