Open carges opened 3 years ago
The sign is defined:
grep '' tools/tokenisers/*.pmscript
tools/tokenisers/tokeniser-disamb-gt-desc.pmscript:Define incondform Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•}|{} ;
tools/tokenisers/tokeniser-gramcheck-gt-desc.pmscript:Define incondform Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•} ;
but only as part of incondform:
! Characters which have analyses in the lexicon, but can appear without spaces
! before/after, that is, with no context conditions, and adjacent to words:
! The symbol following {•} is U+FEFF.
Define incondform Punct|{„}|{“}|{”}|{…}|{‚}|{‘}|{’}|{–}|{—}|{-}|{_}|{<}|{>}|{«}|{»}|{@}|{'}|{‹}|{›}|{➤}|{•}|{ } ;
The definition "token" does not contain incondform:
Define token [ morphoword | unknownwordEmpty | incondword | Ins(urlword) ] EndTag(token);
where "incondword" is
Define incondword morphology & [ any* incondform:[?*] nonprintable* ] ; ! Ends in punctuation – no context condition
So it seems this was carefully designed so that SHY (U+00AD) is not recognised when it occurs alone. That is usually the right behaviour, but the setup is not robust enough to deal with the SHY that ran away from home. @snomos: We should therefore consider making the system more robust by allowing stray SHY, after weighing the possible drawbacks.
A quick fix seems to be to delete all SHY before preprocessing.
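To make the quick fix concrete, here is a minimal sketch of a filter that could run before the tokeniser. It assumes the input is plain UTF-8 text on stdin. The function name `strip_soft_hyphens` is made up for illustration and is not part of the existing tools.

```python
import sys

SOFT_HYPHEN = "\u00ad"  # SHY, written as an escape so it stays visible in the source


def strip_soft_hyphens(text: str) -> str:
    """Return text with every U+00AD removed (hypothetical helper, not an existing tool)."""
    return text.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    # Assumption: used as a stdin-to-stdout filter placed in front of the tokeniser.
    for line in sys.stdin:
        sys.stdout.write(strip_soft_hyphens(line))
```

Note that this removes every SHY, including those inside words, so the drawbacks mentioned above would still need to be checked.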
I can see in the terminal that the symbol doesn't get any analysis, but if I try to copy the text from the terminal, the symbol is not visible:
If I look up the Unicode code point, I find that it is called a soft hyphen. Here is a screenshot of the terminal:
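Since the character disappears when copied, a small script can list invisible characters by code point instead. This is only a sketch. It assumes that reporting characters in the Unicode general category `Cf` (format characters, which includes both U+00AD and U+FEFF) is enough for this case, and the function name `reveal_invisible` is made up for illustration.

```python
import unicodedata


def reveal_invisible(text: str) -> list[tuple[int, str, str]]:
    """List (position, code point, Unicode name) for each format character in text."""
    found = []
    for position, char in enumerate(text):
        # Category "Cf" covers format characters such as SOFT HYPHEN.
        if unicodedata.category(char) == "Cf":
            code_point = f"U+{ord(char):04X}"
            name = unicodedata.name(char, "<unnamed>")
            found.append((position, code_point, name))
    return found


if __name__ == "__main__":
    sample = "ex\u00adample"  # sample input with an embedded soft hyphen
    for position, code_point, name in reveal_invisible(sample):
        print(position, code_point, name)
```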