Open snomos opened 1 year ago
This is because we block compounds with MWE-tagged words:
tail -2 giella-core/fst-filters/block-mwe-compounds.regex
# Change the +MWE tag into a flag diacritic:
"@U.CmpNone.TRUE@" "+MWE" <- "+MWE" ;
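To illustrate why that flag diacritic can block a compound, here is a minimal Python sketch of how unification (`U`) flags are checked along a path: a path is rejected as soon as the same feature is set to two different values. This is a simplified model, not HFST's actual implementation. The `path_is_valid` helper and the `@U.CmpNone.FALSE@` symbol on the compounding path are assumptions made for the example; which flag the compounding path really carries is not shown in this thread.

```python
import re

# Matches flag diacritics such as @U.CmpNone.TRUE@ (operator, feature, optional value).
FLAG_RE = re.compile(r"@([UPDRC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Return True if all U-flags on the path unify (simplified model).

    Only the U operator is modelled; other flag operators and ordinary
    symbols are ignored in this sketch.
    """
    values = {}
    for sym in symbols:
        m = FLAG_RE.fullmatch(sym)
        if m is None:
            continue  # ordinary symbol or tag, e.g. "+MWE"
        op, feature, value = m.groups()
        if op != "U":
            continue  # operators other than U are not modelled here
        if feature in values and values[feature] != value:
            return False  # same feature, different value: unification fails
        values[feature] = value
    return True


# An MWE on its own: the flag is set once, so the path survives.
mwe_alone = ["@U.CmpNone.TRUE@", "+MWE"]

# Hypothetical compound path: ASSUMES the compounding part sets CmpNone to FALSE.
mwe_in_compound = ["@U.CmpNone.TRUE@", "+MWE", "@U.CmpNone.FALSE@", "+Cmp"]

print(path_is_valid(mwe_alone))
print(path_is_valid(mwe_in_compound))
```

Under that assumption, the MWE analysis on its own is kept, while the path that continues into a compound is discarded. That would be consistent with the tokeniser splitting the input into separate tokens.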
And this filter is used in the grammar checker FST:
grep -rl 'block-mwe-compounds' giella-core/am-shared/*
giella-core/am-shared/src-filters-dir-include.am
giella-core/am-shared/src_disamb-include.am
giella-core/am-shared/src_gramcheck-include.am
Is this a problem, @lynnda-hill or @duomdaamaendra?
Or is it OK to treat compounds containing MWEs as above, that is, as a sequence of three tokens (the MWE, a hyphen, and whatever comes after)?
This is what I had expected:
And this is what hfst-tokenise gives:
It should have been one token.