giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

No norm analysis for `Standing Rock-vuosttaldemiin`, wrong tokenisation #65

Open snomos opened 1 year ago

snomos commented 1 year ago

This is what I had expected:

Standing Rock-vuosttaldemiin    @U.Cap.Obl@Standing Rock+CmpNP/First+N+Prop+Sem/Plc@U.Cap.Obl@+Cmp/SgNom@P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@@D.CmpNone.TRUE@@U.CmpNone.FALSE@@P.CmpOnly.TRUE@@C.CmpHyph@+Cmp/Hyph+Cmp#@P.Px.add@vuosttaldeapmi+N+CmpN/SgN+CmpN/SgNomLeft+CmpN/SgGenLeft+CmpN/PlGenLeft@C.NeedNoun@+Sg+Com@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@@D.SpellRlx.ON@@C.SpellRlx@@D.SpaceCmp.ON@@C.SpaceCmp@

And this is what hfst-tokenise gives:

echo Standing Rock-vuosttaldemiin | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst   
"<Standing Rock>"
    "Standing Rock" MWE N Prop Sem/Plc Err/Orth Sg Acc <W:0.0>
    "Standing Rock" MWE N Prop Sem/Plc Err/Orth Sg Gen <W:0.0>
    "Standing Rock" MWE N Prop Sem/Plc Sg Nom <W:0.0>
    "Standing Rock" MWE N Prop Sem/Sur Attr <W:0.0>
    "Standing Rock" MWE N Prop Sem/Sur Err/Orth Sg Acc <W:0.0>
    "Standing Rock" MWE N Prop Sem/Sur Err/Orth Sg Gen <W:0.0>
    "Standing Rock" MWE N Prop Sem/Sur Sg Nom <W:0.0>
"<->"
    "-" PUNCT <W:0.0>
"<vuosttaldemiin>"
    "vuosttaldeapmi" N Sem/Act Pl Loc <W:0.0>
    "vuosttaldeapmi" N Sem/Act Pl Loc Err/Orth <W:0.0>
    "vuosttaldeapmi" N Sem/Act Sg Com <W:0.0>
    "vuosttaldeapmi" N Sem/Act Sg Com Err/Orth <W:0.0>
    "vuosttaldit" Ex/V TV Gram/3syll Der/NomAct N Pl Loc <W:0.0>
    "vuosttaldit" Ex/V TV Gram/3syll Der/NomAct N Pl Loc Err/Orth <W:0.0>
    "vuosttaldit" Ex/V TV Gram/3syll Der/NomAct N Sg Com <W:0.0>
    "vuosttaldit" Ex/V TV Gram/3syll Der/NomAct N Sg Com Err/Orth <W:0.0>
    "vuosttaldit" V TV Gram/3syll Actio Com <W:0.0>
:\n

It should have been one token.

snomos commented 10 months ago

This is because we block compounds with MWE-tagged words:

tail -2 giella-core/fst-filters/block-mwe-compounds.regex
# Change the +MWE tag into a flag diacritic:
"@U.CmpNone.TRUE@" "+MWE" <- "+MWE" ;

And this filter is used in the grammar checker FST:

grep -rl 'block-mwe-compounds' giella-core/am-shared/*
giella-core/am-shared/src-filters-dir-include.am
giella-core/am-shared/src_disamb-include.am
giella-core/am-shared/src_gramcheck-include.am
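
To illustrate why the filter quoted above would block the compound reading, here is a minimal Python sketch of flag-diacritic evaluation. It is not the HFST implementation and covers only the P, U, D and C operators. It assumes that the `@U.CmpNone.TRUE@` flag introduced by the filter is met before the compound flags seen in the expected analysis; the exact position of `+MWE` on the path is an assumption, and the names `path_is_valid`, `COMPOUND_FLAGS` and `MWE_FLAG` are hypothetical.

```python
import re

# Matches flags such as @U.CmpNone.TRUE@ or @C.CmpHyph@ (value is optional).
FLAG = re.compile(r"@([PUDC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(path: str) -> bool:
    """Return True if the flag diacritics on `path` are mutually compatible.

    Simplified semantics (assumption: only P, U, D and C are handled):
      P.F.V  set feature F to V
      U.F.V  succeed if F is unset or already V, then set F to V
      D.F.V  fail if F is currently V (D.F fails if F has any value)
      C.F    clear F
    """
    state = {}
    for op, feature, value in FLAG.findall(path):
        if op == "P":
            state[feature] = value
        elif op == "C":
            state.pop(feature, None)
        elif op == "U":
            if feature in state and state[feature] != value:
                return False
            state[feature] = value
        elif op == "D":
            if value == "":
                if feature in state:
                    return False
            elif state.get(feature) == value:
                return False
    return True


# Excerpt of the compound flags from the expected analysis in this issue.
COMPOUND_FLAGS = "@P.CmpFrst.FALSE@@D.CmpNone.TRUE@@U.CmpNone.FALSE@+Cmp/Hyph+Cmp#"
# What block-mwe-compounds.regex puts in front of the +MWE tag.
MWE_FLAG = "@U.CmpNone.TRUE@+MWE"

print(path_is_valid("Standing Rock" + COMPOUND_FLAGS))             # True
print(path_is_valid("Standing Rock" + MWE_FLAG + COMPOUND_FLAGS))  # False
```

Under these assumptions, once `CmpNone` has been set to `TRUE` by the MWE filter, the later `@D.CmpNone.TRUE@` on the compound path fails, so only the three-token reading survives.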

Is this a problem, @lynnda-hill or @duomdaamaendra?

Or is it OK to treat compounds with MWEs, like the one above, as a sequence of three tokens (the MWE, a hyphen, and whatever comes after)?