giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

Some but not all main-readings have wordform-tags - but AFAICS all have? #75

Closed: snomos closed this issue 11 months ago

snomos commented 11 months ago
echo Biret-Ingá čohkana biilii. | ./tools/grammarcheckers/modes/smegramrelease.mode   
WARNING: Line 10: Some but not all main-readings of "<čohkana>" had wordform-tags (not completely mwe-disambiguated?), not splitting.
"<Biret-Ingá>"
    "Ingá" N Prop Sem/Fem Sg Nom <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @SUBJ>
        "Biret" N Prop Sem/Fem Cmp/SgNom Cmp/Hyph Cmp <W:0.0> <firstCohort>
: 
"<čohkana>"
    "na" Pcle <W:0.0> "<na>" @PCLE
        "čogas" A Sem/Dummytag Attr <W:0.0> "<čohka>"
    "na" Pcle <W:0.0> "<na>" @PCLE
        "čohkat" V TV Ind Prs Sg3 <W:0.0> "<čohka>"
    "na" Pcle <W:0.0> "<na>" @PCLE
        "čohkka" N Sem/Event_Plc-elevate Sg Acc <W:0.0> "<čohka>"
    "na" Pcle <W:0.0> "<na>" @PCLE
        "čohkka" N Sem/Event_Plc-elevate Sg Gen Allegro <W:0.0> "<čohka>"
    "na" Pcle <W:0.0> "<na>" @PCLE
        "čohkka" N Sem/Event_Plc-elevate Sg Gen <W:0.0> "<čohka>"
    "čohkánit" Err/Orth-a-á <mv> V <LO-Ill-Any> IV Gram/3syll Ind Prs Sg3 <W:0.0> @+FMAINV
    "čohkánit" <mv> V <LO-Ill-Any> IV Gram/3syll Ind Prs Sg3 Err/Spellrelax <W:0.0> @+FMAINV
: 
"<biilii>"
    "biila" N Sem/Veh Sg Ill <W:0.0> @<ADVL
"<.>"
    "." CLB <W:0.0> <LastCohort>
:\n

Which word form does not have a word form-tag? How can we avoid this warning? It blocks proper retokenisation/disambiguation of MWE strings.

@lynnda-hill in this case the Verb reading should be selected in the disambiguation.

snomos commented 11 months ago

See also the discussion in the thread here

unhammer commented 11 months ago

That error message comes from the mwe-split step; it is the input to mwe-split that has to be disambiguated to avoid the message.

$ ls modes/smegramrelease*-mwe-split.mode
modes/smegramrelease4-mwe-split.mode
$ ls modes/smegramrelease3*
modes/smegramrelease3-cg.mode

so it should be step 3 that produces output that is ambiguous with respect to tokenisation.

That step runs mwe-dis.bin, which comes from mwe-dis.cg3.

unhammer commented 11 months ago
$ echo Biret-Ingá čohkana biilii. | ./modes/smegramrelease3-cg.mode
"<Biret-Ingá>"
        "Ingá" N Prop Sem/Fem Attr <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound>
                "Biret" N Prop Sem/Fem Cmp/SgNom Cmp/Hyph Cmp <W:0.0> <firstCohort>
        "Ingá" N Prop Sem/Fem Sg Nom <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound>
                "Biret" N Prop Sem/Fem Cmp/SgNom Cmp/Hyph Cmp <W:0.0> <firstCohort>
        "Ingá" N Prop Sem/Fem Attr <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound>
                "Biret" N Prop Sem/Fem Cmp/SgNom Err/Orth Cmp/Hyph Cmp <W:0.0> <firstCohort>
        "Ingá" N Prop Sem/Fem Sg Nom <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound>
                "Biret" N Prop Sem/Fem Cmp/SgNom Err/Orth Cmp/Hyph Cmp <W:0.0> <firstCohort>
:
"<čohkana>"
        "na" Pcle <W:0.0> "<na>"
                "čogas" A Sem/Dummytag Attr <W:0.0> "<čohka>"
        "na" Pcle <W:0.0> "<na>"
                "čohkat" V TV Ind Prs Sg3 <W:0.0> "<čohka>"
        "na" Pcle <W:0.0> "<na>"
                "čohkka" N Sem/Event_Plc-elevate Sg Acc <W:0.0> "<čohka>"
        "na" Pcle <W:0.0> "<na>"
                "čohkka" N Sem/Event_Plc-elevate Sg Gen Allegro <W:0.0> "<čohka>"
        "na" Pcle <W:0.0> "<na>"
                "čohkka" N Sem/Event_Plc-elevate Sg Gen <W:0.0> "<čohka>"
        "čohkánit" Err/Orth-a-á V <LO-Ill-Any> IV Gram/3syll Ind Prs Sg3 <W:0.0>
        "čohkánit" V <LO-Ill-Any> IV Gram/3syll Ind Prs Sg3 Err/Spellrelax <W:0.0>
:
"<biilii>"
        "biila" N Sem/Veh Sg Ill <W:0.0>
        "biile" N Sem/Dummytag Sg Ill <W:0.0>
"<.>"
        "." CLB <W:0.0> <LastCohort>
:\n

Here we see that čohkana can be either two tokens, "<čohka>" + "<na>", or one token, "<čohkana>". In that case mwe-split does not know whether it should split or not. The solution is to disambiguate this in mwe-dis.cg3.
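To make the condition behind the warning concrete, here is a minimal Python sketch that scans a CG stream and reports cohorts where some, but not all, main readings carry a wordform-tag. It is not the mwe-split implementation. It assumes that a main reading counts as tagged if it or any of its subreadings contains a quoted "<...>" tag, and that subreadings are indented deeper than their main reading. The names parse_cohorts and mixed_wordform_cohorts are made up for this illustration, and the sample is an abridged copy of the step 3 output above.

```python
import re

# A wordform-tag is a quoted "<...>" token inside a reading line,
# e.g. "<čohka>" or "<na>" in the output above.
WORDFORM_TAG = re.compile(r'"<[^>]*>"')


def parse_cohorts(stream):
    """Group a CG stream into (wordform, readings) pairs.

    Each reading is a list of lines: the main reading first,
    followed by its (more deeply indented) subreadings.
    """
    cohorts = []
    main_indent = None
    for line in stream.splitlines():
        if line.startswith('"<'):
            # Unindented "<...>" line: start of a new cohort.
            cohorts.append((line.strip(), []))
            main_indent = None
        elif cohorts and line[:1] in (" ", "\t") and line.strip().startswith('"'):
            indent = len(line) - len(line.lstrip())
            readings = cohorts[-1][1]
            if main_indent is None or indent <= main_indent:
                main_indent = indent
                readings.append([line.strip()])      # new main reading
            else:
                readings[-1].append(line.strip())    # subreading
        # Anything else (e.g. ":" lines) is ignored.
    return cohorts


def mixed_wordform_cohorts(stream):
    """Return wordforms where some but not all main readings are tagged."""
    mixed = []
    for wordform, readings in parse_cohorts(stream):
        flags = [any(WORDFORM_TAG.search(l) for l in reading)
                 for reading in readings]
        if any(flags) and not all(flags):
            mixed.append(wordform)
    return mixed


sample = '''"<čohkana>"
\t"na" Pcle <W:0.0> "<na>"
\t\t"čohkat" V TV Ind Prs Sg3 <W:0.0> "<čohka>"
\t"čohkánit" V <LO-Ill-Any> IV Gram/3syll Ind Prs Sg3 Err/Spellrelax <W:0.0>
"<biilii>"
\t"biila" N Sem/Veh Sg Ill <W:0.0>
'''

print(mixed_wordform_cohorts(sample))  # prints ['"<čohkana>"']
```

In the sample, the "na" readings carry "<na>" and "<čohka>", while the "čohkánit" reading has no wordform-tag at all, so the cohort is reported as mixed. Under the assumptions above, that is the situation the warning describes.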

snomos commented 11 months ago

I refer to the comment from @leneantonsen in the Zulip discussion linked further up:

Why are you getting an analysis of čohkana with 'na' as Pcle? That is not a possible analysis in the analyser outside of gramcheck.

So in this case the right solution is to remove the impossible analysis from the FST. But it is good to know what the strategy should be for this error message in other cases.