giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

a compound is split into two cohorts without any reason #55

Closed lynnda-hill closed 10 months ago

lynnda-hill commented 2 years ago

Olgoriikkalaš Ruoŧas gii lea gullevaš muhtin EO riikii, dahje Norgii, Islandii dahje Liechtensteinii, beassá vuoddjit muohtaskuhteriin Ruoŧas, jus sus lea vuoddjilohpi su ruovtturiikas.

look at the last word:

"<ruovttu>"
"ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #27->27
"<riikas>"
"riika" N Err/Orth Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> ADD:2156 ADD:2156 @<ADVL MAP:23268 &typo #28->28 ADD:10143:Err/Orth-any
typo
"riika" N Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> ADD:2156 ADD:2156 @<ADVL MAP:23268 &typo &SUGGEST #28->28 ADD:10143:Err/Orth-any COPY:10152:Err/Orth-any
riika+N+Sg+Loc riikkas,riikkas,riikkas
; "riika" N Err/Orth Sem/Org Sg Acc PxSg3 <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> ADD:2156 ADD:2156 REMOVE:18827:r2392
; "riika" N Err/Orth Sem/Org Sg Gen PxSg3 <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> ADD:2156 ADD:2156 REMOVE:19832:r2615
"<.>"

snomos commented 1 year ago

Here is what I get:

 echo Olgoriikkalaš Ruoŧas gii lea gullevaš muhtin EO riikii, dahje Norgii, Islandii dahje \
Liechtensteinii, beassá vuoddjit muohtaskuhteriin Ruoŧas, jus sus lea vuoddjilohpi su ruovtturiikas. \
| ./tools/grammarcheckers/modes/smegramrelease.mode
WARNING: Line 0: Some but not all main-readings of "<Olgoriikkalaš>" had wordform-tags (not completely mwe-disambiguated?), not splitting.
divvun-suggest: WARNING: Broken MWE wordform in analyses: riikkalaš
divvun-suggest: WARNING: Broken MWE wordform in analyses: Olgo
divvun-suggest: WARNING: Broken MWE wordform in analyses: riikkalaš
divvun-suggest: WARNING: Broken MWE wordform in analyses: Olgo

and also this analysis, which is the same as the one Linda gets:

: 
"<ruovttu>"
    "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #27->27
"<riikas>"
    "riika" N Err/Orth Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo #28->28
typo
    "riika" N Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo &SUGGEST #28->28
riika+N+Sg+Loc  riikkas,riikkas,riikkas
"<.>"
    "." CLB <W:0.0> <LastCohort> #29->29
:\n

flammie commented 1 year ago

This is what it looks like for me:

echo Olgoriikkalaš Ruoŧas gii lea gullevaš muhtin EO riikii, dahje Norgii, Islandii dahje \
Liechtensteinii, beassá vuoddjit muohtaskuhteriin Ruoŧas, jus sus lea vuoddjilohpi su ruovtturiikas. \
| ./tools/grammarcheckers/modes/smegramrelease.mode
"<Olgoriikkalaš>"
    "olgoriika" Ex/N Sem/Plc Der/lasj A Attr <W:0.0> <firstCohort> @>N #1->1
    "olgoriika" Ex/N Sem/Plc Der/lasj A Sg Nom <W:0.0> <firstCohort> #1->1
    "olgoriikalaš" A Err/Orth Sem/Hum Attr <W:0.0> <firstCohort> @>N #1->1
    "olgoriikalaš" A Err/Orth Sem/Hum Sg Nom <W:0.0> <firstCohort> #1->1
    "olgoriikalaš" N Sem/Hum Err/Orth Sg Nom <W:0.0> <firstCohort> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @SUBJ> #1->1
    "riika" Ex/N Sem/Org Der/lasj A Attr <W:0.0> <firstCohort> @>N #1->1
        "olgu" N Sem/Plc Cmp/SgNom Cmp <W:0.0> <firstCohort> #1->1
    "riika" Ex/N Sem/Org Der/lasj A Sg Nom <W:0.0> <firstCohort> #1->1
        "olgu" N Sem/Plc Cmp/SgNom Cmp <W:0.0> <firstCohort> #1->1
: 
"<Ruoŧas>"
    "Ruoŧŧa" §LO N Prop Sem/Plc Sg Loc <W:0.0> #2->4
: 
"<gii>"
    "gii" Pron Sem/Hum Rel Sg Nom <W:0.0> @SUBJ> #3->3
: 
"<lea>"
    "leat" <mv> V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @+FMAINV #4->4
: 
"<gullevaš>"
    "gullevaš" A <TH-Ill-Any> Sem/Dummytag Attr <W:0.0> @>N #5->5
: 
"<muhtin>"
    "muhtin" Pron Indef Attr <W:0.0> @>N #6->6
    "muhtin" Pron Indef Err/Orth Attr <W:0.0> @>N #6->6
: 
"<EO riikii>"
    "EO-riika" N Sem/Org Sg Ill Err/SpaceCmp <W:0.0> @<ADVL &msyn-compound #7->7
msyn-compound
    "EO-riika" N Sem/Org Sg Ill <W:0.0> @<ADVL &SUGGEST #7->7
EO-riika+N+Sg+Ill   EO-riikii,EO-riikii,EO-riikii
"<,>"
    "," CLB <W:0.0> #8->8
: 
"<dahje>"
    "dahje" CC <W:0.0> @CNP #9->9
: 
"<Norgii>"
    "Norga" N Prop Sem/Plc Sg Ill <W:0.0> @ADVL> #10->10
"<,>"
    "," CLB <W:0.0> #11->11
: 
"<Islandii>"
    "Island" N Prop Sem/Plc Err/Lex Sg Ill <W:0.0> <ctjHead> <ctjHead> <ctjHead> @<ADVL &typo &SUGGEST #12->12
Island+N+Prop+Err/Lex+Sg+Ill    ?
    "Island" N Prop Sem/Sur Err/Lex Sg Ill <W:0.0> <ctjHead> <ctjHead> <ctjHead> @<ADVL &typo &SUGGEST #12->12
Island+N+Prop+Err/Lex+Sg+Ill    ?
    "Islánda" Err/Orth-a-á N Prop Sem/Plc Sg Ill <W:0.0> <ctjHead> <ctjHead> <ctjHead> @<ADVL &typo #12->12
typo
    "Islánda" N Prop Sem/Plc Sg Ill <W:0.0> <ctjHead> <ctjHead> <ctjHead> @<ADVL &typo &SUGGEST #12->12
Islánda+N+Prop+Sg+Ill   Islándii
: 
"<dahje>"
    "dahje" CC <W:0.0> @CNP #13->13
: 
"<Liechtensteinii>"
    "Liechtenstein" N Prop Sem/Plc Sg Ill <W:0.0> @<ADVL #14->14
    "Liechtenstein" N Prop Sem/Sur Sg Ill <W:0.0> @<ADVL #14->14
"<,>"
    "," CLB <W:0.0> #15->15
: 
"<beassá>"
    "beassat" <mv> V <ala-V> <EX-Nom-Ani> <mielde> <eret> <rasta> <badjel> <birra> <sisa> <mátkái> <mátkái><DE-Ill-Plc> <johtui><DE-Ill-Plc> <johtui> <IN-Com-Veh> <XT-Acc-Measure> <SO-luhtte-Ani> <DE-Ill-Plc> <DE-sisa-Build> <DE-lusa-Ani> <PT-Gen-Plc><DE-Ill-Any> <PT-Gen-Plc> <PT-rastá-Plc> <PT-meaddel-Plc> <PT-čađa-Plc> <PT-bokte-Plc> <SO-Loc-*Ani><DE-Ill-*Ani> <SO-Loc-*Ani> <CO-mielde-Ani> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @+FMAINV #16->16
: 
"<vuoddjit>"
    "vuodjit" Ex/V TV Der/NomAg N Sem/Hum Pl Nom <W:0.0> @<SUBJ &real-ImprtPl2-Inf #17->17
real-ImprtPl2-Inf
    "vuodjit" Sem/Hum <W:0.0> @<SUBJ V TV Inf &SUGGEST #17->17
vuodjit+V+TV+Inf    vuodjit
: 
"<muohtaskuhteriin>"
    "muohtaskuhter" N Sem/Veh Pl Loc <W:0.0> <cohort-with-dynamic-compound> @<ADVL #18->18
: 
"<Ruoŧas>"
    "Ruoŧŧa" Err/Orth N Prop Sem/Plc Sg Loc <W:0.0> @<ADVL #19->19
    "Ruoŧŧa" N Prop Sem/Plc Sg Loc <W:0.0> @<ADVL #19->19
"<,>"
    "," CLB <W:0.0> #20->20
: 
"<jus>"
    "jus" CS <W:0.0> @CVP #21->21
: 
"<sus>"
    "son" Pron Sem/Hum Pers Sg3 Loc <W:0.0> @HAB #22->22
: 
"<lea>"
    "leat" <mv> V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @FS-<ADVL #23->23
: 
"<vuoddjilohpi>"
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuoddji" A Sem/Hum Cmp/Attr Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuoddji" A Sem/Hum Cmp/SgNom Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuoddji" N NomAg Sem/Hum Cmp/SgGen Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuoddji" N NomAg Sem/Hum Cmp/SgNom Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuodjat" Ex/V IV Der/NomAg N Cmp/SgGen Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuodjat" Ex/V IV Der/NomAg N Cmp/SgNom Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuodjit" Ex/V TV Der/NomAg N Cmp/SgGen Cmp <W:0.0> #24->24
    "lohpi" N <TH-Inf> Sem/Time Sg Nom <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<SPRED #24->24
        "vuodjit" Ex/V TV Der/NomAg N Cmp/SgNom Cmp <W:0.0> #24->24
: 
"<su>"
    "son" Pron Sem/Hum Pers Sg3 Acc <W:0.0> @<OBJ #25->25
    "son" Pron Sem/Hum Pers Sg3 Gen <W:0.0> @>N #25->25
: 
"<ruovtturiikas>"
    "riika" N Err/Orth Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo #26->26
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #26->26
typo
    "riika" N Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo &SUGGEST #26->26
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #26->26
ruoktu+N+Cmp/SgGen+Cmp#riika+N+Sg+Loc   ruovttoriikkas,ruovttoriikkas,ruovttoriikkas,ruovtturiikkas,ruovtturiikkas,ruovtturiikkas
"<.>"
    "." CLB <W:0.0> <LastCohort> #27->27
:\n

Is that how it is supposed to look? Versions:

$ vislcg3 --version
VISL CG-3 Disambiguator version 1.3.9.13892
Copyright (C) 2007-2021 GrammarSoft ApS. Licensed under GPLv3+
$ divvun-suggest --version
divvun-suggest - Divvun gramcheck version 0.3.10

but libdivvun is from today's git head.

snomos commented 1 year ago

That looks right to me 👍

snomos commented 1 year ago

Hm, I still get the wrong result, with the newest nightly from Tino and the newest code for everything from GiellaLT:

"<su>"
        "son" Pron Sem/Hum Pers Sg3 Acc <W:0.0> @<OBJ #26->26
        "son" Pron Sem/Hum Pers Sg3 Gen <W:0.0> @>N #26->26
: 
"<ruovttu>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #27->27
"<riikas>"
        "riika" N Err/Orth Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo #28->28
typo
        "riika" N Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo &SUGGEST #28->28
riika+N+Sg+Loc  riikkas,riikkas,riikkas
"<.>"
        "." CLB <W:0.0> <LastCohort> #29->29
:\n

flammie commented 1 year ago

$ hfst-tokenise -V
hfst-tokenise 0.1 (hfst 3.16.0)
Copyright (C) 2017 University of Helsinki,
License GPLv3: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ divvun-blanktag -V
divvun-blanktag - Divvun gramcheck version 0.3.10
$ vislcg3 -V
VISL CG-3 Disambiguator version 1.3.9.13892
Copyright (C) 2007-2021 GrammarSoft ApS. Licensed under GPLv3+

So I get the other (correct?) result with this combination of versions.
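
To compare two runs like the ones in this thread, it can help to reduce each output to its list of cohort wordforms and diff those. The following is a minimal Python sketch, not part of any GiellaLT tool: the function names `cohort_wordforms` and `diff_cohorts` are hypothetical, and it assumes the CG stream format shown in the outputs here, where a cohort line is `"<wordform>"` at the start of a line.

```python
import difflib
import re

# Assumption: a cohort line is exactly "<wordform>" starting in column 0,
# as in the pasted outputs in this thread.
COHORT_RE = re.compile(r'^"<(.*)>"\s*$')


def cohort_wordforms(cg_text):
    """Return the wordforms of all cohorts in a CG-format stream, in order."""
    wordforms = []
    for line in cg_text.splitlines():
        match = COHORT_RE.match(line)
        if match:
            wordforms.append(match.group(1))
    return wordforms


def diff_cohorts(output_a, output_b):
    """List cohorts present in only one of the two outputs.

    Entries start with "- " (only in output_a) or "+ " (only in output_b).
    """
    delta = difflib.ndiff(cohort_wordforms(output_a), cohort_wordforms(output_b))
    # Drop unchanged entries and difflib's "? " hint lines.
    return [entry for entry in delta if entry[:1] in "+-"]
```

Saving each person's pipeline output to a file and passing both texts to `diff_cohorts` should show a split compound as two `- ` entries against one `+ ` entry.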

snomos commented 1 year ago

Here are my versions:

hfst-tokenise -V
hfst-tokenise 0.1 (hfst 3.16.0)
Copyright (C) 2017 University of Helsinki,
License GPLv3: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

divvun-blanktag -V
divvun-blanktag - Divvun gramcheck version 0.3.10

vislcg3 -V
VISL CG-3 Disambiguator version 1.3.9.13892
Copyright (C) 2007-2021 GrammarSoft ApS. Licensed under GPLv3+

They appear to be the same as @flammie's (but the version numbers are not particularly detailed, except for CG).
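
Since the version strings alone do not tell the builds apart, it may also be worth recording which binary on `PATH` is actually being run. A rough Python sketch, assuming only that each tool accepts the `-V` flag used above; the helper name `tool_report` is hypothetical:

```python
import shutil
import subprocess


def tool_report(tools, flag="-V"):
    """Map each tool name to its resolved path and first line of version output."""
    report = {}
    for tool in tools:
        path = shutil.which(tool)  # None if the tool is not on PATH
        if path is None:
            report[tool] = {"path": None, "version": None}
            continue
        proc = subprocess.run(
            [path, flag], capture_output=True, text=True, check=False
        )
        # Some tools print version info on stderr, so fall back to it.
        output = (proc.stdout or proc.stderr).strip()
        first_line = output.splitlines()[0] if output else ""
        report[tool] = {"path": path, "version": first_line}
    return report


if __name__ == "__main__":
    for name, info in tool_report(
        ["hfst-tokenise", "divvun-blanktag", "vislcg3"]
    ).items():
        print(name, info["path"], info["version"], sep="\t")
```

Two machines printing the same version line from different paths (for example a nightly package versus a local build) would be a hint that the binaries differ even though the version numbers match.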

snomos commented 1 year ago

For reasons I don't understand, I get this analysis from hfst-tokenise:

echo ruovtturiikas | hfst-tokenise -g tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst  
"<ruovtturiikas>"
    "riika" N Err/Orth Sem/Org Sg Acc PxSg3 <W:0.0> "<riikas>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> "<ruovttu>"
    "riika" N Err/Orth Sem/Org Sg Gen PxSg3 <W:0.0> "<riikas>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> "<ruovttu>"
    "riika" N Err/Orth Sem/Org Sg Loc <W:0.0> "<riikas>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> "<ruovttu>"
:\n

This cohort is split in two by cg-mwesplit because the wordform tags "<riikas>" and "<ruovttu>" are present in the analyses. Without these wordform tags the cohort would not be split. The question is why they show up at all.
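
A quick way to check whether a given hfst-tokenise build emits such wordform tags is to scan its output for readings that end in a quoted `"<...>"` token. The sketch below is only a diagnostic heuristic in Python, not how cg-mwesplit itself decides to split; the function name is hypothetical, and the sample is copied from the first two readings of the analysis above.

```python
import re

# A cohort line is "<wordform>" in column 0; readings are indented below it.
COHORT_RE = re.compile(r'^"<(.*)>"\s*$')
# Assumption: a wordform tag is a quoted "<...>" token ending a reading line.
WORDFORM_TAG_RE = re.compile(r'\s"<([^"]*)>"\s*$')

SAMPLE = '''
"<ruovtturiikas>"
    "riika" N Err/Orth Sem/Org Sg Acc PxSg3 <W:0.0> "<riikas>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> "<ruovttu>"
    "riika" N Err/Orth Sem/Org Sg Gen PxSg3 <W:0.0> "<riikas>"
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> "<ruovttu>"
'''


def readings_with_wordform_tags(cg_text):
    """Map each cohort wordform to the wordform tags found on its readings."""
    found = {}
    current = None
    for line in cg_text.splitlines():
        cohort = COHORT_RE.match(line)
        if cohort:
            current = cohort.group(1)
            found[current] = []
            continue
        # Only indented lines under a cohort are readings or subreadings.
        if current is None or not line[:1].isspace():
            continue
        tag = WORDFORM_TAG_RE.search(line)
        if tag:
            found[current].append(tag.group(1))
    return found


if __name__ == "__main__":
    print(readings_with_wordform_tags(SAMPLE))
    # {'ruovtturiikas': ['riikas', 'ruovttu', 'riikas', 'ruovttu']}
```

An empty list for a cohort means none of its readings carry a wordform tag, which is what the unsplit result would be expected to look like at this stage.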

snomos commented 1 year ago

@flammie do the wordform tags above have anything to do with your changes for TTS?

flammie commented 1 year ago

> @flammie do the wordform tags above have anything to do with your changes for TTS?

Yes, possibly, but that should have been temporary and gone already. I will check, but I need to recompile first.

flammie commented 1 year ago

It should be fine with the newest hfst-tokenise (after https://github.com/hfst/hfst/commit/297246a3f9dd339347732586c06fed048e6c382a ) or the latest stable release.

snomos commented 10 months ago

This appears to be fixed now:

echo Olgoriikkalaš Ruoŧas gii lea gullevaš muhtin EO riikii, dahje Norgii, Islandii dahje \
Liechtensteinii, beassá vuoddjit muohtaskuhteriin Ruoŧas, jus sus lea vuoddjilohpi su ruovtturiikas. \
| ./tools/grammarcheckers/modes/smegramrelease.mode
[...]
: 
"<ruovtturiikas>"
    "riika" N Err/Orth Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo #26->26
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #26->26
typo
    "riika" N Sem/Org Sg Loc <W:0.0> <cohort-with-dynamic-compound> <cohort-with-dynamic-compound> @<ADVL &typo &SUGGEST #26->26
        "ruoktu" N Sem/Plc Cmp/SgGen Cmp <W:0.0> #26->26
ruoktu+N+Cmp/SgGen+Cmp#riika+N+Sg+Loc   ruovttoriikkas,ruovttoriikkas,ruovttoriikkas,ruovtturiikkas,ruovtturiikkas,ruovtturiikkas
"<.>"
    "." CLB <W:0.0> <LastCohort> #27->27