giellalt / lang-smj

Finite state and Constraint Grammar based analysers and proofing tools + language resources for Lule Sámi
https://giellalt.uit.no
GNU General Public License v3.0

North Sámi test data in SMJ #103

Open ilm024 opened 1 month ago

ilm024 commented 1 month ago

SMJ make check is failing:

FAIL: accept-all-lemmas.sh
============================================================================
Testsuite summary for Giella smj 0.2.0
============================================================================
# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See tools/spellcheckers/test/fstbased/desktop/hfst/test-suite.log
Please report to feedback@divvun.no

It seems like there is North Sámi test data in SMJ:

=====================================================================================
   Giella smj 0.2.0: tools/spellcheckers/test/fstbased/desktop/hfst/test-suite.log
=====================================================================================

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: accept-all-lemmas.sh
==========================

"Áváhårsa" is NOT in the lexicon:
"Helmuk" is NOT in the lexicon:
"Kr.å" is NOT in the lexicon:
"Kr.m" is NOT in the lexicon:
"Mančuria" is NOT in the lexicon:
"MuVá" is NOT in the lexicon:
"Tearbmasymposia" is NOT in the lexicon:
"Vuottnánáhpe" is NOT in the lexicon:
"áhpeguollebivdár" is NOT in the lexicon:
"álggididdje" is NOT in the lexicon:
see rejected_lemmas.txt for more
FAIL accept-all-lemmas.sh (exit status: 1)

flammie commented 1 month ago

ok, so the words are:

"áhpeguollebivdár" is NOT in the lexicon:
"álggididdje" is NOT in the lexicon:
"almasjlasj" is NOT in the lexicon:
"Áváhårsa" is NOT in the lexicon:
"avtl." is NOT in the lexicon:
"bba" is NOT in the lexicon:
"boajto" is NOT in the lexicon:
"buojk" is NOT in the lexicon:
"Četčenia" is NOT in the lexicon:
"dárrolasj" is NOT in the lexicon:
"do" is NOT in the lexicon:
"dub" is NOT in the lexicon:
"dus" is NOT in the lexicon:
"ednamlasj" is NOT in the lexicon:
"fárrolasj" is NOT in the lexicon:
"fylkkasuohkanlasj" is NOT in the lexicon:
"færtguhti" is NOT in the lexicon:
"gájkkasasj" is NOT in the lexicon:
"gáktse" is NOT in the lexicon:
"gávo" is NOT in the lexicon:
"goabbák guojmme" is NOT in the lexicon:
"guhtik guojmme" is NOT in the lexicon:
"guoktajuodevidálågåk" is NOT in the lexicon:
"guoktajuohtevidálågåk" is NOT in the lexicon:
"guollebivdár" is NOT in the lexicon:
"gånågislasj" is NOT in the lexicon:
"háldaduslasj" is NOT in the lexicon:
"Helmuk" is NOT in the lexicon:
"huom" is NOT in the lexicon:
"iesjguhti" is NOT in the lexicon:
"jav" is NOT in the lexicon:
"j.d" is NOT in the lexicon:
"jd" is NOT in the lexicon:
"jdd" is NOT in the lexicon:
"j.d.s" is NOT in the lexicon:
"j.e" is NOT in the lexicon:
"je" is NOT in the lexicon:
"jed" is NOT in the lexicon:
"j.i" is NOT in the lexicon:
"j.n.v" is NOT in the lexicon:
"j.s" is NOT in the lexicon:
"jsg." is NOT in the lexicon:
"Kr.m" is NOT in the lexicon:
"Kr.å" is NOT in the lexicon:
"labun" is NOT in the lexicon:
"lájbbár" is NOT in the lexicon:
"låbdun" is NOT in the lexicon:
"lågenan" is NOT in the lexicon:
"lågenanvuostas" is NOT in the lexicon:
"låptun" is NOT in the lexicon:
"Mančuria" is NOT in the lexicon:
"materiáladahtes" is NOT in the lexicon:
"miljo" is NOT in the lexicon:
"MuVá" is NOT in the lexicon:
"måjo" is NOT in the lexicon:
"måtso" is NOT in the lexicon:
"niellje" is NOT in the lexicon:
"nubbe nubbe" is NOT in the lexicon:
"sadj" is NOT in the lexicon:
"sahtemus" is NOT in the lexicon:
"sebrudaklasj" is NOT in the lexicon:
"su" is NOT in the lexicon:
"suohkanlasj" is NOT in the lexicon:
"såbadimahtes" is NOT in the lexicon:
"Tearbmasymposia" is NOT in the lexicon:
"tjábbámus" is NOT in the lexicon:
"ulmusjlasj" is NOT in the lexicon:
"Vuottnánáhpe" is NOT in the lexicon:
"ålleslasj" is NOT in the lexicon:
"åss" is NOT in the lexicon:

this is the distribution in lexc files:

$ for w in $(cat tools/spellcheckers/test/fstbased/desktop/hfst/rejected_lemmas.txt | sed -e 's/^"//' -e 's/" is NOT.*//') ; do egrep "^$w\+" src/fst/morphology/stems/*; done
src/fst/morphology/stems/nouns.lexc:áhpeguollebivdár+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:áhpe#guolle#bivdár GAHPER ;
src/fst/morphology/stems/nouns.lexc:álggididdje+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+CmpN/SgNomLeft+CmpN/SgGenLeft+CmpN/PlGenLeft+Sem/Hum:álggididdje ACTOR ; !No verb "álggidit", so álggididdje isn't possible
src/fst/morphology/stems/adjectives.lexc:almasjlasj+A+Err/Der+CmpN/SgN+CmpN/PlG:almasjl DÁRBULASJ ;
src/fst/morphology/stems/smj-propernouns.lexc:Áváhårsa+Use/-Spell:Ává^hårsa MARJA-plc ; !
src/fst/morphology/stems/smj-abbreviations.lexc:avtl.+N:avtalåhko ab-dot-noun-itrab ;
src/fst/morphology/stems/smj-abbreviations.lexc:bba+N:bårråmbassti ab-dot-noun-itrab ;!bårråmbassti
src/fst/morphology/stems/adjectives.lexc:boajto+A:boajto VINJO- ;
src/fst/morphology/stems/smj-abbreviations.lexc:buojk+Adv:buojk ab-dot-adv-trab ; ! buojkulvis/vissan
src/fst/morphology/stems/smj-propernouns.lexc:Četčenia+Use/-Spell+OLang/SME:Četčenia ACCRA-plc ;
src/fst/morphology/stems/nouns.lexc:dárrolasj+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:dárrol BERULASJ ; !should be dárulasj? It's more ok with dárrolasj than dárrulasj
src/fst/morphology/stems/smj-abbreviations.lexc:do+N:do ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:dub+Adv:dub ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:dus+Adv:dus ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/adjectives.lexc:ednamlasj+A+Err/Der+CmpN/SgN+CmpN/PlG:ednamladtj ÅLLAGASJ ;
src/fst/morphology/stems/adjectives.lexc:ednamlasj+A+Err/Der:ednamladtj ÅLLAGASJ ; !ulikestavleses subtsantiv får ikke -lasj derivasjon
src/fst/morphology/stems/nouns.lexc:fárrolasj+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:fárrol BERULASJ ; !should be fárulasj, but more ok with fárrolasj than fárrulasj
src/fst/morphology/stems/adjectives.lexc:fylkkasuohkanlasj+A+Err/Der:fylkka#suohkanl METÅVDÅLASJ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef:færtge%> guhtikobl ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Nom+Foc/Pos-k:fært#guhti%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Nom+Foc/Neg-k:fært#guhti%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ine+Foc/Pos-k:fært#gænºna%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ine+Foc/Neg-k:fært#gænºna%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ine+Foc/Pos-k+Use/NG:fært#gænºna%>nik # ; ! 
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ine+Foc/Neg-k+Use/NG:fært#gænºna%>nik # ; ! 
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ela+Foc/Pos-k:fært#gæssta%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ela+Foc/Neg-k:fært#gæssta%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ela+Foc/Pos-k+Use/NG:fært#gæssta%>stik # ;   ! 
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Sg+Ela+Foc/Neg-k+Use/NG:fært#gæssta%>stik # ;   ! 
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Pl+Nom+Foc/Pos-k:fært#gudi%>k # ;
src/fst/morphology/stems/pronouns.lexc:færtguhti+Pron+Indef+Pl+Nom+Foc/Neg-k:fært#gudi%>k # ;
src/fst/morphology/stems/pronouns.lexc:gájkkasasj+Pron+Indef+Err/Orth:gájkkasa juohkkahasjcase ; !
src/fst/morphology/stems/numerals.lexc:gáktse+Err/Orth+Use/-Spell+Use/Marg+Use/NG:gáktse# NLX ; !Err/Orth?
src/fst/morphology/stems/adjectives.lexc:gávo+A:gávo VINJO- ;
src/fst/morphology/stems/nouns.lexc:guojmme+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:guojmme MUORRA ; ! 
src/fst/morphology/stems/nouns.lexc:guojmme+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:guojmme MUORRA ; ! 
src/fst/morphology/stems/numerals.lexc:guoktajuodevidálågåk+Num:guok#tjuode#vidá#lågåg9 VUOSTASJ ;
src/fst/morphology/stems/numerals.lexc:guoktajuohtevidálågåk+Num:guok#tjuohte#vidá#lågåg9 VUOSTASJ ;
src/fst/morphology/stems/nouns.lexc:guollebivdár+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:guolle#bivdár GAHPER ; ! used derivation for contraced verb, when verb is even, bivddet-bivdde
src/fst/morphology/stems/adjectives.lexc:gånågislasj+A+Err/Der:gånågisladtj ÅLLAGASJ ;
src/fst/morphology/stems/adjectives.lexc:háldaduslasj+A+Err/Der:háldadusladtj ÅLLAGASJ ; !feil
src/fst/morphology/stems/smj-propernouns.lexc:Helmuk+Use/-Spell:Helmug9 LONDON-plc ; !
src/fst/morphology/stems/smj-abbreviations.lexc:huom+Adv:huom ab-dot-adv-trab ; !huomaha
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef:iesj#ge%> guhtikobl ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Pl+Nom+Foc/Neg-k:iesj#gudi%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Pl+Nom+Foc/Pos-k:iesj#gudi%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ela+Foc/Neg-k+Use/NG:iesj#gæstá%>stik # ;  ! 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ela+Foc/Pos-k+Use/NG:iesj#gæstá%>stik # ;  ! 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ela+Foc/Neg-k:iesj#gæssta%>k # ; 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ela+Foc/Pos-k:iesj#gæssta%>k # ; 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ine+Foc/Neg-k+Use/NG:iesj#gænºna%>nik # ; ! 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ine+Foc/Pos-k+Use/NG:iesj#gænºna%>nik # ; ! 
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ine+Foc/Neg-k:iesj#gænºna%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Ine+Foc/Pos-k:iesj#gænºna%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Nom+Foc/Neg-k:iesj#guhti%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Sg+Nom+Foc/Pos-k:iesj#guhti%>k # ;
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Attr+Foc/Neg-k:iesj#guhti%>k # ; !OBS
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Attr+Foc/Pos-k:iesj#guhti%>k # ; !OBS
src/fst/morphology/stems/pronouns.lexc:iesjguhti+Pron+Indef+Attr:iesj#guhti%>k # ; ! double, harmonised with sme
src/fst/morphology/stems/smj-abbreviations.lexc:j.d+Adv:j.d ab-dot-adv-itrab ; !ja% dakkára
src/fst/morphology/stems/smj-abbreviations.lexc:jdd+Adv:jdd ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:jed+Adv:jed ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:jd+Adv:jd ab-dot-adv-itrab ;   !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:jdd+Adv:jdd ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/nouns.lexc:judos+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Dummytag:juhtos ÅRES ;
src/fst/morphology/stems/nouns.lexc:jådos+N+Sem/Act:jåhtos ÅRES ;
src/fst/morphology/stems/nouns.lexc:jådås+N+Sem/Dummytag:jåhtås GÁMAS ;
src/fst/morphology/stems/smj-abbreviations.lexc:j.d.s+Adv:j.d.s ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:j.e+Adv:j.e ab-dot-adv-itrab ; !hæ
src/fst/morphology/stems/smj-abbreviations.lexc:je+Adv:je ab-dot-adv-itrab ;   !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:jed+Adv:jed ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:j.i+Adv:j.i ab-dot-adv-itrab ; !ja ienep
src/fst/morphology/stems/smj-abbreviations.lexc:j.n.v+Adv:j.n.v ab-dot-adv-itrab ; !ja nav vijdábun !
src/fst/morphology/stems/smj-abbreviations.lexc:j.s+Adv:j.s ab-dot-adv-itrab ; !hæ?
src/fst/morphology/stems/smj-abbreviations.lexc:jsg.+N:julevsámegiella ab-dot-noun-itrab ;
src/fst/morphology/stems/smj-abbreviations.lexc:Kr.m+Adv+Sem/Time:Kr.m ab-dot-adv-itrab ;
src/fst/morphology/stems/smj-abbreviations.lexc:Kr.å+Adv+Sem/Time:Kr.å ab-dot-adv-itrab ;
src/fst/morphology/stems/nouns.lexc:labun+N+Sem/Dummytag+Err/Der:labun GAHPER ; ! bad der?
src/fst/morphology/stems/nouns.lexc:lájbbár+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:lájbbár GUOLLÁR ; ! !Feil, baker er "lájbbo", her har ordboksforfatterne gjort feil når de har laget en avledning
src/fst/morphology/stems/nouns.lexc:låbdun+N+Err/Der+Sem/Ctain:låbdun GAHPER ; ! contraced stems don't make NomInstr, no verb låbddot
src/fst/morphology/stems/numerals.lexc:lågenanvuostas+v1+A+Ord+Err/Orth:lågenan#vuostas VUOSTASJ ;
src/fst/morphology/stems/nouns.lexc:låptun+N+Err/Der+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Obj:bevkun GAHPER ; !låpptit can't make this derivation
src/fst/morphology/stems/nouns.lexc:låptun+N+Err/Der+Sem/Ctain:låptun GAHPER ; ! contraced stems don't make NomInstr, no verb låpptot
src/fst/morphology/stems/smj-propernouns.lexc:Mančuria+Use/-Spell+OLang/SME:Mančuria ACCRA-plc ;
src/fst/morphology/stems/adjectives.lexc:materiáladahtes+A+Err/Der:materi^álad DIEHTEMAHTES ;
src/fst/morphology/stems/smj-abbreviations.lexc:miljo+N:miljo ab-dot-num ; !millijåvnnå
src/fst/morphology/stems/smj-acronyms.lexc:MuVá+N+Prop+Sem/Org+ACR+Err/Orth:MuVá ACRO_cons ;  !  - propername according to čállinrávvagat
src/fst/morphology/stems/adjectives.lexc:måjo+A+CmpN/SgN+CmpN/PlG:måjo VINJO- ;
src/fst/morphology/stems/adjectives.lexc:måtso+A:måtso VINJO- ;
src/fst/morphology/stems/numerals.lexc:niellje+Err/Orth+Use/-Spell+Use/Marg+Use/NG:niellje# NLX ; !Err/Orth?
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord:nupp nubbecase ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Sg+Nom:nubbe%> K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Ess:nubbe%>n K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Sg+Ill:nubbá%>j K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Cmp/SgGen:nuppe%> NUMERALCOMPOUNDS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Ess:nubbe%>n K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Par:nuppe%>t # ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Sg+Ill:nubbá%>j K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Sg+Nom:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Attr:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr:nupp nubbecase ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Ess:nubbe%>n K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Par:nuppe%>t # ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Sg+Ill:nubbá%>j K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Sg+Nom:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Attr:nuppe K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef:nupp nubbecase ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord:nupp nubbecase ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Sg+Nom:nubbe%> K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Ess:nubbe%>n K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Sg+Ill:nubbá%>j K ;
src/fst/morphology/stems/numerals.lexc:nubbe+A+Ord+Cmp/SgGen:nuppe%> NUMERALCOMPOUNDS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Ess:nubbe%>n K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Par:nuppe%>t # ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Sg+Ill:nubbá%>j K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Sg+Nom:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr+Attr:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Recipr:nupp nubbecase ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Ess:nubbe%>n K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Par:nuppe%>t # ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Sg+Ill:nubbá%>j K-CONS ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Sg+Nom:nubbe%> K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef+Attr:nuppe K-VOW ;
src/fst/morphology/stems/pronouns.lexc:nubbe+Pron+Indef:nupp nubbecase ;
src/fst/morphology/stems/smj-abbreviations.lexc:sadj+A:sadj ab-dot-adj-trab ; !sadjásasj
src/fst/morphology/stems/adjectives.lexc:sahtemus+A+Err/Der:sahte TJAVGGÁMUS ; !must be sademus (generated by sadep)
src/fst/morphology/stems/adjectives.lexc:sebrudaklasj+A+Err/Der:sebrudahkaladtj ÅLLAGASJ ;
src/fst/morphology/stems/adjectives.lexc:sebrudaklasj+A+Err/Der:sebrudakladtj ÅLLAGASJ ;
src/fst/morphology/stems/smj-abbreviations.lexc:su+Adv:su ab-dot-adv-numnoab ; ! La stå ! hæ?
src/fst/morphology/stems/adjectives.lexc:suohkanlasj+A+Err/Der:suohkanl METÅVDÅLASJ;
src/fst/morphology/stems/adjectives.lexc:såbadimahtes+A+Err/Der+CmpN/SgN+CmpN/PlG:såbadim DIEHTEMAHTES ;
src/fst/morphology/stems/smj-propernouns.lexc:Tearbmasymposia+Use/-Spell+OLang/SME:Tearbmasymposia ACCRA-obj ;
src/fst/morphology/stems/adjectives.lexc:tjábbámus+A+Err/Der:tjábbá TJAVGGÁMUS ; !must be tjáppámus (generated by tjáppep)
src/fst/morphology/stems/adjectives.lexc:ulmusjlasj+A+Err/Der+CmpN/SgN+CmpN/PlG:ulmusjl DÁRBULASJ ;
src/fst/morphology/stems/smj-propernouns.lexc:Vuottnánáhpe+Use/-Spell:Vuottnán^áhpe MARJA-plc ; !
src/fst/morphology/stems/adjectives.lexc:ålleslasj+A+Err/Der:ållesl DÁRBULASJ ;
src/fst/morphology/stems/smj-abbreviations.lexc:åss+N:åss    ab-dot-noun-itrab ; !åssudahka
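
A side note on the listing above: the unquoted `$(cat ...)` lets the shell split multi-word entries such as "nubbe nubbe" into separate words, and `egrep` treats the `.` in abbreviations like `j.d` as a wildcard. That seems to be why `nubbe` and `guojmme` entries show up twice and why `judos` and `jådos` are listed. Below is a minimal sketch of a stricter lookup, assuming the same file layout. The function name `lookup_rejected` is made up for illustration, and multi-word lemmas may still find nothing if the lexc source escapes the space.

```sh
# Hypothetical helper: look up each rejected lemma literally in the stem files.
# $1 = rejected_lemmas.txt, remaining arguments = lexc files to search.
lookup_rejected() {
    rejected=$1
    shift
    # Strip the leading quote and the '" is NOT ...' suffix, as in the original command.
    sed -e 's/^"//' -e 's/" is NOT.*//' "$rejected" |
    while IFS= read -r lemma; do
        # index(...) == 1 is a literal match at the start of the line,
        # so "." is not a wildcard and spaces inside a lemma are kept.
        LEMMA="${lemma}+" awk 'index($0, ENVIRON["LEMMA"]) == 1 { print FILENAME ":" $0 }' "$@"
    done
}
```

It could be called as `lookup_rejected tools/spellcheckers/test/fstbased/desktop/hfst/rejected_lemmas.txt src/fst/morphology/stems/*`.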

I'm thinking we can exclude +Err/Der and ab-dot and VINJO- from testing? Then what is left is:

"færtguhti" is NOT in the lexicon:
"goabbák guojmme" is NOT in the lexicon:
"guhtik guojmme" is NOT in the lexicon:
"guoktajuodevidálågåk" is NOT in the lexicon:
"guoktajuohtevidálågåk" is NOT in the lexicon:
"iesjguhti" is NOT in the lexicon:
"lågenan" is NOT in the lexicon:
"nubbe nubbe" is NOT in the lexicon:
snomos commented 1 month ago

> I'm thinking we can exclude +Err/Der and ab-dot and VINJO- from testing?

By default everything containing +Err/ should be removed from testing, so if it is not, that is a bug that needs to be investigated. Could it be that +Err/Der is not defined in root.lexc?

And it makes sense to also exclude VINJO- from testing.
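
One quick way to check the root.lexc question is sketched below. It assumes that tags are declared as whitespace-separated symbols before the first `LEXICON` line and that `!` starts a comment; the helper name `tag_declared` and the path in the usage line are assumptions, not something taken from the repository.

```sh
# Hypothetical helper: succeed if tag $2 is declared in lexc file $1
# before the first LEXICON line (i.e. in the Multichar_Symbols part).
tag_declared() {
    TAG=$2 awk '
        /^LEXICON[ \t]/ { exit }              # declarations end at the first lexicon
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^!/) break          # the rest of the line is a comment
                if ($i == ENVIRON["TAG"]) found = 1
            }
        }
        END { exit !found }                   # exit status 0 only if the tag was seen
    ' "$1"
}
```

Usage could look like `tag_declared src/fst/morphology/root.lexc '+Err/Der' && echo declared || echo missing`.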

ilm024 commented 1 month ago

> "færtguhti" is NOT in the lexicon:
> "goabbák guojmme" is NOT in the lexicon:
> "guhtik guojmme" is NOT in the lexicon:
> "guoktajuodevidálågåk" is NOT in the lexicon:
> "guoktajuohtevidálågåk" is NOT in the lexicon:
> "iesjguhti" is NOT in the lexicon:
> "lågenan" is NOT in the lexicon:
> "nubbe nubbe" is NOT in the lexicon:

"færtguhti" is NOT in the lexicon: > not a word, maybe "færtguhtik"
"guoktajuodevidálågåk" is NOT in the lexicon: > typo in test? "guoktatjuodevidálågåk"
"guoktajuohtevidálågåk" is NOT in the lexicon: > typo in test? "guoktatjuohtevidálågåk"
"iesjguhti" is NOT in the lexicon: > not a word, maybe "iesjguhtik"
"lågenan" is NOT in the lexicon: > not a word, works only as cmp

I don't know what to do with MWE:
"nubbe nubbe" is NOT in the lexicon:
"goabbák guojmme" is NOT in the lexicon:
"guhtik guojmme" is NOT in the lexicon:

flammie commented 1 month ago

> "færtguhti" is NOT in the lexicon: > not a word, maybe "færtguhtik"
> "guoktajuodevidálågåk" is NOT in the lexicon: > typo in test? "guoktatjuodevidálågåk"
> "guoktajuohtevidálågåk" is NOT in the lexicon: > typo in test? "guoktatjuohtevidálågåk"
> "iesjguhti" is NOT in the lexicon: > not a word, maybe "iesjguhtik"

These might be typos in the lexc files? I.e. https://github.com/giellalt/lang-smj/blob/main/src/fst/morphology/stems/pronouns.lexc#L354-L366, https://github.com/giellalt/lang-smj/blob/main/src/fst/morphology/stems/pronouns.lexc#L433-L449 and https://github.com/giellalt/lang-smj/blob/main/src/fst/morphology/stems/numerals.lexc#L643-L650

> "lågenan" is NOT in the lexicon: > not a word, works only as cmp
> I don't know what to do with MWE:
> "nubbe nubbe" is NOT in the lexicon:
> "goabbák guojmme" is NOT in the lexicon:
> "guhtik guojmme" is NOT in the lexicon:

Mm, for now we can manually filter them from testing; I'll use ! NOT-TO-LEMMATEST in lexc comments for this. Naturally, the above-mentioned entries can be excluded with the same method if it's relevant.

flammie commented 1 month ago

> > I'm thinking we can exclude +Err/Der and ab-dot and VINJO- from testing?
>
> By default everything containing +Err/ should be removed from testing, so if it is not, that is a bug that needs to be investigated.

That might be a good option, the current template only has: --exclude "(CmpN/Only|ShCmp|\+Cmp\/SplitR| Rreal | R | Rnoun |\+V\+|NOT-TO-LEMMATEST)" although notably, fixing it in the template won't reach the most developed languages, since this line is modified in all of them.
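
To make the difference concrete, here is a rough sketch that runs the template pattern and a hypothetically extended one over a few entries copied from the grep output earlier in the thread. `grep -Ev` only stands in for whatever the extraction script does internally, the extra alternatives (`\+Err/`, `VINJO-`, `ab-dot`) are just the groups proposed above, and the `NOT-TO-LEMMATEST` comment on the first entry is added here for illustration.

```sh
# Pattern from the current template (slashes left unescaped, which is equivalent in ERE).
template_pattern='(CmpN/Only|ShCmp|\+Cmp/SplitR| Rreal | R | Rnoun |\+V\+|NOT-TO-LEMMATEST)'
# Hypothetical extension covering the three groups proposed in this thread.
extended_pattern='(CmpN/Only|ShCmp|\+Cmp/SplitR| Rreal | R | Rnoun |\+V\+|NOT-TO-LEMMATEST|\+Err/|VINJO-|ab-dot)'

# Sample entries taken from the listing above; the marker comment is illustrative.
sample_entries() {
    cat <<'EOF'
færtguhti+Pron+Indef:færtge%> guhtikobl ; ! NOT-TO-LEMMATEST
almasjlasj+A+Err/Der+CmpN/SgN+CmpN/PlG:almasjl DÁRBULASJ ;
avtl.+N:avtalåhko ab-dot-noun-itrab ;
boajto+A:boajto VINJO- ;
guojmme+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:guojmme MUORRA ;
EOF
}

# Print the lemmas (text before the first "+") that survive an exclusion pattern.
surviving_lemmas() {
    sample_entries | grep -Ev -- "$1" | cut -d+ -f1
}

echo "template: $(surviving_lemmas "$template_pattern" | tr '\n' ' ')"
echo "extended: $(surviving_lemmas "$extended_pattern" | tr '\n' ' ')"
```

With the template pattern only the entry carrying the marker is dropped; with the extended one the +Err/, ab-dot and VINJO- entries go as well.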

snomos commented 1 month ago

I see now that the relevant line in the extract lemma script is not as thorough as it should be regarding noise:

https://github.com/giellalt/giella-core/blob/f73fef326ddd9cabdd9e07f28ba5ddf71ca2d960/scripts/extract-lemmas.sh#L96

This should be fixed.

snomos commented 1 month ago

> This should be fixed.

Done in https://github.com/giellalt/giella-core/commit/3b766fb8d4628188ef30c25d26a9e9cf1e771cb1