giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

compounds that involve numerals/capital letters/proper nouns and are somewhat productive (Bugzilla Bug 2636) #453

Open albbas opened 4 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2636

Date: 2019-11-06T14:14:02+01:00 From: Linda Wiechetek <> To: Tommi A Pirinen <> CC: linda.wiechetek, sjur.n.moshagen, thomas.omma, trond.trosterud, unhammer+apertium

Last updated: 2020-09-22T16:21:55+02:00

albbas commented 4 years ago

Comment 13807

Date: 2019-11-06 14:14:02 +0100 From: Linda Wiechetek <>

Gudnejahtte 80 jahkásačča

It should be 80-jahkásačča, but obviously we cannot list every combination of age and jahkásačča in the lexicon, so this needs to be at least partly dynamic. How should we do that?

albbas commented 4 years ago

Comment 13827

Date: 2019-12-17 12:19:24 +0100 From: Sjur Nørstebø Moshagen <>

It is possible to do this in lexc, but we first have to restrict which numbers can reach the lexicon that builds such compounds. As of today, almost any kind of numeric expression can go there, and that is clearly not good.

albbas commented 4 years ago

Comment 13896

Date: 2020-04-17 16:24:02 +0200 From: Linda Wiechetek <>

With years of age in mind, should it perhaps be 1-200 at most? In the Bible there are quite a few who live to be older than 100. Are you the one doing this in the FST?
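To make the proposed 1-200 cap concrete, here is a minimal sketch in Python rather than lexc. It only illustrates the kind of restriction discussed here; the names `AGE_1_TO_200`, `is_plausible_age` and `age_compound` are hypothetical and are not part of the lang-sme sources.

```python
import re

# Hypothetical pattern: integers 1..200 written without leading zeros.
#   [1-9]\d?  -> 1..99
#   1\d\d     -> 100..199
#   200       -> 200
AGE_1_TO_200 = re.compile(r"(?:[1-9]\d?|1\d\d|200)")


def is_plausible_age(token: str) -> bool:
    """Return True if the token is an integer between 1 and 200."""
    return AGE_1_TO_200.fullmatch(token) is not None


def age_compound(token: str, head: str = "jahkásačča"):
    """Build the hyphenated compound, or return None if the numeral is out of range."""
    if not is_plausible_age(token):
        return None
    return f"{token}-{head}"
```

In an FST the same restriction would be written as a numeral sublexicon or regular expression that feeds the compounding lexicon; the alternation above shows one way to bound the range without listing every number.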

albbas commented 4 years ago

Comment 13982

Date: 2020-09-11 16:54:20 +0200 From: Linda Wiechetek <>

We want to fix this because our tokenization of potential compounds and our compound error detection both depend on it. The current evaluation shows that the missing lexicalization leads to false negatives.

I have changed the assignee to Tommi (is that ok Sjur?) and I have a number of other examples here:

So there are different types:

500 ruvdnosaččain > 500-ruvdnosaččain

P- čuoggá > P-čuoggá

Fuomáš P- čuoggá (2,3) ja Q-čuoggá (3,2) erohusa dán bajit govvosis.

Biret muoŧŧa > Biret-muoŧŧa

and there are more.

So instead of listing all of them, I would like to list them in some kind of regular expression if that's possible.
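As a rough illustration of the three types, the sketch below uses Python regular expressions, not the actual lexc/FST solution. The head stems are taken only from the examples in this thread, and all names (`NUM_HEADS`, `LETTER_HEADS`, `NAME_HEADS`, `suggest_hyphen`) are hypothetical. It assumes the modifier and head are separated by a single space, and the proper-noun pattern will over-match any capitalized word before a listed head (for example at the start of a sentence), which a real analyser would have to rule out.

```python
import re

# Hypothetical head stems, taken only from the examples in this thread.
NUM_HEADS = ("ruvdnosa", "jahkása")   # 500-ruvdnosaččain, 80-jahkásačča
LETTER_HEADS = ("čuoggá",)            # P-čuoggá
NAME_HEADS = ("muoŧŧa",)              # Biret-muoŧŧa

UPPER = "A-ZÁČĐŊŠŦŽ"  # uppercase letters used in Northern Sami


def _alt(stems):
    """Join stems into a regex alternation."""
    return "|".join(re.escape(s) for s in stems)


PATTERNS = [
    # Type 1: numeral + space + head, e.g. "500 ruvdnosaččain"
    re.compile(rf"\b(\d+) ((?:{_alt(NUM_HEADS)})\w*)"),
    # Type 2: single capital letter (optionally with a dangling hyphen) + space + head
    re.compile(rf"\b([{UPPER}])-? ((?:{_alt(LETTER_HEADS)})\w*)"),
    # Type 3: capitalized word + space + head, e.g. "Biret muoŧŧa"
    re.compile(rf"\b([{UPPER}]\w+) ((?:{_alt(NAME_HEADS)})\w*)"),
]


def suggest_hyphen(text: str) -> str:
    """Rewrite 'modifier head' as 'modifier-head' for the three types above."""
    for pattern in PATTERNS:
        text = pattern.sub(r"\1-\2", text)
    return text
```

Calling `suggest_hyphen("Son máksá 500 ruvdnosaččain.")` returns `"Son máksá 500-ruvdnosaččain."`, while `35 ruvdnui` is left untouched because `ruvdnui` is not one of the listed heads.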

albbas commented 4 years ago

Comment 13983

Date: 2020-09-11 16:56:20 +0200 From: Linda Wiechetek <>

Here are some sentences for testing the grammar checker:

6.54 Máret áigu divodit iežas oađđenlanja. Son oastá liimma 120 ruvdnui, guokte málagušta 35 ruvdnui bihttá ja 8 mehtera tapehta mii máksá 18 ruvnno mehteris. Son máksá 500 ruvdnosaččain.

6.32 Gáren fitná málagávppis. Doppe oastá son vihtta málagušta mat mákset 19 ruvnno bihttá, ja ovtta lihttara mála mii máksá 75 ruvnno. Man olu oažžu son ruovttoluotta jus máksá 200 ruvdnosaččain?

a) Soai máksiba 200 ruvdnosaččain. Man olu oažžuba ruovttoluotta?

Fuomáš P- čuoggá (2,3) ja Q-čuoggá (3,2) erohusa dán bajit govvosis.

albbas commented 3 years ago

Comment 14020

Date: 2020-09-22 16:21:55 +0200 From: Linda Wiechetek <>

After Tommi made a testing fst for these, we get the following, which is great (just need to make a mwe-dis rule to disambiguate correctly):

"<500>"
    "500" Num Sem/ID #1->1
    "500" Num Arab Sg Nom @HNOUN MAP:22849:hnounNom #1->1
    "500" Num Arab Sg Loc Attr @HNOUN MAP:22857:hnounAdvl #1->1
    "500" Num Arab Sg Ill Attr @HNOUN MAP:22857:hnounAdvl #1->1
;   "500" Num Arab Sg Gen REMOVE:24547:r3569
;   "500" Num Arab Sg Acc @HNOUN MAP:22857:hnounAdvl REMOVE:24549:r3570
"< ruvdnosaččain>"
    "ruvdnosaš" A Sem/Dummytag Pl Loc Err/Orth @ADVL> SELECT:19906:r2793 MAP:22867:HNOUN<advl #2->2
    "ruvdnosaš" A Sem/Dummytag Pl Loc @ADVL> SELECT:19906:r2793 MAP:22867:HNOUN<advl #2->2
;   "ruvdnosaš" A Sem/Dummytag Sg Com Err/Orth SELECT:19906:r2793
;   "ruvdnosaš" A Sem/Dummytag Sg Com SELECT:19906:r2793
;   "500-ruvdnosaš" A Sem/Dummytag Pl Loc Err/Orth Err/SpaceCmp REMOVE:2525:AttrCmp
;   "500-ruvdnosaš" A Sem/Dummytag Pl Loc Err/SpaceCmp REMOVE:2525:AttrCmp
;   "500-ruvdnosaš" A Sem/Dummytag Sg Com Err/Orth Err/SpaceCmp REMOVE:2525:AttrCmp
;   "500-ruvdnosaš" A Sem/Dummytag Sg Com Err/SpaceCmp REMOVE:2525:AttrCmp
"<.>"
    "." CLB #3->3