giellalt / bugzilla-dummy

Double space in front of "Eanas oassi" suggest "Eanas oassi" (Bugzilla Bug 2585) #1762

Closed albbas closed 5 years ago

albbas commented 5 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2585

Date: 2019-05-20T15:58:44+02:00
From: Børre Gaup <>
To: Linda Wiechetek <>
CC: linda.wiechetek, sjur.n.moshagen, thomas.omma, trond.trosterud, unhammer+apertium

Last updated: 2019-09-06T20:52:34+02:00

albbas commented 5 years ago

Comment 13427

Date: 2019-05-20 15:58:44 +0200 From: Børre Gaup <>

sme $ echo " olbmui.  Eanas oassi." | divvun-checker -l se -n smegram
{"errs":[["Eanas oassi",10,21,"double-space-before","Leat guokte gaskka ovdal \" oassi\"",["Eanas oassi"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas oassi."}

If oassi is replaced with guossit, or Eanas with Eanáš, the correct suggestion is given:

sme $ echo " olbmui.  Eanas guossit." | divvun-checker -l se -n smegram
{"errs":[[".  Eanas",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanas\"",[". Eanas"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas guossit."}

sme $ echo " olbmui.  Eanáš oassi." | divvun-checker -l se -n smegram
{"errs":[[".  Eanáš",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanáš\"",[". Eanáš"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanáš oassi."}

sme $ echo eanas | husmeNorm
eanas eanas+Adv 0,000000
eanas eanas+Pron+Indef+Sg+Nom 0,000000
eanas eanas+A+Attr 0,000000

sme $ echo eanáš | husmeNorm
eanáš eanášit+V+TV+Imprt+ConNeg 0,000000
eanáš eanášit+V+TV+Imprt+Sg2 0,000000
eanáš eanášit+V+TV+Ind+Prs+ConNeg 0,000000
eanáš eanáš+Adv 0,000000
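
A small sketch of how checker output like the above could be screened for this class of bug, that is, a suggestion identical to the flagged text. It assumes each entry in `errs` has the seven positions seen in the output (form, start, end, error id, message, suggestions, title); those names and the function `noop_suggestions` are illustrative and not part of divvun-checker. The sample strings contain the two spaces after the period that the offsets 7 and 15 imply.

```python
import json

# Sample outputs copied from this report; the double space after "olbmui."
# is an assumption derived from the reported offsets.
BUGGY = r'{"errs":[["Eanas oassi",10,21,"double-space-before","Leat guokte gaskka ovdal \" oassi\"",["Eanas oassi"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas oassi."}'
FIXED = r'{"errs":[[".  Eanas",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanas\"",[". Eanas"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas oassi."}'


def noop_suggestions(checker_output: str):
    """Return (error id, form, start, end) for errors whose suggestions
    include the flagged form itself, i.e. corrections that change nothing."""
    result = json.loads(checker_output)
    flagged = []
    # Field order is assumed from the output shown in this report.
    for form, start, end, err_id, _message, suggestions, _title in result["errs"]:
        if form in suggestions:
            flagged.append((err_id, form, start, end))
    return flagged


if __name__ == "__main__":
    print(noop_suggestions(BUGGY))
    print(noop_suggestions(FIXED))
```

On the first sample this reports the "Eanas oassi" error, since the only suggestion equals the flagged form; on the second it reports nothing.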

albbas commented 5 years ago

Comment 13564

Date: 2019-08-17 20:06:08 +0200 From: Linda Wiechetek <>

What exactly is the problem? I can't see a difference in the headline. Could you check if the problem still exists?

albbas commented 5 years ago

Comment 13575

Date: 2019-08-18 15:20:03 +0200 From: Linda Wiechetek <>

Now I see the problem. I sent you an email about it. The difference between Eanas oassi and Eanáš oassi is that the first one is listed as a one word compound. I'm not sure how that influences the matter.

albbas commented 5 years ago

Comment 13612

Date: 2019-08-20 14:38:08 +0200 From: Sjur Nørstebø Moshagen <>

The underlying problem is that the whitespace analyser is applied directly after morphological analysis and tokenisation, which means that the tag meant to target the two spaces in front of Eanas is added to all readings of the following word. So far so good; that is as it should be.

But when that following word is ambiguous in its tokenisation, as in this case, and resolves to two tokens, the tag is carried along into both of the new cohorts. This leads to the strange situation that the next word, 'oassi', is also tagged as being preceded by two spaces, although that is not the case.

One solution would be to move the whitespace tagging until after MWE disambiguation. The problem with that is that we would then lose the information from the whitespace tagger that could be useful when disambiguating ambiguous tokenisations. But maybe we don't use that information at all.

Linda, Kevin - other ideas? Comments?

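The ordering problem described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the actual whitespace analyser or cg-mwesplit: the `Cohort` class, the tag label `double-space-before`, and both functions are hypothetical, and the split here simply breaks on spaces, whereas the real split depends on how the ambiguous tokenisation is resolved.

```python
from dataclasses import dataclass, field

TAG = "double-space-before"  # hypothetical label for the error tag


@dataclass
class Cohort:
    wordform: str
    space_before: str  # whitespace that preceded this token in the input
    tags: set = field(default_factory=set)


def tag_whitespace(cohorts):
    """Add TAG to every cohort that is preceded by exactly two spaces."""
    for cohort in cohorts:
        if cohort.space_before == "  ":
            cohort.tags.add(TAG)
    return cohorts


def split_mwe(cohorts):
    """Split multi-word cohorts into one cohort per word.
    Existing tags are copied to every part, which is what drags the tag along."""
    out = []
    for cohort in cohorts:
        for i, part in enumerate(cohort.wordform.split(" ")):
            space = cohort.space_before if i == 0 else " "
            out.append(Cohort(part, space, set(cohort.tags)))
    return out


def tagged_words(cohorts):
    return [c.wordform for c in cohorts if TAG in c.tags]


if __name__ == "__main__":
    # Tagging before the split: both parts inherit the tag.
    print(tagged_words(split_mwe(tag_whitespace([Cohort("Eanas oassi", "  ")]))))
    # Tagging after the split: only the word actually preceded by two spaces.
    print(tagged_words(tag_whitespace(split_mwe([Cohort("Eanas oassi", "  ")]))))
```

In this model, tagging before the split marks both Eanas and oassi, while tagging after the split marks only Eanas.
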
albbas commented 5 years ago

Comment 13618

Date: 2019-08-21 09:43:51 +0200 From: Kevin Brubeck Unhammer <<unhammer+apertium>>

The whitespace analyser can emit a number of tags, and of these I see only one in use, in a single rule:

SELECT:before-paragraph ("." CLB) IF (1*> (>>>) BARRIER (>>>) LINK 1 ); ## Dat lea eanet go 10. Dat lei boahtán.

Is it possible to disambiguate «10.» here without referring to that tag?

Alternatively, it is no problem to have *two* whitespace analysers running: one that adds the more «informative» tags (and runs before mwe-dis.cg3), and one that adds the error tags (after cg-mwesplit).

albbas commented 5 years ago

Comment 13638

Date: 2019-08-29 00:08:11 +0200 From: Linda Wiechetek <>

Well, right now we get it wrong anyway. I tested "Dat lea eanet go 10. Dat lea eanet go 10. olbmui."

and we get:

""
    "go" CS @CVP SELECT:8116:r1180 MAP:12871:r10 SELECT:13056:r1461
;   "go" CS @CNP SELECT:8116:r1180 MAP:12871:r10 SELECT:13056:r1461
;   "go" Pcle Qst SELECT:8116:r1180
:
"<10.>"
    "10" A Arab Ord Attr @>N MAP:21848:r86
;   "." CLB "<.>"
;       "10" Num Sem/ID "<10>" REMOVE:2689:longest-match
;   "." CLB "<.>"
;       "10" Num Arab Sg Nom "<10>" REMOVE:2689:longest-match
;   "." CLB "<.>"
;       "10" Num Arab Sg Loc Attr "<10>" REMOVE:2689:longest-match
;   "." CLB "<.>"
;       "10" Num Arab Sg Ill Attr "<10>" REMOVE:2689:longest-match
;   "." CLB "<.>"
;       "10" Num Arab Sg Gen "<10>" REMOVE:2689:longest-match
;   "." CLB "<.>"
;       "10" Num Arab Sg Acc "<10>" REMOVE:2689:longest-match
:
""
    "dat" Pron Dem Sg Nom @SUBJ> SELECT:17765:r2334 MAP:23324
;   "dat" Pcle SELECT:17765:r2334
;   "dat" Pron Dem Pl Nom REMOVE:13658:r1619

"It is more than 10." should give us "." CLB. I'll have a look at a possible rule.

albbas commented 5 years ago

Comment 13639

Date: 2019-08-29 00:11:28 +0200 From: Linda Wiechetek <>

Ahh, the line break... So yes, we do manage to disambiguate in this sentence without referring to that tag, but I don't know to what extent we need to generalise.

albbas commented 5 years ago

Comment 13654

Date: 2019-09-02 14:16:37 +0200 From: Sjur Nørstebø Moshagen <>

I am moving the whitespace analyser further down the pipeline. Linda's rule no longer depends on this tag.

albbas commented 5 years ago

Comment 13662

Date: 2019-09-06 20:52:34 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to Kevin Brubeck Unhammer from comment #4)

Alternatively, it is no problem to have two whitespace analysers running: one that adds the more «informative» tags (and runs before mwe-dis.cg3), and one that adds the error tags (after cg-mwesplit).

I chose to do it this way, and now things work as they should:

$ echo " olbmui.  Eanas oassi." | divvun-checker -a se.zcheck | jq .
{
  "errs": [
    [
      ".  Eanas",
      7,
      15,
      "double-space-before",
      "Leat guokte gaskka ovdal \"Eanas\"",
      [
        ". Eanas"
      ],
      "Sátnegaskameattáhusat"
    ]
  ],
  "text": " olbmui.  Eanas oassi."
}

I am closing the bug report.
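
To make the difference between the original and the fixed output concrete, here is a minimal sketch that applies the first suggestion of an error to the checked text. It assumes the start and end offsets can be used directly as Python string indices, which holds for the samples in this report but is not confirmed here as a general property of divvun-checker; `apply_first_suggestion` is a hypothetical helper, and the sample text assumes the double space implied by the offsets.

```python
TEXT = " olbmui.  Eanas oassi."  # double space after the period is assumed

# Error entries as reported before and after the fix.
ORIGINAL_ERR = ["Eanas oassi", 10, 21, "double-space-before",
                'Leat guokte gaskka ovdal " oassi"',
                ["Eanas oassi"], "Sátnegaskameattáhusat"]
FIXED_ERR = [".  Eanas", 7, 15, "double-space-before",
             'Leat guokte gaskka ovdal "Eanas"',
             [". Eanas"], "Sátnegaskameattáhusat"]


def apply_first_suggestion(text: str, err: list) -> str:
    """Replace text[start:end] with the first suggestion of one error entry."""
    _form, start, end, _err_id, _message, suggestions, _title = err
    if not suggestions:
        return text
    return text[:start] + suggestions[0] + text[end:]


if __name__ == "__main__":
    # The original error leaves the text unchanged; the fixed one removes a space.
    print(repr(apply_first_suggestion(TEXT, ORIGINAL_ERR)))
    print(repr(apply_first_suggestion(TEXT, FIXED_ERR)))
```

Applying the originally reported error leaves the double space in place, while the error reported after the fix yields a single space between the period and Eanas.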