giellalt / bugzilla-dummy


ö instead of š in analyzed text on xserve (Bugzilla Bug 1444) #135

Closed albbas closed 11 years ago

albbas commented 12 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1444

Date: 2012-09-28T08:49:21+02:00
From: Linda Wiechetek <>
To: Børre Gaup <>
CC: ciprian.gerstenberger, lene.antonsen, sjur.n.moshagen, trond.trosterud

Last updated: 2013-05-06T10:33:40+02:00

albbas commented 12 years ago

Comment 6950

Date: 2012-09-28 08:49:21 +0200 From: Linda Wiechetek <>

When going through the syntactically analyzed corpus from 1.6.2012, I came across several instances of the following:

"<juoidá>"
    "juoga" Pron Indef Sg Acc @<OBJ
""
    "mii" Pron Rel Sg Acc @OBJ>
""
    "leat" V IV Ind Prs Sg3 @+FMAINV
"<vejolaö>"
    "vejolaö" ? @X
"<gohčodit>"
    "gohčodit" V TV Inf @-FMAINV
    "gohčodit" V TV Ind Prs Pl3 @+FMAINV
"<guvllolaö>"
    "guvllolaö" ? @X
"<álbmotriekteilmman>"
    "álbmotriekteilmman" ? @X
""
    "dihto" A Attr @>N
""
    "suorgi" N Pl Loc @<ADVL

albbas commented 11 years ago

Comment 7527

Date: 2012-12-11 18:31:17 +0100 From: Sjur Nørstebø Moshagen <>

This is a bug, not an enhancement. Børre, could you try to have a look at this in between other tasks?

albbas commented 11 years ago

Comment 7528

Date: 2012-12-11 18:32:08 +0100 From: Trond Trosterud <>

It is still with us:

analysed$ grep vejolaö 2012-06-01/sme.txt|wc -l
667
analysed$ grep vejolaö 2012-11-30/sme.txt|wc -l
657
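The same comparison can be run over every snapshot at once. The following is only a sketch: the function name `count_by_snapshot` is hypothetical, and it assumes that each dated directory holds a file called `sme.txt`, as in the two commands above.

```shell
# Hypothetical helper, not part of the corpus tools.
# Prints "<snapshot> <matching lines>" for each dated directory below
# the current one, assuming each contains a file named sme.txt.
count_by_snapshot() {
    pattern=$1
    for dir in 20??-??-??/; do
        file="${dir}sme.txt"
        # Skip directories without the file (and an unexpanded glob).
        [ -f "$file" ] || continue
        # grep -c counts matching lines, like `grep ... | wc -l`.
        printf '%s %s\n' "${dir%/}" "$(grep -c -- "$pattern" "$file")"
    done
}
```

Run from the `analysed` directory as `count_by_snapshot vejolaö`; a falling count from one snapshot to the next would show whether the problem is receding.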

albbas commented 11 years ago

Comment 7529

Date: 2012-12-11 18:47:47 +0100 From: Trond Trosterud <>

... but it is restricted to the divvun server, where it is very common :-(

grep '[ aeoiu]ö' 2012-11-30/sme*ccat.txt|wc -l
1102

Strangely enough, the number of hits increases about twelvefold when we do a dependency analysis :-/

grep '[ aeoiu]ö' 2012-11-30/sme*.dep.txt|wc -l
12948

As already mentioned, it is found on the divvun server, not outside of it:

divvun:

analysed$grep '[aeoiu ]ö' 2012-01-02/sme*.txt|kwic-snt 'ö'

alaö sámekonvenöuvdna Suoma-Norgga-Ruoŧa-Sámi áööedovdi joavkku álgohápmi Geigej alaö sámekonvenöuvdna Suoma-Norgga-Ruoŧa-Sámi áööedovdi joavkku álgohápmi Nammad enöuvdnamearrádusaid ekonomalaö váikkuhusaid. Áööedovdijoavkku lea ofelaötán dat ijoavku eaktuda ahte konvenöuvnna álgohámi ja áööedovdijoavkku árvalussii gullev rraláganat luonddu dáfus ja leat siskkáldasat áööedovdijoavkkus leamaö dárkilis ja artihkkal 42 Boazodoallu sámi ealáhussan. Áööedovdijoavku eaktuda ahte konve de oppalaö hápmái stuorra sárgosiid dáfus, de áööedovdijoavku lea gávnnahan vejoš

freecorpus on my mac:

ccat -r admin/ | grep " Suoma-Norgga-Ruoŧa-Sámi á"
Henriksen, Scheinin, Åhrén: Sámi álbmoga iešmearrideami vuoigatvuohta, s. 346-347, i Davviriikkalaš sámekonvenšuvdna: Suoma-Norgga-Ruoŧa-Sámi áššidovdi joavkku álgohápmi, geigejuvvon golggotmánu 26. b. 2005. Oslo 2005 ¶
Davviriikkalaš sámekonvenšuvdna, s. 137. Suoma-Norgga-Ruoŧa-Sámi áššedovdi joavkku álgohápmi. Geigejuvvui golggotmánu 26. b. 2005. ¶

The net sum of this is that we have an unreliable syntax testbed due to an error we do not understand.
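Part of the gap between the ccat count and the dependency count may come from the output format rather than from the data, although nothing in this report confirms that. `grep ... | wc -l` counts matching lines: a line of running text counts once however many damaged words it contains, while the analyser output quoted in this report puts each word form and each reading on its own line. The sketch below counts occurrences and distinct affected word forms instead. The function names are hypothetical, and `grep -o` is assumed to be available (it is in both GNU and BSD grep).

```shell
# Hypothetical helpers; the pattern is the one used in the commands above.

# Count every occurrence of the pattern, not just the matching lines.
count_occurrences() {
    cat -- "$@" | grep -o '[ aeoiu]ö' | wc -l | tr -d ' '
}

# Count the distinct word forms ("<...>" lines in the dependency output)
# that contain the pattern.
count_affected_wordforms() {
    cat -- "$@" | grep -o '"<[^>]*[ aeoiu]ö[^>]*>"' | sort -u | wc -l | tr -d ' '
}
```

Comparing `count_occurrences` on the ccat files with `count_affected_wordforms` on the dependency files gives figures that do not depend on how many lines each format spends per word.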

albbas commented 11 years ago

Comment 8245

Date: 2013-05-06 10:33:40 +0200 From: Børre Gaup <>

In the most recent analysed directory, 2013-04-11, grep '[ aeoiu]ö' sme*.dep|wc -l gives 423 hits.

These are the valid hits:

sme-nob-admin.dep:"<fltnodatekonomalaööat>"
sme-nob-admin.dep: "fltnodatekonomalaööat" ? @X #11->11
sme-nob-admin.dep:"<muorjeöoaggin>"
sme-nob-admin.dep: "muorjeöoaggin" ? @X #4->4
sme-nob-admin.dep:"<Muorraöuollan>"
sme-nob-admin.dep: "Muorraöuollan" ? @X #1->1
sme-nob-admin.dep:"<guolleöoliiguin>"
sme-nob-admin.dep: "guolleöoliiguin" ? @X #9->9

The rest are either proper nouns like Päiviö or South Sámi text.

grep vejolaö *.ccat gives zero hits.
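Sorting 423 hits by hand is tedious, so one option is to list only the unanalysed readings (`? @X`, as in the valid hits above) and group them by form. This is a sketch with a hypothetical function name. It does not separate damaged words from names or South Sámi words that the analyser also fails to recognise, so the resulting list still needs manual review.

```shell
# Hypothetical helper: list unanalysed readings ("? @X") whose form
# contains the pattern, most frequent first, as "<count> <form>".
list_unknown_forms() {
    cat -- "$@" |
        grep '[ aeoiu]ö' |
        grep -F '" ? @X' |
        sed 's/^[^"]*"\([^"]*\)".*$/\1/' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

For example, `list_unknown_forms sme*.dep` in the analysed directory prints one line per distinct unknown form.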