Date: 2013-12-10 16:20:56 +0100
From: Linda Wiechetek
I noticed two things in the Korp output that we might want to change:
Muhto mánná dál bassojuvvo , dego ovdalis lea jo muitaluvvon , ja de gissojuvvo ADV liinniid ja bohcconáhkiid sisa ja biddjojuvvo ADV gietkama sisa . (Muitalus Sámiid birra)
underscores in MWE "dan_botta_go", "dassážii_go"
...vahkkui ja herskostallen su valljugas gorudiin suollemas jurdagiinnán dan_botta_go rogganmašiidnavuoddji bogai biergasiinnis su sisa .
Go mánná lea dearvan riegádan , de giessaluvvo ruksesmiesenáhki sisa dassážii_go ožžot čázi liegganit . (Muitalus Sámiid birra)
Are there any reasons for leaving the underscores?
Date: 2013-12-10 16:25:52 +0100
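From: Trond Trosterud
The underscores were there to avoid spaces. So in order to change them, we need two types of arguments:
- arguments for changing, including telling what they should be changed into.
- arguments showing that the change will not give rise to problems.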
Date: 2013-12-11 00:00:47 +0100
From: Ciprian Gerstenberger
(In reply to comment #1)
- arguments for changing, including telling what they should be changed into.
Changing them into normal whitespace is OK in the word string, but not in other positional attributes.
I just tried with a new corpus compilation:
Dát gusto maiddái Fylkkamánnái , dan botta go botta go lágain , 5 botta go botta go son lei oađđimin , Ipmil válddii ovtta su Ieš son čuoččui sin luhtte muora vuolde dan botta go botta go sii boradedje .
- arguments showing that the change will not give rise to problems.
Perhaps there will be problems with strings that originally contained underscores, i.e., we might replace too many underscores. This is similar to "Ein Poet liebt Olivenöl." ==> "Ein Poet liebt Olivenoel." ==> "Ein Pöt liebt Olivenöl."
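The risk of replacing too many underscores can be illustrated with a minimal Python sketch. The function names, the `KNOWN_MWES` set and the sample token `file_name` are hypothetical and not part of the Korp or CWB code; the sketch only contrasts a blanket replacement with one restricted to known multiword expressions.

```python
# Hypothetical list of multiword expressions (MWEs); in the real pipeline
# this information would have to come from the analyser, not a fixed set.
KNOWN_MWES = {"dan_botta_go", "dassážii_go"}


def replace_all(token: str) -> str:
    """Blanket replacement: turn every underscore into a space."""
    return token.replace("_", " ")


def replace_known(token: str) -> str:
    """Replace underscores only in tokens known to be MWEs."""
    return token.replace("_", " ") if token in KNOWN_MWES else token


# "file_name" stands for a token whose underscore was in the original text.
tokens = ["dan_botta_go", "file_name"]
print([replace_all(t) for t in tokens])    # the genuine underscore is lost
print([replace_known(t) for t in tokens])  # the genuine underscore survives
```

Restricting the replacement to tokens that are known MWEs avoids touching underscores that were present in the source text, at the cost of needing a reliable way to identify the MWEs.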
Date: 2013-12-11 08:25:19 +0100
From: Trond Trosterud
The underscores are added in lookup2cg. We could use NBSP instead, but it is error-prone. Here is where they get introduced:
tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess
man nu ja nu
tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt
man nu ja nu
tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt|usme
man nu mii nu+MWE+Pron+Indef+Sg+Gen
ja ja+CC
nu nu+Adv
tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt|usme|lookup2cg
"
In sme-dis.rle we refer to two entries with "_", one of them twice:
tf-hsl-m0016:sme ttr000$ cat src/sme-dis.rle|cut -d"#" -f1|uniq|grep ''|cut -d" " -f2-|grep ''|sort:
GRADE-ADV = (..) "measta" "menddo" "muhtun_muddui"
SEAMMAX = "seamma_ládje" "seamma_láhkái" ;
I do not know whether vislcg3 handles spaces within quotation marks.
We could remove it in the presentation, but if we are to link it to analysis, we still need to identify dan_botta_go as one unit.
Date: 2013-12-11 08:35:47 +0100
From: Ciprian Gerstenberger
(In reply to comment #3)
We could remove it in the presentation, but if we are to link it to analysis, we still need to identify dan_botta_go as one unit.
This is no problem, I guess. CWB allows whitespace only in the word string, so the very latest compilation of the data looks like this:
<w word="dan botta go" lemma="dan_botta_go" pos="CS"
Ergo: we keep the information about the MWE in the lemma.
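A minimal Python sketch of this split, under the assumption that each token is available as a (token, lemma, pos) triple; the function name `to_attributes` and the dictionary layout are hypothetical and do not reflect the actual corpus compilation scripts:

```python
def to_attributes(token: str, lemma: str, pos: str) -> dict:
    """Build the attributes for one token.

    The word string gets ordinary spaces for display, while the lemma
    keeps its underscores so the MWE can still be identified as one unit.
    """
    return {
        "word": token.replace("_", " "),
        "lemma": lemma,
        "pos": pos,
    }


attrs = to_attributes("dan_botta_go", "dan_botta_go", "CS")
# The underscore in the lemma still marks the token as a multiword unit.
is_mwe = "_" in attrs["lemma"]
print(attrs["word"], attrs["lemma"], is_mwe)
```

With this arrangement the presentation layer only ever reads the word string, and anything that needs to link back to the analysis reads the lemma.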
Date: 2013-12-12 12:14:41 +0100
From: Ciprian Gerstenberger
I have now made the change and recompiled the corpus. There is, however, a problem with the dependency tree: it does not show up when clicking on the button.
The Muitalus corpus will be corrected soon, too.
Date: 2014-04-25 20:34:56 +0200
From: Ciprian Gerstenberger
The bug is gone. Bug closed.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1750
Date: 2013-12-10T16:20:56+01:00
From: Linda Wiechetek
To: Ciprian Gerstenberger
CC: lene.antonsen, sjur.n.moshagen, trond.trosterud
Last updated: 2014-04-25T20:34:56+02:00