giellalt / bugzilla-dummy


aesthetics: underscores and part-of-speech in the Korp search output (Bugzilla Bug 1750) #265

Closed albbas closed 10 years ago

albbas commented 10 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1750

Date: 2013-12-10T16:20:56+01:00 From: Linda Wiechetek <> To: Ciprian Gerstenberger <> CC: lene.antonsen, sjur.n.moshagen, trond.trosterud

Last updated: 2014-04-25T20:34:56+02:00

albbas commented 10 years ago

Comment 8745

Date: 2013-12-10 16:20:56 +0100 From: Linda Wiechetek <>

I noticed two things in the Korp output that we might want to change:

Muhto mánná dál bassojuvvo , dego ovdalis lea jo muitaluvvon , ja de gissojuvvo ADV liinniid ja bohcconáhkiid sisa ja biddjojuvvo ADV gietkama sisa . (Muitalus Sámiid birra)

underscores in MWE "dan_botta_go", "dassážii_go"

...vahkkui ja herskostallen su valljugas gorudiin suollemas jurdagiinnán dan_botta_go rogganmašiidnavuoddji bogai biergasiinnis su sisa .

Go mánná lea dearvan riegádan , de giessaluvvo ruksesmiesenáhki sisa dassážii_go ožžot čázi liegganit . (Muitalus Sámiid birra)

Are there any reasons for leaving the underscores?

albbas commented 10 years ago

Comment 8746

Date: 2013-12-10 16:25:52 +0100 From: Trond Trosterud <>

The underscores are there to avoid spaces. So in order to change this, we need two types of arguments:

  1. arguments for changing, including telling what they should be changed into. NBSP?
  2. arguments showing that the change will not give rise to problems.

albbas commented 10 years ago

Comment 8747

Date: 2013-12-11 00:00:47 +0100 From: Ciprian Gerstenberger <>

(In reply to comment #1)

  1. arguments for changing, including telling what they should be changed into.

Into normal whitespace: that is OK in a word but not in other positional attributes.

I just tried with a new corpus compilation:

Dát gusto maiddái Fylkkamánnái , dan botta go botta go lágain , 5 botta go botta go son lei oađđimin , Ipmil válddii ovtta su Ieš son čuoččui sin luhtte muora vuolde dan botta go botta go sii boradedje .

  2. arguments showing that the change will not give rise to problems.

Perhaps with strings that originally contained underscores, i.e., we might replace too many underscores. This is similar to "Ein Poet liebt Olivenöl." ==> "Ein Poet liebt Olivenoel." ==> "Ein Pöt liebt Olivenöl."
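A minimal Python sketch of that risk, under the assumption that the replacement is a blanket string substitution. The function names and the token "file_name" are made up for illustration; nothing here is taken from the actual corpus pipeline.

```python
def transliterate(text: str) -> str:
    """Forward step: write ö as oe."""
    return text.replace("ö", "oe")


def restore(text: str) -> str:
    """Naive inverse: turn every oe back into ö, including original ones."""
    return text.replace("oe", "ö")


def underscores_to_spaces(token: str) -> str:
    """Blanket replacement that cannot tell MWE underscores from original ones."""
    return token.replace("_", " ")


original = "Ein Poet liebt Olivenöl."
print(restore(transliterate(original)))

# "file_name" is a hypothetical token whose underscore belongs to the source text.
for token in ["dan_botta_go", "file_name"]:
    print(underscores_to_spaces(token))
```

The round trip turns "Poet" into "Pöt" because the inverse step cannot know which "oe" was introduced by the forward step. The same holds for underscores: once "file_name" has become "file name", nothing marks it as different from a genuine MWE.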

albbas commented 10 years ago

Comment 8748

Date: 2013-12-11 08:25:19 +0100 From: Trond Trosterud <>

The underscores are added in lookup2cg. We could use NBSP instead, but it is error-prone. Here is where the underscore gets introduced:

tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess
man nu ja nu

tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt
man nu ja nu

tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt|usme
man nu mii nu+MWE+Pron+Indef+Sg+Gen

ja ja+CC

nu nu+Adv

tf-hsl-m0016:sme ttr000$ echo man nu ja nu | preprocess --abbr=bin/abbr.txt|usme|lookup2cg
"" "mii_nu" MWE Pron Indef Sg Gen
"" "ja" CC
"" "nu" Adv
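To make the last step concrete, here is a rough Python sketch of the transformation visible in the output above: the analysis string is split on "+", and spaces in the lemma become underscores. This is not the lookup2cg source, only an illustration of the observed input and output; the function name is hypothetical.

```python
def analysis_to_cg_reading(analysis: str) -> str:
    """Turn 'mii nu+MWE+Pron' into a CG-style reading '"mii_nu" MWE Pron'.

    Assumption: the lemma is everything before the first '+',
    and the remaining '+'-separated parts are tags.
    """
    lemma, *tags = analysis.split("+")
    lemma = lemma.replace(" ", "_")  # this is where the underscore appears
    return " ".join([f'"{lemma}"', *tags])


for analysis in ["mii nu+MWE+Pron+Indef+Sg+Gen", "ja+CC", "nu+Adv"]:
    print(analysis_to_cg_reading(analysis))
```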

In sme-dis.rle we refer to two entries with "_", one of them twice:

tf-hsl-m0016:sme ttr000$ cat src/sme-dis.rle|cut -d"#" -f1|uniq|grep '_'|cut -d" " -f2-|grep '_'|sort:

GRADE-ADV = (..) "measta" "menddo" "muhtun_muddui"
SEAMMAX = "seamma_ládje" "seamma_láhkái" ;
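The same check could be done in Python. The sketch below pulls out quoted entries that contain an underscore from set definitions like the two above. It assumes entries are plain double-quoted strings without escaped quotes; the function name is made up, and the input lines are copied from the output above.

```python
import re

# A quoted string with at least one underscore and no embedded quotes.
UNDERSCORE_ENTRY = re.compile(r'"([^"]*_[^"]*)"')


def underscore_entries(line: str) -> list[str]:
    """Return the quoted entries on a line that contain an underscore."""
    return UNDERSCORE_ENTRY.findall(line)


lines = [
    'GRADE-ADV = (..) "measta" "menddo" "muhtun_muddui"',
    'SEAMMAX = "seamma_ládje" "seamma_láhkái" ;',
]
for line in lines:
    print(underscore_entries(line))
```

A list like this shows which rule entries would have to be revisited if the underscore were replaced by another separator.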

I do not know whether vislcg3 handles spaces within quotation marks.

We could remove it in the presentation, but if we are to link it to analysis, we still need to identify dan_botta_go as one unit.

albbas commented 10 years ago

Comment 8749

Date: 2013-12-11 08:35:47 +0100 From: Ciprian Gerstenberger <>

(In reply to comment #3)

We could remove it in the presentation, but if we are to link it to analysis, we still need to identify dan_botta_go as one unit.

This is no problem, I guess. CWB allows whitespace only in the word string, so the very last compilation of the data looks like this:

<w word="dan botta go" lemma="dan_botta_go" pos="CS"

Ergo: we keep the information about the MWE in the lemma.
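The idea can be sketched in Python as follows. The Token class and both helpers are hypothetical and only mirror the three attributes shown above (word, lemma, pos); this is not the actual corpus compilation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # display form, spaces allowed
    lemma: str  # analysis key, underscores kept
    pos: str


def make_token(form: str, lemma: str, pos: str) -> Token:
    """Replace underscores only in the display form; leave the lemma untouched."""
    return Token(word=form.replace("_", " "), lemma=lemma, pos=pos)


def is_mwe(token: Token) -> bool:
    """Assumption: an underscore in the lemma marks a multi-word expression."""
    return "_" in token.lemma


token = make_token("dan_botta_go", "dan_botta_go", "CS")
print(token.word, token.lemma, is_mwe(token))
```

With this split, the search output shows "dan botta go" while the lemma still identifies the expression as one unit for linking to the analysis.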

albbas commented 10 years ago

Comment 8752

Date: 2013-12-12 12:14:41 +0100 From: Ciprian Gerstenberger <>

Now I have changed this and recompiled the corpus. There is, however, a problem with the dependency tree: it doesn't show up when clicking on the button.

The Muitalus corpus will be corrected soon, too.

http://gtweb.uit.no/korp/#corpus=sme_corpus_20131211&page=0&search-tab=2&search=cqp|[word+%3D+%22dan+botta+go%22]
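For reference, the search part of that link can be rebuilt with Python's standard library. This is only a sketch of how the CQP expression ends up encoded in the URL fragment; the helper names are made up, and the word is not escaped, so it must not contain a double quote.

```python
from urllib.parse import quote_plus


def cqp_word_query(word: str) -> str:
    """Build a CQP token expression matching the word attribute exactly."""
    return f'[word = "{word}"]'


def korp_search_fragment(cqp: str) -> str:
    """Encode the CQP expression as in the link above (brackets left as-is)."""
    return "search=cqp|" + quote_plus(cqp, safe="[]")


print(korp_search_fragment(cqp_word_query("dan botta go")))
```

Because the word attribute now contains plain spaces, the query has to contain spaces too; a search for the underscore form would no longer match the recompiled corpus.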

albbas commented 10 years ago

Comment 9342

Date: 2014-04-25 20:34:56 +0200 From: Ciprian Gerstenberger <>

The bug is gone. Bug closed.