albbas commented 12 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1342

Date: 2012-05-05T01:11:05+02:00 From: Trond Trosterud <> To: Børre Gaup <> CC: borre.gaup, lene.antonsen, sjur.n.moshagen, trond.trosterud

Last updated: 2012-05-11T21:31:56+02:00

albbas commented 12 years ago

Comment 6179

Date: 2012-05-05 01:11:05 +0200 From: Trond Trosterud <>

To reproduce:

Make sure you have a relatively recent (and long) abbr.txt file:

wc -l sme/bin/abbr.txt
51634 sme/bin/abbr.txt

Save the text below as a file and run it through the preprocessor:

Text:

mii bat jua. mii bat dá lea? mun galggan dáinna.

Command:

cat Text | preprocess --abbr=sme/bin/abbr.txt |l

Output:

mii bat jua . mii . bat dá lea ? mun galggan dáinna .

The particularly disturbing part is the period following the second "mii". Removing the newly added idiom lexicon (leaving only one) also removes the bug. Here, the file 2abbr.txt is the same as abbr.txt, except that over 50000 MWEs have been removed:

~/main/gt$ wc -l sme/*t
772 sme/2abbr.txt

And now the bug disappears:

~/main/gt$ cat ~/Desktop/2.txt | preprocess --abbr=sme/2abbr.txt | l

mii bat jua . mii bat dá lea ? mun galggan dáinna

"mii" is found in the abbr.txt file:

~/main/gt/sme$ grep '^mii ' 1abbr.txt
mii nu
mii nugo
mii nuhan
mii nuge
mii nugen
mii nuges
mii nugis
mii nunai
mii nuba
mii nubason
mii nubahal
mii nubahan
mii nuban
mii nube
mii nubeson
mii nubehal
mii nubehan
mii nuhal
mii nuhan
mii nubat
mii nuson
mii nu

After removing these, the mii bug disappears.

But this is not the only case. I have had reports of the same behaviour for Biret, and Biret is also part of such compounds:

grep '^Biret ' 1abbr.txt | wc -l
224

Some preliminary conclusions:

In the short term, "mii " must be taken out of the abbr file; as it is, it is a catastrophe. In the longer term, we need better MWE handling.

albbas commented 12 years ago

Comment 6180

Date: 2012-05-05 11:03:53 +0200 From: Trond Trosterud <>

More investigations: I reduce the whole 50000 abbr.txt file to two lines (the word "nu" may be replaced with anything, I tried with "Trond", with the same result):

LEXICON IDIOM
mii nu

And the result is the same. The first mii is kept in the text, but the second one gets an unmotivated period inserted:

~/main/gt/sme$ cat ~/Desktop/2.txt | preprocess --abbr=8abbr.txt
mii bat jua . mii . bat

albbas commented 12 years ago

Comment 6181

Date: 2012-05-05 11:09:59 +0200 From: Trond Trosterud <>

New try: I have reproduced the incident. Two factors must be in place:

The IDIOM file contains a two-word expression A B
The input text file contains the word A twice, neither time followed by B

The test. The abbr.txt file in full:

LEXICON IDIOM
mii Trond
heilt sant

The input file in full:

mii bat jua. mii bat dá lea? mun galggan dáinna. heilt trist er det er heilt ofte.

The test result:

~/main/gt/sme$ cat ~/Desktop/2.txt | preprocess --abbr=8abbr.txt
mii bat jua . mii . bat dá lea ? mun galggan dáinna . heilt trist er det er heilt . ofte
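
The two conditions in this comment can be checked mechanically before running preprocess on a text. The following is only a sketch under stated assumptions: the function names parse_idioms and trigger_words are made up for illustration, the abbr file is assumed to list one entry per line under a LEXICON IDIOM header, tokenisation is a naive whitespace split, and "twice" is read as "at least twice". It says nothing about how preprocess itself works.

```python
def parse_idioms(abbr_text):
    """Return the two-word entries listed under LEXICON IDIOM as (A, B) pairs."""
    pairs = []
    in_idiom = False
    for line in abbr_text.splitlines():
        line = line.strip()
        if line.startswith("LEXICON"):
            # Assumed file layout: a "LEXICON <NAME>" header opens each section.
            in_idiom = line.split()[1:2] == ["IDIOM"]
            continue
        words = line.split()
        if in_idiom and len(words) == 2:
            pairs.append((words[0], words[1]))
    return pairs


def trigger_words(abbr_text, input_text):
    """Return the words A that satisfy both conditions described in the comment."""
    # Naive tokenisation (an assumption): whitespace split, trailing punctuation removed.
    tokens = [t.strip(".,?!") for t in input_text.split()]
    hits = []
    for a, b in parse_idioms(abbr_text):
        positions = [i for i, t in enumerate(tokens) if t == a]
        followed_by_b = any(tokens[i + 1 : i + 2] == [b] for i in positions)
        if len(positions) >= 2 and not followed_by_b and a not in hits:
            hits.append(a)
    return hits


abbr = "LEXICON IDIOM\nmii Trond\nheilt sant\n"
text = ("mii bat jua. mii bat dá lea? mun galggan dáinna. "
        "heilt trist er det er heilt ofte.")
print(trigger_words(abbr, text))
```

With the abbr.txt and input file from this comment, both "mii" and "heilt" are reported, which matches the two places where the spurious period shows up.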

albbas commented 12 years ago

Comment 6182

Date: 2012-05-05 11:17:03 +0200 From: Trond Trosterud <>

Here is the longest text for which abbr.txt still causes the malfunction. If I add one more word between the two occurrences of "heilt", the additional period after the second "heilt" is no longer added. For this text and any shorter text, the period is added. So preprocess seems to use a scan window here.

mii bat jua. mii bat dá lea? mun galggan dáinna. heilt trist er det Kárášjogas leat vihtta veagalváldinášši váidojuvvon politiijaide 2011 rájes dassážii dan jahkái. Politiijat ballet ahte sáhttet vel leat áššit mat eai goassege váidojuvvo.

Veagalváldin Kárášjoga lensmánnekantuvrii lei viđat veagalváldinváidda maid lensmánnekantuvra dearvvašvuođaguovddážis er heilt ofte.
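
The size of the suspected scan window could be located by bisection instead of by hand. This is a sketch only: shows_bug is a hypothetical callable that would build an input with n filler words between the two occurrences of "heilt", run preprocess on it, and report whether the extra period appears. It also assumes the behaviour described above, namely that the bug appears for every gap up to some threshold and for no larger gap.

```python
def largest_gap_with_bug(shows_bug, upper):
    """Return the largest n in [0, upper] for which shows_bug(n) is true, or -1.

    Assumes monotonic behaviour: true up to some threshold, false beyond it.
    """
    lo, hi = 0, upper
    best = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if shows_bug(mid):
            best = mid      # bug still present: try a larger gap
            lo = mid + 1
        else:
            hi = mid - 1    # bug gone: the threshold is below mid
    return best
```

In practice shows_bug would wrap a call to preprocess; here it is left abstract so that the search itself can be tried with any stand-in predicate.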

albbas commented 12 years ago

Comment 6195

Date: 2012-05-06 15:28:07 +0200 From: Trond Trosterud <>

So far, the report has been about the preprocessor adding periods where none should be. Now I will report the opposite: the failure to add a period where there should be one. To reproduce, first try a small test:

~/main/gt/sme$ echo "Vuoiti oažžu 500 ru. Vuoiti oažžu 500 ru. Vuoiti oažžu 500 kr. Vuoiti oažžu 500 kr." | preprocess --abbr=bin/abbr.txt
Vuoiti oažžu 500 ru. . Vuoiti oažžu 500 ru. . Vuoiti oažžu 500 kr. . Vuoiti oažžu 500
Use of uninitialized value $next_word in pattern match (m//) at /Users/trond/main/gt/script/preprocess line 566, <> line 1.
kr.

Here it works (but note the error message).

Then do the same in a real-size text:

cat biggies/gt/sme/corp/testkorpus.txt| preprocess --abbr=main/gt/sme/bin/abbr.txt |l

Then, search for Vuoiti, and you get:

Vuoiti oažžu 500 ru. Juohke heasta borrá

Thus: In this case, "ru." does not behave as the intransitive abbreviation it should, and as it did in the first example.
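
Searching a large output by hand for such cases is tedious, so a small checker may help. The sketch below is an assumption-laden illustration, not part of preprocess: missing_periods is a made-up name, the output is assumed to be whitespace-separated tokens, and a missing sentence-final period is guessed from an abbreviation token being followed directly by a capitalised word.

```python
def missing_periods(output_text, abbreviation):
    """Return token indexes where `abbreviation` is directly followed by a
    capitalised word instead of a separate "." token."""
    tokens = output_text.split()
    suspects = []
    for i, token in enumerate(tokens[:-1]):
        nxt = tokens[i + 1]
        if token == abbreviation and nxt[0].isupper():
            suspects.append(i)
    return suspects


good = ("Vuoiti oažžu 500 ru. . Vuoiti oažžu 500 ru. . "
        "Vuoiti oažžu 500 kr. . Vuoiti oažžu 500 kr.")
bad = "Vuoiti oažžu 500 ru. Juohke heasta borrá"
print(missing_periods(good, "ru."), missing_periods(bad, "ru."))
```

The output of the small test gives no hits, while the corpus line with "ru. Juohke" is flagged at the position of "ru.".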

albbas commented 12 years ago

Comment 6217

Date: 2012-05-07 17:24:22 +0200 From: Lene Antonsen <>

Example of a sentence that is impossible to analyse because of incorrect preprocessing, even when I try to preprocess and analyse it as a single sentence.

echo 'Dearvvašvuođabargiilága mii gieđahallá dieđuid addima pasienttaide, ja pasientavuoigatvuođalága § 1-3, mii gieđahallá vuoigatvuođa informerejuvvon miehtamii, leat ovdamearkkat dakkár vuoigatvuođain.' | preprocess --abbr=sme/bin/abbr.txt | l

, ja pasientavuoigatvuođalága § 1-3 , mii . gieđahallá vuoigatvuođa informerejuvvon miehtamii

albbas commented 12 years ago

Comment 6231

Date: 2012-05-08 20:12:30 +0200 From: Lene Antonsen <>

I have tested with revision 53888 of preprocess, and with it I do not get the bugs mentioned in the comments here.

albbas commented 12 years ago

Comment 6232

Date: 2012-05-08 20:29:07 +0200 From: Trond Trosterud <>

The one difference between 53888 and the next version (53890) is this:

} elsif ($nopunct =~ /[§\d\pP\pL]/ && $words_aref->[0]{word} =~ /^[$parentheses]/o) {
} elsif ($nopunct =~ /[§\d\pP\pL]/ && $words_aref->[0]{word} =~ /^[$parentheses°]/o) {

To me the little ring ° looks like junk; is it?

albbas commented 12 years ago

Comment 6236

Date: 2012-05-08 21:27:57 +0200 From: Børre Gaup <>

The little ring is not junk; cf. the commit messages in 53889 and 53890:

Modified: trunk/tools/abbrtester/abbrtester.py
Log: Degree sign behind numeral and dot wreaks havoc

and

Modified: trunk/gt/script/preprocess
Log: Fixed numeral+dot followed by degree sign

albbas commented 12 years ago

Comment 6237

Date: 2012-05-08 22:50:17 +0200 From: Trond Trosterud <>

OK, now we know. But it is nevertheless from this check-in (or the next one?) onwards that preprocess does not work. My suggestion is that you look at this bug, since you yourself remember best what you did and why.

albbas commented 12 years ago

Comment 6244

Date: 2012-05-09 16:13:23 +0200 From: Lene Antonsen <>

script$ svn ci -m "Tilbake til r53888 inntil videre, pga bug #1342. Vi må ha noe som fungerer." preprocess
Sending preprocess
Transmitting file data .
Committed revision 58499.

(Commit message: "Back to r53888 for the time being, because of bug #1342. We must have something that works.")

I did it this way because I/we depend on a preprocess that works.

albbas commented 12 years ago

Comment 6246

Date: 2012-05-10 07:15:34 +0200 From: Lene Antonsen <>

(In reply to comment #10)

This version works much better than the newest one, but there is still a bug.

In the following text, ru. is preprocessed incorrectly in both sentences:

Vuosttaš logi minuvtta lei buorre áigodat Nordlysa ektui. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái. Sus leat golbma oappá. Mun oasttán guokte girjji. Son lea guoktelogi jagi boaris. Sin stáljas leat vihtta dámmá ja okta ore. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái.

But when I remove the first sentence, ru. is preprocessed correctly in both sentences (!):

Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái. Sus leat golbma oappá. Mun oasttán guokte girjji. Son lea guoktelogi jagi boaris. Sin stáljas leat vihtta dámmá ja okta ore. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái.

albbas commented 12 years ago

Comment 6267

Date: 2012-05-11 11:35:04 +0200 From: Børre Gaup <>

Some more research on when the "mii ." bug was introduced shows that it came in with commit 55394, which has the commit message:

Hack to break the infinite loop
preprocess went into an infinite loop if it found an idiom without an ending punctum, but in fact expected an abbreviation with an ending punctum.

albbas commented 12 years ago

Comment 6270

Date: 2012-05-11 16:58:12 +0200 From: Børre Gaup <>

The mii bug has been fixed in commit 58606

albbas commented 12 years ago

Comment 6273

Date: 2012-05-11 21:31:56 +0200 From: Børre Gaup <>

Added tests and solutions for the rest of the problems mentioned in this bug report in commits 58612, 58613, 58614

giellalt / bugzilla-dummy

Multiwords (LEXICON IDIOM) causes the preprocessor to malfunction. (Bugzilla Bug 1342) #821
