Date: 2012-05-05 01:11:05 +0200
From: Trond Trosterud <
To repeat:
Make sure you have a relatively recent (and long) abbr.txt file:
wc -l sme/bin/abbr.txt
51634 sme/bin/abbr.txt
Save this text as a file and run it through the preprocessor:
Text:
mii bat jua. mii bat dá lea? mun galggan dáinna.
Command:
cat Text | preprocess --abbr=sme/bin/abbr.txt |l
Output:
mii bat jua . mii . bat dá lea ? mun galggan dáinna .
The particularly disturbing part is the period following the second "mii". Now, removing the newly added idiom lexicon (leaving only one) also removes the bug. Here, the file 2abbr.txt is the same as abbr.txt, except that over 50000 MWEs have been removed:
~/main/gt$wc -l sme/*t
772 sme/2abbr.txt
And now the bug disappears:
~/main/gt$cat ~/Desktop/2.txt | preprocess --abbr=sme/2abbr.txt |l
mii bat jua . mii bat dá lea ? mun galggan dáinna
"mii" is found in the abbr.txt file:
~/main/gt/sme$grep '^mii ' 1abbr.txt
mii nu
mii nugo
mii nuhan
mii nuge
mii nugen
mii nuges
mii nugis
mii nunai
mii nuba
mii nubason
mii nubahal
mii nubahan
mii nuban
mii nube
mii nubeson
mii nubehal
mii nubehan
mii nuhal
mii nuhan
mii nubat
mii nuson
mii nu
After removing these, the mii bug disappears.
But this is not the only case. I have had reports of the same behaviour for Biret, and Biret is also part of such compounds:
grep '^Biret ' 1abbr.txt |wc -l
224
Some preliminary conclusions:
In the short term, "mii " must be taken out of the abbr file; as it is, it is a catastrophe. In the longer term, we need better MWE handling.
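To see which other words are exposed in the same way as "mii" and "Biret", the multi-word entries can be counted per first word. The following is only a minimal sketch. It assumes that abbr.txt has one entry per line, that lines starting with LEXICON are section headers, and that any other line with more than one whitespace-separated field is a multi-word entry. The function name is made up for illustration.

```shell
# count_mwe_first_words FILE
# Print "COUNT WORD" for every word that starts a multi-word entry,
# most frequent first. Assumes one entry per line and that lines
# starting with LEXICON are section headers, not entries.
count_mwe_first_words() {
    awk '$1 != "LEXICON" && NF > 1 { n[$1]++ }
         END { for (w in n) print n[w], w }' "$1" |
    sort -k1,1nr -k2,2
}
```

Running count_mwe_first_words sme/bin/abbr.txt | head would then list the first words with the most multi-word entries, which are the likeliest candidates for the same problem.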
Date: 2012-05-05 11:03:53 +0200
From: Trond Trosterud <
More investigations: I reduce the whole 50000 abbr.txt file to two lines (the word "nu" may be replaced with anything, I tried with "Trond", with the same result):
LEXICON IDIOM
mii nu
And the result is the same. The first mii is kept in the text, but the second one gets an unmotivated period inserted:
~/main/gt/sme$cat ~/Desktop/2.txt | preprocess --abbr=8abbr.txt
mii bat jua . mii . bat
Date: 2012-05-05 11:09:59 +0200
From: Trond Trosterud <
New try: I repeat the incident. Two factors must be in place:
The test: The abbr.txt file in extenso:
LEXICON IDIOM
mii Trond
heilt sant
The input file in extenso:
mii bat jua. mii bat dá lea? mun galggan dáinna. heilt trist er det er heilt ofte.
The test result:
~/main/gt/sme$cat ~/Desktop/2.txt | preprocess --abbr=8abbr.txt
mii bat jua . mii . bat dá lea ? mun galggan dáinna . heilt trist er det er heilt . ofte
Date: 2012-05-05 11:17:03 +0200
From: Trond Trosterud <
Here is the maximal amount of text that still makes abbr.txt malfunction. If I add one more word between the two occurrences of "heilt", the additional period after the second "heilt" is not added. For this text and any shorter text, the period is added. So preprocess seems to use a scan window here.
mii bat jua. mii bat dá lea? mun galggan dáinna. heilt trist er det Kárášjogas leat vihtta veagalváldinášši váidojuvvon politiijaide 2011 rájes dassážii dan jahkái. Politiijat ballet ahte sáhttet vel leat áššit mat eai goassege váidojuvvo.
Veagalváldin Kárášjoga lensmánnekantuvrii lei viđat veagalváldinváidda maid lensmánnekantuvra dearvvašvuođaguovddážis er heilt ofte.
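The scan-window hypothesis could be probed by generating inputs with a growing number of filler words between the two occurrences of "heilt" and noting at which distance the extra period stops appearing. The following is a sketch, not a tested procedure. It assumes that preprocess is on the PATH, that 8abbr.txt holds the reduced lexicon from the previous comment, that the output tokens are separated by whitespace, and that the window can be measured in words at all. The function names and the filler word "x" are invented for illustration.

```shell
# make_input N: print a test text with N filler words ("x") between
# the two occurrences of "heilt".
make_input() {
    filler=$(awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "x " }')
    printf 'heilt trist %ser heilt ofte.\n' "$filler"
}

# has_spurious_period WORD: read tokenised text on stdin and succeed
# if WORD is ever directly followed by a stand-alone "." token.
has_spurious_period() {
    awk -v w="$1" '
        { for (i = 1; i <= NF; i++) {
              if (prev == w && $i == ".") found = 1
              prev = $i } }
        END { exit !found }'
}

# probe_window MAX: report, for 0..MAX filler words, whether the
# extra period shows up. Assumes preprocess and 8abbr.txt as above.
probe_window() {
    n=0
    while [ "$n" -le "$1" ]; do
        if make_input "$n" | preprocess --abbr=8abbr.txt |
           has_spurious_period heilt
        then echo "$n filler words: extra period"
        else echo "$n filler words: ok"
        fi
        n=$((n + 1))
    done
}
```

Calling probe_window 100 would print one line per distance. The first "ok" line after a run of "extra period" lines would mark the edge of the window, if there is one.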
Date: 2012-05-06 15:28:07 +0200
From: Trond Trosterud <
So far, the report has been about the preprocessor adding periods where none should be. Now I will report the opposite: the failure to add a period where there should be one. To repeat: First try a small test:
~/main/gt/sme$echo "Vuoiti oažžu 500 ru. Vuoiti oažžu 500 ru. Vuoiti oažžu 500 kr. Vuoiti oažžu 500 kr." | preprocess --abbr=bin/abbr.txt
Vuoiti oažžu 500 ru. . Vuoiti oažžu 500 ru. . Vuoiti oažžu 500 kr. . Vuoiti oažžu 500
Use of uninitialized value $next_word in pattern match (m//) at /Users/trond/main/gt/script/preprocess line 566, <> line 1.
kr.
Here it works (but note the error message).
Then do the same in a real-size text:
cat biggies/gt/sme/corp/testkorpus.txt| preprocess --abbr=main/gt/sme/bin/abbr.txt |l
Then, search for Vuoiti, and you get:
Vuoiti oažžu 500 ru. Juohke heasta borrá
Thus: In this case, "ru." does not behave as the intransitive abbreviation it should, and as it did in the first example.
Date: 2012-05-07 17:24:22 +0200
From: Lene Antonsen <
Example of a sentence that is impossible to analyse because of faulty preprocessing, even when I try to preprocess and analyse it as a single sentence.
echo 'Dearvvašvuođabargiilága mii gieđahallá dieđuid addima pasienttaide, ja pasientavuoigatvuođalága § 1-3, mii gieđahallá vuoigatvuođa informerejuvvon miehtamii, leat ovdamearkkat dakkár vuoigatvuođain.' | preprocess --abbr=sme/bin/abbr.txt | l
, ja pasientavuoigatvuođalága § 1-3 , mii . gieđahallá vuoigatvuođa informerejuvvon miehtamii
Date: 2012-05-08 20:12:30 +0200
From: Lene Antonsen <
I have tested with version 53888 of preprocess, and with it I do not get the bugs mentioned in the comments here.
Date: 2012-05-08 20:29:07 +0200
From: Trond Trosterud <
The one difference between 53888 and the next version (53890) is this:
To me the doughnut ° looks like junk; is it?
Date: 2012-05-08 21:27:57 +0200
From: Børre Gaup <
The doughnut is not junk, cf. the commit messages in 53889 and 53890:
Modified: trunk/tools/abbrtester/abbrtester.py
Log: Degree sign behind numeral and dot wreaks havoc
and
Modified: trunk/gt/script/preprocess
Log: Fixed numeral+dot followed by degree sign
Date: 2012-05-08 22:50:17 +0200
From: Trond Trosterud <
OK, then we know that. But it is nevertheless from this check-in (or the next one?) onwards that preprocess does not work. My suggestion is that you look at this bug, since you yourself best remember what you did and why.
Date: 2012-05-09 16:13:23 +0200
From: Lene Antonsen <
script$ svn ci -m "Tilbake til r53888 inntil videre, pga bug #1342. Vi må ha noe som fungerer." preprocess
Sending preprocess
Transmitting file data .
Committed revision 58499.
(Commit message: "Back to r53888 for the time being, because of bug #1342. We need something that works.")
I did it this way because I/we depend on a preprocess that works.
Date: 2012-05-10 07:15:34 +0200
From: Lene Antonsen <
(In reply to comment #10)
This version works much better than the newest one, but there is still a bug.
In the following text, ru. is preprocessed incorrectly in both sentences:
Vuosttaš logi minuvtta lei buorre áigodat Nordlysa ektui. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái. Sus leat golbma oappá. Mun oasttán guokte girjji. Son lea guoktelogi jagi boaris. Sin stáljas leat vihtta dámmá ja okta ore. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái.
But when I remove the first sentence, ru. is preprocessed correctly in both sentences (!):
Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái. Sus leat golbma oappá. Mun oasttán guokte girjji. Son lea guoktelogi jagi boaris. Sin stáljas leat vihtta dámmá ja okta ore. Vuoiti oažžu 500 ru. Juohke heasta borrá sullii 6 kilu suinniid beaivái.
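The difference between the two runs can be checked mechanically instead of by searching the output by eye. The sketch below rests on an assumption taken from the small test in the earlier comment, where the working output reads "ru. .": a correctly handled sentence-final abbreviation is followed by a separate "." token. It also assumes whitespace-separated tokens. The function name is invented for illustration.

```shell
# count_unterminated ABBR: read tokenised text on stdin and print how
# many times the token ABBR is NOT directly followed by a "." token.
# An ABBR that is the very last token also counts as unterminated.
count_unterminated() {
    awk -v a="$1" '
        { for (i = 1; i <= NF; i++) {
              if (prev == a && $i != ".") bad++
              prev = $i } }
        END { if (prev == a) bad++; print bad + 0 }'
}
```

Piping the output of preprocess for each of the two texts through count_unterminated ru. should, under these assumptions, give a non-zero count for the first text and 0 for the second.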
Date: 2012-05-11 11:35:04 +0200
From: Børre Gaup <
Further research shows that the mii. bug was introduced in commit 55394, which has this commit message:
Hack to break the infinite loop
preprocess went into an infinite loop if it found an idiom without an ending punctum, but in fact expected an abbreviation with an ending punctum.
Date: 2012-05-11 16:58:12 +0200
From: Børre Gaup <
The mii bug has been fixed in commit 58606
Date: 2012-05-11 21:31:56 +0200
From: Børre Gaup <
Added tests and solutions for the rest of the problems mentioned in this bug report in commits 58612, 58613, 58614
This issue was created automatically with bugzilla2github
Bugzilla Bug 1342
Date: 2012-05-05T01:11:05+02:00 From: Trond Trosterud <>
To: Børre Gaup <>
CC: borre.gaup, lene.antonsen, sjur.n.moshagen, trond.trosterud
Last updated: 2012-05-11T21:31:56+02:00