Date: 2011-05-06 06:03:42 +0200
From: Trond Trosterud <
To find, run this command:
~$ccat -l sma -r *corpus/converted/sma | sort | uniq -d|grep '[A-Za-z]'|wc -l
3554
(in apache, I get 3992, probably older make clean there)
To find the files themselves:
Look at the output of the previous command, and pick random strings:
apache_corpus$grep "Aerebi provhkim baahtjetjidie lohkedh" corpus/converted/sma/// | cut -d":" -f1
boundcorpus/converted/sma/facta/other_files/lierehtimmie_4.doc.xml
boundcorpus/converted/sma/facta/other_files/lierehtimmie_4_til_trykk.doc.xml
The 3554 duplicate sentences need not come from very many files. The lierehtimmie files are 813 lines long, so they alone already account for about 1/5 of the total.
This doublet should certainly not simply be deleted, since the two files seem to be different versions of the same text:
boundcorpus$wc converted/sma/facta/other_files/lierehtimmie_4*
8715 15341 156203 converted/sma/facta/other_files/lierehtimmie_4.doc.xml
8721 15359 156489 converted/sma/facta/other_files/lierehtimmie_4_til_trykk.doc.xml
Rather, the one earlier in the production chain (the one not "til trykk" in this case) should be moved to gold corpus.
TODO: Track down and fix the 3554 sentences and remove doublet files (same content, different name). Where files are versions of each other, the older ones should go to goldcorpus or some special corner.
Date: 2011-05-06 06:09:58 +0200
From: Trond Trosterud <
To give an impression, sentences such as these occur several times:
5 Dåarjoen voestes bielie galka maaksasovvedh gosse libie åadtjome konto- jih åårganisasjovnenummerem. ¶
5 Dåarjoedåastoje tjuara dejtie krïebpesjh gïehtjedidh: ¶
5 Dåarjoedåastoje ryökneme-buerkestimmine tjuara gïehtelidh beetnehnuhtjemen gaavhtan. Jis beetnegh nuhtjesuvvieh baalhkese jih honorarese, dle dåarjoedåastojen barkoevedtije-diedte jih tjuara siejhmetji gïehtelidh åvtelh-bodti geaseminie jih geehtestidh baalhka- jih geaseme-laavenjassh tjïelten beetnehreerijasse. ¶
5 Daesnie gaajhke gåarede - saaht guktie ¶
3 Manne leam ånnetji rovneges aaj, manne dejtie båeries artistidie goltelem, ”Rolling Stones” jih »Janis Joplin”. ¶
3 -Manne leam gaektsien jaepien båeries. ¶
3 Manne leam akte onne kaarretje, manne leam unnemes dennie mov klaassesne. Mov lea guhkies jovje voepth. ¶
3 Manne jijnjh aath lyjhkem mejgujmie noerh barkeminie., Manne ¶
2 Åvlan taxi-sijjesne vuejijh daerpies. ¶
2 Åvla monnen deerpegidie måjhtiji jih olkese vijleli. Dellie vøøjni dihte gåmma dejtie voesside gåatan doereminie jih voejngesistie govli dihte løøvles disse. Gullie-værjoe klienjedi jalhts dålle aernesne jamhkaminie. ¶
2 Åvla: Mejtie maahtah munnjien naan guelieh doekedh? ¶
Some strings do of course recur without being file doublets:
68 Baakoeh ¶
64 Tjielkestimmieh ¶
41 BAAKOEH ¶
40 Lohkehtimmien ulmieh leah; learohkh gelkieh ¶
39 NSR sæjhta: ¶
38 Lohkeme ¶
22 Lohkehtimmesne gelkieh learohkh ¶
20 Raahkele-Piere: ¶
20 Lohkh vielie daaroen: ¶
20 Lohkehtimmien ulmie leah; learohkh gelkieh ¶
18 Nee naa nee naa nee naa nov, ¶
17 -Jaavoe. ¶
17 BAAKOELÆSTOE ¶
16 Jaahkenelkien Aanna ¶
15 Maanah: ¶
14 Tijje ektesne soptsestalledh ¶
11 Jaepietsiehkieh Öörnege ¶
9 Åarjelsaemien gielekuvsje ¶
9 Lohkedh jih tjaeledh ¶
9 Goltelidh jih soptsestidh ¶
9 Giele- jih kultuvremaahtoe ¶
9 Buerie biejjie! ¶
8 Soptsesth guktie saemienskåvlosne vaedtsedh? ¶
8 Saemesth amma! ¶
8 guktie ¶
8 Datne: ………………………………………………………………………………….. ¶
7 vaedtsedh ¶
7 Skodth mov gaavalohke ¶
7 NSR-n mïelesne: ¶
7 dan ¶
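A frequency list like the two above can be produced by counting identical lines instead of merely flagging them. What follows is a minimal sketch and not necessarily the command used in this thread: the function name `rank_duplicates` is hypothetical, and it assumes one sentence per line on standard input, as in the `ccat` output.

```shell
# rank_duplicates: read one sentence per line on stdin and print every line
# that occurs more than once, most frequent first, prefixed by its count.
# Hypothetical helper; assumes one sentence per input line.
# Steps: keep lines with at least one ASCII letter, sort so that identical
# lines become adjacent, count them, drop singletons, sort by count.
rank_duplicates() {
    grep '[A-Za-z]' |
        sort |
        uniq -c |
        awk '$1 > 1' |
        sort -k1,1nr
}

# Usage, assuming ccat is available as in the comments above:
#   ccat -l sma -r *corpus/converted/sma | rank_duplicates | head -20
```

Counting with `uniq -c` rather than `uniq -d` keeps the number of occurrences, which is what separates frequent headings from sentences repeated a handful of times.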
Date: 2011-05-06 06:21:41 +0200
From: Trond Trosterud <
This command tells me there are 7 file pairs of exactly the same size in boundcorpus/converted/sma/facta/other_files/:
ls -l boundcorpus/converted/sma/facta/other_files/|cut -d" " -f5-|cut -c1-8|uniq -c|sort -nr|l
To have a look:
~$ls -l boundcorpus/converted/sma/facta/other_files/| grep ' 7046 '
-rw-rw-r-- 1 trond trond 7046 mai 5 11:20 Kap_10_3_Rovnigs_hieje.doc.xml
-rw-rw-r-- 1 trond trond 7046 mai 5 11:20 Soete_rovnigs_hiejen_bijre.doc.xml
~$ls -l boundcorpus/converted/sma/facta/other_files/| grep ' 2929 '
-rw-rw-r-- 1 trond trond 2929 mai 5 11:17 guktie_snåasesne_årrodh_2.3.doc.xml
-rw-rw-r-- 1 trond trond 2929 mai 5 11:17 guktie_snåasesne_årrodh.doc.xml
etc.
So, what is the moral of this?
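Átnot, de oažžubehtet. Ohcet, de gávdnabehtet. Goalkkuhehket, de didjiide rahppojuvvo uksa. [Ask, and you shall receive. Seek, and you shall find. Knock, and the door shall be opened to you.]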
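Since `uniq` only collapses adjacent lines, a size-based check is more robust when the sizes are sorted before grouping. The sketch below shows one way to list files that share a byte size with another file in the same directory. The function name `same_size_files` is hypothetical, and equal size is only a hint of identical content, not proof.

```shell
# same_size_files DIR: list regular files directly inside DIR whose byte size
# is shared with at least one other file, as "size<TAB>path", sorted by size.
# Hypothetical helper; file names containing tabs or newlines are not handled.
same_size_files() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # wc -c prints the byte count; the arithmetic expansion strips padding
        printf '%s\t%s\n' "$(( $(wc -c < "$f") ))" "$f"
    done |
        sort -n |
        awk -F '\t' '
            { count[$1]++; lines[NR] = $0; size[NR] = $1 }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print lines[i]
            }'
}

# Usage:
#   same_size_files boundcorpus/converted/sma/facta/other_files
```

Files of equal size can still differ in content, so a checksum or `cmp` is needed before anything is treated as a true duplicate.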
Date: 2011-05-06 06:57:49 +0200
From: Ciprian Gerstenberger <
(In reply to comment #2)
> This command tells me there are 7 file pairs of exactly the same size in boundcorpus/converted/sma/facta/other_files/:
> ls -l boundcorpus/converted/sma/facta/other_files/|cut -d" " -f5-|cut -c1-8|uniq -c|sort -nr|l
> To have a look:
> ~$ls -l boundcorpus/converted/sma/facta/other_files/| grep ' 7046 '
> -rw-rw-r-- 1 trond trond 7046 mai 5 11:20 Kap_10_3_Rovnigs_hieje.doc.xml
> -rw-rw-r-- 1 trond trond 7046 mai 5 11:20 Soete_rovnigs_hiejen_bijre.doc.xml
> ~$ls -l boundcorpus/converted/sma/facta/other_files/| grep ' 2929 '
> -rw-rw-r-- 1 trond trond 2929 mai 5 11:17 guktie_snåasesne_årrodh_2.3.doc.xml
> -rw-rw-r-- 1 trond trond 2929 mai 5 11:17 guktie_snåasesne_årrodh.doc.xml
> etc.
> So, what is the moral of this?
> Átnot, de oažžubehtet. Ohcet, de gávdnabehtet. Goalkkuhehket, de didjiide rahppojuvvo uksa.
What should we learn from that?
Date: 2011-05-09 10:20:56 +0200
From: Trond Trosterud <
This is an easy bug to fix. Fix it.
Date: 2011-05-09 10:48:18 +0200
From: Trond Trosterud <
Sorry for the last comment, I did not notice the frustration that shone through. Let me reformulate myself in another forum.
Date: 2011-05-09 12:48:14 +0200
From: Tomi Pieski <
I ran a command that computed the md5sum of the files and printed out duplicates. The command:
find ./*corpus/converted/sma -name "*.xml" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find ./*corpus/converted/sma -name "*.xml" -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Result:
d80848dfeb268359d3aa485bb322ab72 ./boundcorpus/converted/sma/facta/other_files/guktie_snåasesne_årrodh.doc.xml
d80848dfeb268359d3aa485bb322ab72 ./boundcorpus/converted/sma/facta/other_files/guktie_snåasesne_årrodh_2.3.doc.xml
f1521de7e6b6e381f89454ef118fc397 ./boundcorpus/converted/sma/facta/other_files/Voestes_skuvle_biejjie_1.2.doc.xml
f1521de7e6b6e381f89454ef118fc397 ./boundcorpus/converted/sma/ficti/Voestes_skuvle_biejjie_1.2.doc.xml
0582bdf3aed32b5637add8818b70798a ./boundcorpus/converted/sma/admin/depts/SD_060_gærjabusse.doc.xml
0582bdf3aed32b5637add8818b70798a ./boundcorpus/converted/sma/facta/other_files/SD_060_gærjabusse.doc.xml
3c7b82d1fea1d7f180672099210cf30e ./boundcorpus/converted/sma/facta/other_files/Manne_miesiem_mierhkesje.doc.xml
3c7b82d1fea1d7f180672099210cf30e ./boundcorpus/converted/sma/facta/other_files/Manne_miesiem_mierhkesje_3.1.doc.xml
3d5554578eff2341fcc6fd806fd93410 ./boundcorpus/converted/sma/facta/other_files/Bilde_Dikt.doc.xml
3d5554578eff2341fcc6fd806fd93410 ./boundcorpus/converted/sma/facta/other_files/Bilde_Dikt_3.1.doc.xml
3e8d6edd40ed06c63c63f4b1c65b9093 ./boundcorpus/converted/sma/admin/depts/SD_060-07_støtte_samiske_teaterformål_2008.doc.xml
3e8d6edd40ed06c63c63f4b1c65b9093 ./boundcorpus/converted/sma/facta/other_files/SD_060-07_støtte_samiske_teaterformål_2008.doc.xml
339426d928372a3b574f537fa5c4927e ./boundcorpus/converted/sma/admin/depts/Raeriestimmie_straategijehke_soejkesjen_åvteste_learoevierhtieh.doc.xml
339426d928372a3b574f537fa5c4927e ./boundcorpus/converted/sma/facta/other_files/Raeriestimmie_straategijehke_soejkesjen_åvteste_learoevierhtieh.doc.xml
4280c8d06a0936529e5264a2f813fe20 ./boundcorpus/converted/sma/admin/depts/SD_060-07_driftsstøtte_til_samiske_språksentre_2008.doc.xml
4280c8d06a0936529e5264a2f813fe20 ./boundcorpus/converted/sma/facta/other_files/SD_060-07_driftsstøtte_til_samiske_språksentre_2008.doc.xml
66d6c1ea6683235664ca4e287c110185 ./boundcorpus/converted/sma/facta/other_files/G_baakoeh.doc.xml
66d6c1ea6683235664ca4e287c110185 ./boundcorpus/converted/sma/facta/other_files/G_baakoeh_3.1.doc.xml
8cf43b72c93567c34d8754841196a448 ./boundcorpus/converted/sma/facta/other_files/Gieries_Krihke_1.3.doc.xml
8cf43b72c93567c34d8754841196a448 ./boundcorpus/converted/sma/ficti/Gieries_Krihke_1.3.doc.xml
82a6162806418affe8c55024c23656fe ./boundcorpus/converted/sma/facta/other_files/påaske.doc.xml
82a6162806418affe8c55024c23656fe ./boundcorpus/converted/sma/facta/other_files/påaske_2.6.doc.xml
821b05126bd21aa529f0108d92fa7ce4 ./boundcorpus/converted/sma/facta/other_files/skjæra_og_tiur.doc.xml
821b05126bd21aa529f0108d92fa7ce4 ./boundcorpus/converted/sma/facta/other_files/skjæra_og_tiur_Norsk_5.1.doc.xml
833b0658289bb50f37037fc42f89823b ./boundcorpus/converted/sma/facta/other_files/Mov_skovtere.doc.xml
833b0658289bb50f37037fc42f89823b ./boundcorpus/converted/sma/facta/other_files/Mov_skovtere_2.5.doc.xml
907fbe3fe8f2af0986d64d5f984702a8 ./boundcorpus/converted/sma/facta/other_files/_1.1.doc.xml
907fbe3fe8f2af0986d64d5f984702a8 ./boundcorpus/converted/sma/ficti/Voestes_B_1.1.doc.xml
Date: 2011-05-09 12:51:20 +0200
From: Trond Trosterud <
Good. This is even more than what I found. Could you remove the doublet files? Then it will be easier to see what is doublet text in other files.
Date: 2011-05-09 17:58:12 +0200
From: Trond Trosterud <
Principles for removing doublet files:
Date: 2011-05-16 23:23:46 +0200
From: Trond Trosterud <
What is the status on the doublet issue now?
Date: 2011-05-22 08:34:22 +0200
From: Trond Trosterud <
Quoting myself from May 9th, in a comment to Tomi:
"Good. This is even more than what I found. Could you remove the doublet files? Then it will be easier to see what is doublet text in other files."
So, Tomi, did you remove those files?
It seems nothing has been done, since the initial command gives the same number of doublet sentences today:
ccat -l sma -r *corpus/converted/sma | sort | uniq -d|grep '[A-Za-z]'|wc -l
3554
(For apache, I now get 3488 identical sentences, last time I got 3992.)
Date: 2011-06-22 22:06:35 +0200
From: Trond Trosterud <
The number of doublet sentences is slowly increasing, last time 3554:
~$ccat -l sma -r *corpus/converted/sma | sort | uniq -d|grep '[A-Za-z]'|wc -l
3565
~$date
ons jun 22 22:04:25 CEST 2011
Date: 2011-09-22 07:15:02 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #6)
> I ran a command that checked the files md5sum and printed out duplicates. The command:
> find ./*corpus/converted/sma -name "*.xml" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find ./*corpus/converted/sma -name "*.xml" -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
> Result:
I reran the command (which only works on victorio, NOT the XServe), and the result is now:
ZERO
No duplicate files according to this test.
Can we close this bug?
Date: 2011-09-22 09:11:46 +0200
From: Trond Trosterud <
We measure different things. See the title: sentences.
apache_corpus$ccat -l sma -r *corpus/converted/sma | grep '[A-Za-z]'|sort|uniq -d|wc -l
3534
So, we are down 20 :-)
If Tomi's test is OK, we have no identical files, which is good.
But the problem is that we have several versions of almost identical files. What I measured was identical sentences, and here the number has (almost) not changed.
Top-15:
74 Lohkeme ¶
68 Baakoeh ¶
64 Tjielkestimmieh ¶
41 BAAKOEH ¶
40 Lohkehtimmien ulmieh leah; learohkh gelkieh ¶
39 NSR sæjhta: ¶
22 Lohkehtimmesne gelkieh learohkh ¶
20 Raahkele-Piere: ¶
20 Lohkh vielie daaroen: ¶
20 Lohkehtimmien ulmie leah; learohkh gelkieh ¶
18 Nee naa nee naa nee naa nov, ¶
17 -Jaavoe. ¶
17 BAAKOELÆSTOE ¶
16 Jaahkenelkien Aanna ¶
15 Maanah: ¶
14 Tijje ektesne soptsestalledh ¶
Some long matches:
5 Gieline gaavnedidh ¶
5 Dåarjoen voestes bielie galka maaksasovvedh gosse libie åadtjome konto- jih åårganisasjovnenummerem. ¶
5 Dåarjoedåastoje tjuara dejtie krïebpesjh gïehtjedidh: ¶
5 Dåarjoedåastoje ryökneme-buerkestimmine tjuara gïehtelidh beetnehnuhtjemen gaavhtan. Jis beetnegh nuhtjesuvvieh baalhkese jih honorarese, dle dåarjoedåastojen barkoevedtije-diedte jih tjuara siejhmetji gïehtelidh åvtelh-bodti geaseminie jih geehtestidh baalhka- jih geaseme-laavenjassh tjïelten beetnehreerijasse. ¶
5 Daesnie gaajhke gåarede - saaht guktie ¶
(...)
4 Soptsesth dov bïjre. Guktie dov nomme, man båeries leah jih gusnie årroeh: ¶
4 Snåase tjïelte jïh Noerhte-Trøøndelaagen fylhkentjïelte leah guektiengïelen tsiengelen 1. biejjien 2008 raejeste, jïh Snåase tjïelte lea dehtie miereste meatan sjidteme reeremedajven saemien gïelese. ¶
So, 4 and 5 instances of these long sentences indicate to me that we still have some identical files.
The 74 instances of "Lohkeme" is probably ok (corpus contains textbooks), but 5x "Dåarjoedåastoje ryökneme-buerkestimmine tjuara gïehtelidh beetnehnuhtjemen gaavhtan. Jis beetnegh nuhtjesuvvieh baalhkese jih honorarese, dle dåarjoedåastojen barkoevedtije-diedte jih tjuara siejhmetji gïehtelidh åvtelh-bodti geaseminie jih geehtestidh baalhka- jih geaseme-laavenjassh tjïelten beetnehreerijasse. "...?
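The reasoning above, that short repeated strings such as headings are expected while long repeated sentences point to duplicated files, can be turned into a length filter. This is only a sketch: the function name `long_duplicates` and the default threshold of 60 are arbitrary assumptions, and `length` counts characters or bytes depending on the awk implementation and locale.

```shell
# long_duplicates [MIN]: read one sentence per line on stdin and print the
# duplicated lines that are at least MIN long, with counts, most frequent first.
# Hypothetical helper; MIN defaults to 60, an arbitrary illustrative threshold.
long_duplicates() {
    sort | uniq -c |
        awk -v min="${1:-60}" '
            $1 > 1 {
                line = $0
                sub(/^ *[0-9]+ /, "", line)   # strip the count added by uniq -c
                if (length(line) >= min) print
            }' |
        sort -k1,1nr
}

# Usage, assuming ccat is available as in the comments above:
#   ccat -l sma -r *corpus/converted/sma | long_duplicates 80
```

The sentences that survive the filter are good search strings for `grep -r -l`, which then names the files that share them.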
Date: 2011-09-22 10:07:19 +0200
From: Sjur Nørstebø Moshagen <
$ grep -r -l "Dåarjoen voestes bielie galka maaksasovvedh gosse libie åadtjome konto- jih åårganisasjovnenummerem." *
admin/depts/SD_060-07_Darjomedåarjoe_saemien_kultuvre-gåetide_2008.doc.xml
admin/depts/SD_060-07_støtte_samiske_teaterformål_2008.doc.xml
admin/depts/SD_060_gærjabusse.doc.xml
facta/other_files/kultuvre-gåetide_2.doc.xml
facta/other_files/SD_060-07_Darjomedåarjoe_saemien_kultuvre-gåetide_2008.doc.xml
facta/other_files/SD_060-07_støtte_samiske_teaterformål_2008.doc.xml
facta/other_files/SD_060_gærjabusse.doc.xml
facta/other_files/SijtiJarnge-_klar.doc.xml
So, with one of those 5x sentences I got the above candidates. Some are obviously not duplicates, at least judging by the filename, but some seem to be quite good candidates for removal. The only way to check is to diff the candidate duplicates, inspect the differences manually, and then decide.
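Before diffing candidates by hand, it may help to rank the pairs by how many lines they have in common. The sketch below counts the distinct lines shared by two files. The function name `shared_line_count` is hypothetical, and the result is only a rough similarity hint that does not replace the manual inspection described above.

```shell
# shared_line_count A B: print how many distinct lines occur in both files.
# Hypothetical helper for ranking duplicate candidates before a manual diff.
shared_line_count() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    # LC_ALL=C gives sort and comm the same byte-wise ordering
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # comm -12 keeps only the lines present in both sorted inputs
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b" | wc -l | tr -d ' '
    rm -f "$tmp_a" "$tmp_b"
}

# Usage:
#   shared_line_count admin/depts/SD_060_gærjabusse.doc.xml \
#                     facta/other_files/SD_060_gærjabusse.doc.xml
```

On the converted XML, markup-only lines such as `<p>` are shared by nearly every file, so the counts are more telling on extracted text than on the raw files.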
Date: 2012-02-07 14:49:19 +0100
From: Trond Trosterud <
P1 bug for Tomi. The issue is still there. Tomi, could you suggest a priority reflecting realities? Afterwards, we may discuss whether it is high enough.
Date: 2012-08-16 18:01:32 +0200
From: Trond Trosterud <
ccat -l sma -r *corpus/converted/sma | sort | uniq -d|grep '[A-Za-z]'|wc -l
543
We are thus down 3000, but still over 500 to go. P1?
Date: 2012-09-07 19:58:05 +0200
From: Trond Trosterud <
Sjur: Any suggestion as to priority?
Date: 2012-09-10 15:05:09 +0200
From: Tomi Pieski <
I went through these files:
boundcorpus/converted/sma/facta/other_files/Kap_10_7.DOC.xml
boundcorpus/converted/sma/facta/other_files/AKTEPJ~1.DOC.xml
boundcorpus/converted/sma/facta/other_files/Jarhpoeh_10_7.doc.xml
Of the first two, AKTEPJ~1.DOC.xml has more elements, such as:
JON ISAK GÆLOK
<p xml:lang="kal">
Laara:
And then they differ in some small parts. Element attributes differ, but not the content.
And then diff like:
-Datne vienhth Shakespeare åehpies orremejis daelie vyøseme??
VS.
-Datne vienhth Shakespeare åehpies orreme
<p>
<em type="italic">jis </em>
<em type="italic">daelie vyøseme</em>
<em type="italic">??</em>
</p>
Kap_10_7.DOC.xml and Jarhpoeh_10_7.doc.xml seem to differ in that one uses Swedish äö and the other Norwegian æø:
<p xml:lang="nob">
Gøøkte keelnerh sinsitninie soptsestigan:
VS.
Göökte keelnerh sinsitninie soptsestigan:
The amount of whitespace in the text also differs. And they also differ in the uppercasing of names:
<p>
TJIDTJIE: Tåamma, båetieh gåatan!
VS.
Tjidtjie: Tåamma, båetieh gåatan!
And they differ also in content:
<p>
-Mannasinie idtji Jeense datnem åadtjoeh geadtan
dåeriedidh?
VS.
- Mannasinie idtji Jeense datnem åadtjoeh giedtien gåajkoe dåeriedidh?
And:
<p>
Piere: Strååffelåsta jis ij maam
dorjeme?
VS.
Piere: Strååffelostem jis im leah maam dorjeme?
And also in compound hyphenation:
<p>
-Mov hov lea klaasetjelmieh.
VS.
-Mov hov lea klaase-tjelmieh.
Weird:
-Dellie manne dihte tjelmie-klaash daarpesjem.
VS.
-Dellie manne dihte klaase-tjelmieh daarpesjem.
The three files seem to be almost the same in content. There are some diffs in äö vs. æø, in the use of whitespace and in how the content is divided into elements.
Should we remove one or two of the files, and if so, which?
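Orthographic and whitespace noise of the kind listed above can be filtered out before diffing, so that only the substantive differences remain. The following sketch maps ä/ö to æ/ø and collapses whitespace. The function name `normalize_text` is hypothetical, the mapping direction is arbitrary, and the output is meant for comparison only, not as corrected corpus text.

```shell
# normalize_text: read text on stdin, map ä/ö (and Ä/Ö) to æ/ø (and Æ/Ø),
# collapse runs of whitespace to one space and trim each line.
# Hypothetical helper; for comparison only, the mapping is not a spelling fix.
normalize_text() {
    sed -e 's/ä/æ/g' -e 's/ö/ø/g' -e 's/Ä/Æ/g' -e 's/Ö/Ø/g' \
        -e 's/[[:space:]][[:space:]]*/ /g' -e 's/^ //' -e 's/ $//'
}

# Usage:
#   normalize_text < Kap_10_7.DOC.xml > a.txt
#   normalize_text < Jarhpoeh_10_7.doc.xml > b.txt
#   diff a.txt b.txt
```

Differences in capitalisation, hyphenation and wording, such as the examples above, are deliberately left visible in the diff.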
Date: 2012-09-10 15:22:55 +0200
From: Tomi Pieski <
<wordcount>23</wordcount>
22a23,25 Maam Laara tjeeli påasken bijre:
other_files$wc -l påasken_bijre_2.6.doc.xml påasken_bijre.doc.xml
35 påasken_bijre_2.6.doc.xml
38 påasken_bijre.doc.xml
73 yhteensä
other_files$
Can the first file be removed?
Date: 2012-09-10 15:33:35 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #18)
> I went through files:
> boundcorpus/converted/sma/facta/other_files/Kap_10_7.DOC.xml
> boundcorpus/converted/sma/facta/other_files/AKTEPJ~1.DOC.xml
> boundcorpus/converted/sma/facta/other_files/Jarhpoeh_10_7.doc.xml
My suggestion is that we keep only one of them, and that we keep the one that seems "cleanest".
> And then they differ in some small parts. Element attributes differ, but not the content.
> And then diff like:
> -Datne vienhth Shakespeare åehpies orremejis daelie vyøseme??
> VS.
> -Datne vienhth Shakespeare åehpies orreme
> <p>
> <em type="italic">jis </em>
> <em type="italic">daelie vyøseme</em>
> <em type="italic">??</em>
> </p>
Of these two, I would prefer the former.
> Kap_10_7.DOC.xml and Jarhpoeh_10_7.doc.xml seem to differ in that other uses swedish äö and the other norwegian æø:
> <p xml:lang="nob">
> Gøøkte keelnerh sinsitninie soptsestigan:
> VS.
> Göökte keelnerh sinsitninie soptsestigan:
ö is correct, and æ is correct. If they use either æø or äö, then the two files are equally bad in this respect.
> And also different amount of whitespace in text. And they also differ when uppercasing names:
> <p>
> TJIDTJIE: Tåamma, båetieh gåatan!
> VS.
> Tjidtjie: Tåamma, båetieh gåatan!
I would prefer the last one, but this is not important.
> And they differ also in content:
> <p>
> -Mannasinie idtji Jeense datnem åadtjoeh geadtan
> dåeriedidh?
> VS.
> - Mannasinie idtji Jeense datnem åadtjoeh giedtien gåajkoe dåeriedidh?
This is serious. We need to check that we get the correct content (=what is in the printed book). Could you check with Maja? She probably has one.
> <p>
> Piere: Strååffelåsta jis ij maam
> dorjeme?
> VS.
> Piere: Strååffelostem jis im leah maam dorjeme?
> And also in compound hyphenation:
> <p>
> -Mov hov lea klaasetjelmieh.
> VS.
> -Mov hov lea klaase-tjelmieh.
Again, we want to follow the printed original.
> Weird:
> -Dellie manne dihte tjelmie-klaash daarpesjem.
> VS.
> -Dellie manne dihte klaase-tjelmieh daarpesjem.
> The three files seem to be almost the same with content. Some diffs with äö vs. æø, use of whitespace and how the content is divided in elements.
> Should we remove one or two of the files, and if so, which one..
Yes, we should, we should keep only the best one. And this goes for all such duplicates.
Date: 2012-09-10 15:35:33 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #19)
> sma$cd facta/other_files/
> other_files$diff påasken_bijre_2.6.doc.xml påasken_bijre.doc.xml
> 11c11 <
> 18 <wordcount>23</wordcount>
> 22a23,25 Maam Laara tjeeli påasken bijre:
> [...]
> The first file can be removed?
Yes.
Date: 2012-10-23 11:52:34 +0200
From: Sjur Nørstebø Moshagen <
Changing Assignee to Børre, who is the main corpus maintainer. Also reduced priority a bit, to what seems more reasonable given that we have managed since January without it being fixed.
Date: 2013-04-23 09:43:30 +0200
From: Trond Trosterud <
The situation is actually worsening. We started out with 3554 in May 2011, but now the same command gives:
ccat -l sma -r *corpus/converted/sma | sort | uniq -d|grep '[A-Za-z]'|wc -l
5881
Date: 2014-08-06 16:27:12 +0200
From: Børre Gaup <
I have removed duplicate files, and as of now there are 1230 duplicate strings.
There don't seem to be any more duplicate files, although some of the files in sma/facta/other_files contain the same stories.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1005
Date: 2011-05-06T06:03:42+02:00 From: Trond Trosterud <>
To: Børre Gaup <>
CC: ciprian.gerstenberger, lene.antonsen, maja.l.kappfjell, sjur.n.moshagen, thomas.omma, tomi.k.pieski
Last updated: 2014-08-06T16:27:12+02:00