giellalt / bugzilla-dummy


Several Min Áigi files are not converted properly. (Bugzilla Bug 283) #78

Closed albbas closed 17 years ago

albbas commented 18 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 283

Date: 2006-05-10T15:08:58+02:00 From: Maaren Palismaa <> To: Saara Huhmarniemi <> CC: borre.gaup, saara.huhmarniemi, trond.trosterud

Depends on: #76, #279 Last updated: 2006-11-04T12:44:45+01:00

albbas commented 18 years ago

Comment 925

Date: 2006-05-10 15:08:58 +0200 From: Maaren Palismaa <>

These are the files: •AJ-sverre_porsanger.txt.xml •AJ-vuoddjit.txt.xml •ÅP-dynamittfond.txt.xml •ÅP-Gåte.txt.xml •ÅP-sponsoravtale.txt.xml •IU-Bb_boazodoaluseminára.txt.xml •Kronihkka-JK-terror.txt.xml •LohkkiJános+_reinjakt,_sami.txt.xml •NHM-Finnmarkoláhka.txt.xml •NHM-Rigoberta.txt.xml •sn-ealgaovttasbargu.txt.xml •Uhca-_Odda_Gaba.txt.xml •AJ-ruoná_ivdni.txt.xml •AJ-sabetskeanka.txt.xml •AJ-skábma.txt.xml •ÅP-GDG_seastin-duppalávvudeapm.txt.xml •ÅP-liikabeana •ÅP-sunniva •ÅP-telefun_geazis.txt.xml •IU-lasse_berit.txt.xml •IU-veahki_haga.txt.xml •IU-veahkihaga4.txt.xml •JK-_girjearvvostallan.txt.xml •leder6 •Leserinnlegg-_EvaNielsen.txt.xml •Lohkki-_EU.txt.xml •uhca-ÅP-romssataxi.txt.xml •Uhca-_duodji.txt.xml •20_jagi_das_ovdal007.txt.xml •AJ-deantogáttenuorat.txt.xml •AJ-play-boy.txt.xml •alm_NRK-sami_radio.txt.xml •alm_prod_sjef,_3177_tegn.txt.xml •ÅP-IDOL-Ellen_Marie_Eira.txt.xml •ÅP-oiva_ohcan-eai_gavdnan.txt.xml •ÅP-vajalduhttan_sámi_álb.txt.xml •ÅP-veagalvaldimat.txt.xml •ÅP-Walk_of_Fame.txt.xml •IU-cuovvolan.txt.xml •IU-dalve_ealgabivdu.txt.xml •IU-jápmin_karasjogas.txt.xml •KMO-measttir.txt.xml •MÁrkka.txt.xml •NHM-_ohcejoga_sátnejodiheaddji.txt.xml •AEN-setterdagsorden-_sami.txt.xml •AJ-álbmotbeaivi_suomas.txt.xml •AJ-skihppagurra_festivála.txt.xml •ÅP-dusse_okta_cevzzii.txt.xml •ÅP-intro-álbmot.txt.xml •ÅP-johkamohkemarkanat.txt.xml •åpning,sami.txt.xml •Durham.txt.xml •Emi-_samisk.txt.xml •filmspektakel,_sami.txt.xml •horoskop_uke5_sami.txt.xml •IU-boazosiehtadus4.txt.xml •IU-NYYY_boazosiehtadus.txt.xml •kaos,_sami.txt.xml •kaos.txt.xml •liv_på_flygendeteppe.txt.xml •Program-_barents_spetakel.txt.xml •seminar.txt.xml •spekulær_åpning.txt.xml •UHCA-_MÅ_MED_FREDAG!!!.txt.xml •uhca-ÅP-SGP2005.txt.xml •UHCA-spor_isnø.txt.xml •Uhca-IFI.txt.xml •Utstilling-_sami.txt.xml •WimmeSari+_samefolketsdag.txt.xml •AJ-maori_filmmat.txt.xml •AJ-moana.txt.xml •AJ-odda_filbmadahkkit.txt.xml •AJ-ruoná_ivdni.txt.xml •alm-Tana_NSR.txt.xml •alm_beaivvas.txt.xml 
•ÅP-dalvefestivala.txt.xml •ÅP-Håvard_Klemetsen.txt.xml •govvaraiddut_0009 •IU-beatnagat_hearggit.txt.xml •IU-NBR_nubbejodiheaddji.txt.xml •IU-ráhkkanit_boazonealgái.txt.xml •Leder_9 •LOHKKI.txt.xml •NHM-Gurut_golbma.txt.xml •NHM-Solturnering.txt.xml •PP-manaid_TV.txt.xml •20_jagi_das_ovdal010.txt.xml •AJ-katja.txt.xml •ÅP-Johkamohkki.txt.xml •HAL-_Boston-ny.txt.xml •horoskop_uke_6,_sami.txt.xml •IU-Beredskap_i_Karsjok.txt.xml •IU-Sæther_beredskap.txt.xml •leder_10 •NHM-veakki_asiai.txt.xml •20_jagi_das_ovdal011.txt.xml •AJ-_ovddesfilmmat+govva.txt.xml •AJ-ann_helene.txt.xml •AJ-giellabeassi.txt.xml •AJ-kárásjogas.txt.xml •AJ-mánát_sámedikkis.txt.xml •alm_nasj_park,_Rfylkkamanni.txt.xml •IU-Heargevuodjin.txt.xml •LOHKKI-_TERJE_TRETNES,_sami.txt.xml •lohkki_Lásses.txt.xml •lohkkinrk-ii.txt.xml •MLA-Zoya.txt.xml •NHM-Sállosa.txt.xml •NHM-luktfri_møkkaspre,_sami.txt.xml •20_jagi_dassái.txt.xml •AJ-Guttorm.txt.xml •AJ-manaid_mánna.txt.xml •AJ-per_iver_turi.txt.xml •AJ-Ravdna.txt.xml •ÅP-christer.txt.xml •ÅP-duodji-matki_aiggi_cada.txt.xml •ÅP-Min_Áigi_ovdána.txt.xml •ÅP-Zapp_me.txt.xml •HRM-Utstilling_i_Tromsö,_sami.txt.xml •IU-massan_doarjaga.txt.xml •IU-massan_doarjaga3.txt.xml •IU-Mathis_Ailu.txt.xml •Leder_13 •lohkki_NAS.txt.xml •Ny-AJ-per_iver_turi.txt.xml •PP-Erkke_Ánde_2.txt.xml •privahta_almmuhusa,_3_geardde.txt.xml •_20__jagi_das_ovdal.txt.xml •AJ-ragnild_lydia_Nystad.txt.xml •ÅP-kitok_veaddeduodji.txt.xml •ÅP-Sámedikkit_deaivvadit.txt.xml •IU-10000rein.txt.xml •IU-biedganan.txt.xml •IU-boazodoallosiehtadus.txt.xml •Kronikk-_Helga_Pedersen,_sami.txt.xml •lohkki,_16 •MLA-Hammerfeast_satnejodiheadd.txt.xml •MLA-Olli_ja_fala.txt.xml •NHM-musihkkahoavda.txt.xml •uhca-ÅP-nissonat_mahttet.txt.xml •uhca-ÅP-nordfors-MÅ_MED.txt.xml -F•crossbane,_sami.txt.xml -F•Fakta_om_drag,_sami.txt.xml •_FP-_hyperrask_bane,_sami.txt.xml •Fakta_om_drag,_sami1.txt.xml •AJ-Egil_Utsi.txt.xml •AJ-midttun.txt.xml •AJ-Nils_Utsi.txt.xml •alm_Altta_siida.txt.xml 
•alm_rabas_virggit,_Alta.txt.xml •ann_kultur,_318__tegn.txt.xml •ÅP-doarjjadoalut.txt.xml •Diedut_govvamuituide.txt.xml •IU-anarjoga_aidi3.txt.xml •IU-anárjoga_áidi.txt.xml •IU-anárjoga_áidi2.txt.xml •NHM-gollenieida_steira.txt.xml •NHM-kristin_áhcci.txt.xml •NHM-Odda_NRK_hoavddat.txt.xml •NHM-Rábmováhnemiid.txt.xml •NHM-Sátnejodiheaddji.txt.xml •UHCA-_coop_utbeta,_sami.txt.xml •20_jagi_DÁS_OVDAL.txt.xml •AJ-harriet.txt.xml •IU-Johttan_davas.txt.xml •IU-laikes_boazodoallit.txt.xml •IU-MNS_duhtavas.txt.xml •IU-Sponheim_KTK-as.txt.xml •leder_nr_17 •MLA-STK_okkupasjon.txt.xml •NHM-Vintertur_i_fokus,sami.txt.xml •UHCA-_NB_FREDAG!!!!.txt.xml

albbas commented 18 years ago

Comment 926

Date: 2006-05-11 10:15:32 +0200 From: Saara Huhmarniemi <>

Most of these files are now fixed. The file names have been changed so that the leading dot is replaced with an underscore (_). Some of the MinAigi files are still unconverted (because of further file name, Perl, and character encoding problems), and there is work left to do on the format (@-tags). However, all the files that have been converted to XML should now be analyzable. I will leave the bug open until the remaining problems are solved.

albbas commented 18 years ago

Comment 952

Date: 2006-05-15 15:04:39 +0200 From: Saara Huhmarniemi <>

The @-tags are now taken into account in the conversion process. There may be some errors, e.g. due to a missing @ in front of the keyword in the original document. The extra xml-tags are removed as well (e.g. !q>). The 2003 files are now reconverted; the other directories will follow.

albbas commented 18 years ago

Comment 960

Date: 2006-05-16 15:14:12 +0200 From: Trond Trosterud <>

ccat -r zcorp/gtbound/sme/news/MinAigi/2003/ | less

10A oahppit leat dlvi mieht rhkkanan klssamtki E_landii. Sii leat ovttas vhnemiiguin _oaggn ru_aid, loaddavuovdi

=========> All Sámi characters are lost (dálvi, ráhkkanan, klássamátki, Eŋlandii).

albbas commented 18 years ago

Comment 961

Date: 2006-05-16 15:26:57 +0200 From: Trond Trosterud <>

Sorry, my last message was accidentally written on G5, not on victorio (hard to see the difference...). On victorio, everything works fine: Golbma lávvardaga maŋŋálágaid čájeha TV2 sámi dokumentáraid. Ihttin diibmu 13.40 lea vuosttaš oassi. Mii guovlalat sihke Norggas, Suomas, Ruoŧas ja Ruoššas. Dás lea kultuvra, nugo giella, luohti, dálkkudanvuogit ja sámi bajásgeassin guovddá žis. Gitta 1956 rádjai eai beassan sámi mánát sámástit skuvllas. Jus dahke dan de ráŋggáštuvvoje. Dasa lassin máhccat ru

albbas commented 18 years ago

Comment 962

Date: 2006-05-16 18:20:55 +0200 From: Saara Huhmarniemi <>

The first analysis was correct. There is a real problem with at least some of the 2003-files, like gtbound/sme/news/MinAigi/2003/10A_på_klassetur.txt.xml

The Sámi characters are lost somewhere during the process. I'll see what's wrong.

albbas commented 18 years ago

Comment 965

Date: 2006-05-19 14:56:51 +0200 From: Saara Huhmarniemi <>

The problem with this file is that it contains no Sámi characters except á. The process of guessing the encoding is based on counting the occurrences of Sámi characters, and since there are none, it fails. I have now added á to the set of tested Sámi characters. It was not there before, since it is often correctly encoded even when the rest of the document is not. The statistics should handle the change without errors.
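To illustrate the counting approach described above, here is a minimal Python sketch. It is not the converter's actual code (which, per the earlier comments, involves Perl); the candidate encoding list and the function name `guess_encoding` are assumptions made for the example. It decodes the raw bytes under each candidate encoding and picks the one that yields the most Sámi letters, so a file whose only Sámi letter is á is undetectable unless á is in the tested set.

```python
# Northern Sámi letters outside ASCII; "á" is the letter added in this comment.
SAMI_CHARS = frozenset("áčđŋšŧžÁČĐŊŠŦŽ")

# Hypothetical candidate list; the real converter's candidates are not given here.
CANDIDATES = ("utf-8", "iso8859_10", "iso8859_4")


def guess_encoding(raw, sami_chars=SAMI_CHARS, candidates=CANDIDATES):
    """Return the candidate encoding that yields the most Sámi letters.

    Returns None when no candidate produces any Sámi letter at all,
    which is the failure mode described for 10A_på_klassetur.
    """
    best, best_score = None, 0
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; skip it.
            continue
        score = sum(ch in sami_chars for ch in text)
        if score > best_score:
            best, best_score = enc, score
    return best


raw = "dálvi miehtá ráhkkanan".encode("utf-8")
# Without "á" in the tested set, nothing is counted and the guess fails.
without_a = guess_encoding(raw, sami_chars=SAMI_CHARS - {"á", "Á"})
# With "á" included, the UTF-8 reading is the only one that scores.
with_a = guess_encoding(raw)
```

Ties between candidates are resolved here by list order, which is a simplification; a real guesser would need a more careful statistic.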

albbas commented 18 years ago

Comment 990

Date: 2006-06-05 16:18:18 +0200 From: Saara Huhmarniemi <>

I'm not able to determine the encoding of the following files:

MinAigi/2003/Ássi_gáldus.txt MinAigi/2003/Eldrebølgen.doc

what do you think?

albbas commented 18 years ago

Comment 991

Date: 2006-06-06 00:12:43 +0200 From: Trond Trosterud <>

I copied the two files to my local machine and had a look at them. Eldrebølgen.doc (a file in Norwegian, btw.) opened on my local Mac without problems (the command "open Eldrebølgen.doc" rendered it ok in Word, with all æøå-s in place). Why it cannot be converted I thus do not understand; it should be a routine task.

As for Ássi_gáldus.txt, it turned out to be harder. The caron letters (š, ž) came out as sˇ and zˇ, and the other ones as identical question marks. It seems the document started out as e.g. Winsam (or even UTF-8) and was then perhaps opened in a Mac Classic version of some program. I remember seeing the "delayed carons" when opening Sámi UTF-8 pages in a web browser in Mac OS 9. The real question is of course what happened to the other 5 letters. If they DO have different representations, we may dig out the correct values, but if they are all reduced to the same question mark (I don't have a hex editor), then we will have to drop this (and similar) file(s).
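In the absence of a hex editor, the question of whether the lost letters still have distinct byte values can be checked with a few lines of Python. This is only a sketch: the helper names are made up for the example, and it says nothing about which encoding the file actually uses. If several different non-ASCII byte values show up, the letters may be recoverable; if the file mostly contains the literal ASCII "?" (0x3F), the information is gone.

```python
from collections import Counter


def non_ascii_byte_counts(raw):
    """Count each byte value >= 0x80, i.e. everything outside plain ASCII."""
    return Counter(b for b in raw if b >= 0x80)


def question_mark_share(raw):
    """Fraction of all bytes that are a literal ASCII '?' (0x3F)."""
    return raw.count(b"?") / len(raw) if raw else 0.0


def report(raw):
    """One 'value: count' line per distinct non-ASCII byte, sorted by value."""
    return [f"0x{value:02X}: {n}"
            for value, n in sorted(non_ascii_byte_counts(raw).items())]


# Usage on a local copy (the path is whatever the file is called locally):
#   with open("Ássi_gáldus.txt", "rb") as fh:
#       raw = fh.read()
#   print("\n".join(report(raw)), question_mark_share(raw))
```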

albbas commented 18 years ago

Comment 1020

Date: 2006-06-15 09:40:25 +0200 From: Trond Trosterud <>

Bug #307 has been marked as a duplicate of this bug.

albbas commented 18 years ago

Comment 1023

Date: 2006-06-15 09:53:06 +0200 From: Børre Gaup <>

For those documents that have such big encoding problems that they're not usable, I suggest we edit them manually, then regenerate them ...

albbas commented 18 years ago

Comment 1024

Date: 2006-06-15 10:01:27 +0200 From: Trond Trosterud <>

My comments on the duplicate bug did not carry over here. The thing is that the 6 Sámi letters all seem to have been replaced with the same character "_" (underscore). Of course, it is possible to read through the files and fill in the missing characters manually, but at the moment I do not consider it a sensible way of spending time. In case future historians want to use our bases we may perhaps keep them in the orig repository, but in the derived version they are just noise, and they should not be generated. We may thus mark them as "do not generate" in their respective xsl files, or (the easy solution) we may just remove them from the orig catalogue. Removing things, as a principle, feels bad (it seems someone comes and wants a second look just after we have deleted them), so if we could have a "do-not-generate"-type xsl file for them instead, it could perhaps be ok. I add my original comment from the duplicate bug here:

albbas commented 18 years ago

Comment 1026

Date: 2006-06-15 10:02:26 +0200 From: Trond Trosterud <>

It seems the following 2003 files have got the Sámi characters conflated. Cf. the following text snip, where all Sámi characters except á are represented by underscore:

10A_på_klassetur.txt.xml:

10A oahppit leat dálvi miehtá ráhkkanan klássamá tkái E_landii. Sii leat ovttas váhnemiiguin _oaggán ru_aid, loaddavuovdima ja ka fea bargguin. Lea oalle rah_amu leama_an gártadit ru_aid gok_at mátkegoluid, m uhto á_gir – ja vi__alvuo_ain leat gártadan dan maid sii dárbbá_edje mátkái. Ulb mil mátkiin lei beassat geavahit e_gelasgiela ja maiddái oahpásmuvvat eará kultu vrrain.

For 2003, it is 51 out of 1632 files; for 2004 it is 6 out of approx. 6600 files. I suggest we look through the MinAigi corpus, check whether the files are garbled beyond rescue, and then remove them from the corpus (or possibly keep them in the orig but blocked from being generated).

grep "__" * | cut -d":" -f1 | uniq | l

10A_på_klassetur.txt.xml alm_Allaskuvla.txt.xml alm_De_samiske_samlinger.txt.xml alm_Finnm_AP.txt.xml alm_Finnm_miljøtj.txt.xml alm_Karasjok_kom_6_feb_prog_NY.txt.xml alm_Nesseby_komm.txt.xml alm_reindr_agronom.txt.xml alm_Sami_Daiddaraddi.txt.xml alm_Sámi_giellaguovddas.txt.xml alm_Sami_Instituhtta.txt.xml almSDGBF.txt.xml alm-_SDR.txt.xml alm_suoma_samediggi.txt.xml alm_Urfolksenter.txt.xml alm_Vardobaiki.txt.xml arvvostallan2copy.txt.xml ceavgegeadgi-_notis.txt.xml deanu_cealkamus.txt.xml Dearv_Valentina.txt.xml EA_020202_manna.txt.xml EA-nordlys_ut.txt.xml Finnmarksloven.txt.xml Folk_erfolk-_kulturforskjelle.doc.xml Fredagsavisa-lohkkicalusSMM.txt.xml Giitu-_Anne_lise.txt.xml LESERINNLEGG.txt.xml lohkki-bieggamillot.txt.xml Lohkkicalus2.txt.xml Lohkkicalus.txt.xml lohkkiid-anders_jh_eira.txt.xml LOhkkiid_privahta_skuvla.txt.xml lohkki-kirku.txt.xml lujavri_aviissa_haga.txt.xml Máret_Sara-_Lohkkicalus.txt.xml Marka103.txt.xml Muitosatni.txt.xml PP-gielddahoavda_copy.txt.xml presideanta_sárdni_nr_1.txt.xml revsnesham_samisk.txt.xml riddu_riddu_samisk.txt.xml s12_Skoleportala.txt.xml s13_Garegasnjargga_bankku.txt.xml s13_LAN.txt.xml s16_Info_nuorra.txt.xml s17_Gran_Canaria.txt.xml s17_Raste.txt.xml sametinget_HASTER.txt.xml SAN-gaskasiidu_rep.txt.xml uhca-_sami_allaskuvla_ap.txt.xml uhca-_vitenskap_i_kauto.txt.xml
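A plain search for "__" can also hit files where the double underscore is used for some other purpose. As a rough alternative, the share of words that contain an underscore next to letters (as in "E_landii" or "vi__alvuo_ain") can be measured per file. The sketch below is only an illustration; the function names and the 5 % threshold are assumptions, not something used in the corpus tools.

```python
import re

# A run of word characters that contains at least one underscore.
UNDERSCORE_TOKEN = re.compile(r"\w*_\w*")


def suspect_tokens(text):
    """Return tokens that mix an underscore with at least one letter."""
    return [tok for tok in UNDERSCORE_TOKEN.findall(text)
            if any(ch.isalpha() for ch in tok)]


def is_garbled(text, threshold=0.05):
    """True when the share of suspect tokens among all words exceeds threshold.

    The threshold is a hypothetical value and would need tuning on real files.
    """
    words = text.split()
    if not words:
        return False
    return len(suspect_tokens(text)) / len(words) > threshold


sample = "E_landii. Sii leat _oaggán ru_aid"
found = suspect_tokens(sample)
```

A file that uses "__" once as a separator would fall below the threshold, while a file with conflated Sámi letters has underscores spread through much of the running text.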

albbas commented 17 years ago

Comment 1173

Date: 2006-11-04 12:44:45 +0100 From: Saara Huhmarniemi <>

This bug is now finally fixed: the problematic files are no longer included in the conversion. There were 2 or 3 files in the list where "__" was used for some other purpose.