Closed albbas closed 13 years ago
Date: 2011-06-14 11:28:06 +0200
From: Trond Trosterud <
Looking at the first OTP in stable (r1559), we find the following:
Files 1, 4, 6, 9, 10, 11, 12 are ok.
Files 2, 3, 5, 7, 8 contain lists of the following format:
rett til opphold med rein og til ferdsel, flytting og flyttleier
These lists are not converted to xml and are thus missing from the converted files.
This is probably a variant of the earlier "no header bug", and if so, the solution is the same: Rewrite the conversion script to include lists as well.
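To make the requested behaviour concrete, here is a minimal sketch in Python. It is not the project's actual conversion script (that one is written in xsl). It assumes well-formed input and borrows the target markup (`<list>` with `<p type="listitem">` children) from the converted output quoted further down in this thread. The function name `convert_lists` is hypothetical.

```python
import xml.etree.ElementTree as ET


def convert_lists(html_root):
    """Return one <list> element per <ul>/<ol> found in html_root (flat lists only)."""
    converted = []
    for html_list in html_root.iter():
        if html_list.tag not in ("ul", "ol"):
            continue
        out = ET.Element("list")
        for li in html_list.findall("li"):
            p = ET.SubElement(out, "p", {"type": "listitem"})
            # Collect all text inside the item and normalise whitespace
            p.text = " ".join("".join(li.itertext()).split())
        converted.append(out)
    return converted


# Sample item taken from the list format quoted above
sample = ET.fromstring(
    "<div><ul><li>rett til opphold med rein og til ferdsel, "
    "flytting og flyttleier</li></ul></div>"
)
lists = convert_lists(sample)
print(ET.tostring(lists[0], encoding="unicode"))
```

The sketch deliberately ignores nested lists; it only shows that list content ends up in the output instead of being dropped.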
Date: 2011-06-14 15:59:22 +0200
From: Sjur Nørstebø Moshagen <
I'm working on this, and I believe that I have solved the missing list issue. I do still have one problem left, which is of a slightly opposite nature:
the double list problem. Nested lists appear twice: once as plain text within the parent list element, and once more as a proper list within the list item. As soon as I have resolved this, I will convert the documents anew, verify that the conversion is an improvement, and commit to the stable corpus branch.
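The duplication described here typically arises when the text of a list item is collected from all its descendants, nested list included. The following Python sketch illustrates the effect and one way around it. It is an illustration under that assumption, not the xsl code in question, and `own_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

LIST_TAGS = ("ul", "ol")


def own_text(elem):
    """Collect the text of elem, skipping the content of nested lists."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in LIST_TAGS:
            parts.append(own_text(child))
        # Text following a child belongs to elem in either case
        parts.append(child.tail or "")
    return "".join(parts)


item = ET.fromstring("<li>Outer item <ul><li>inner item</li></ul></li>")

# Naive extraction pulls the nested list into the parent item's text ...
naive = "".join(item.itertext())
# ... while own_text() leaves the nested list to be converted separately
careful = own_text(item)

print(repr(naive))
print(repr(careful))
```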
Date: 2011-06-14 17:45:00 +0200
From: Sjur Nørstebø Moshagen <
There are actually two quite fundamental issues at play here, both bugs:
1) doing paragraph-level corrections deletes nested elements like lists within lists
2) using the string-correction template duplicates the input strings
Each of these two is quite a serious bug, but they only show up if you uncomment the paragraph-level corrections commented out at the end of each xsl file.
As such, the bugs are probably not affecting a lot of files, but they of course need to be solved. I will now comment out these blocks for the files already in stable/, so that we can get a new conversion with hopefully close to acceptable quality.
I will also file separate bugs for the two issues at hand (and make this bug dependent on those new bugs - that means this one can't be closed before both those bugs are fixed).
Date: 2011-06-14 18:29:04 +0200
From: Sjur Nørstebø Moshagen <
Hm, I'm not so sure. Checking with our corpus DTD, I think it says that lists can be nested within lists, but not within listitems, as you can do in html. D****!!! I thought I was so close - now it is back to square one.
Sorry, no updated files today - at least not yet. I have to rest and resettle before I can try to redo these things again, now in accordance with our DTD.
No new bugs filed - there's no point as long as the structure is invalid (but xmllint didn't complain, which was kind of strange).
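A sketch of the structure this reading of the DTD would call for: the nested list becomes a sibling of the list items inside the parent list, not a child of a list item. The Python below is only an illustration with a hypothetical `convert_list` function; for brevity it keeps only the text directly inside each item. As for xmllint staying silent: it only checks well-formedness unless validation is requested (for instance with `--valid` or `--dtdvalid`), which is one possible explanation.

```python
import xml.etree.ElementTree as ET


def convert_list(html_list):
    """Convert a <ul>/<ol> so that nested lists sit beside, not inside, list items."""
    out = ET.Element("list")
    for li in html_list.findall("li"):
        p = ET.SubElement(out, "p", {"type": "listitem"})
        # Simplification: only the text directly inside the item is kept
        p.text = (li.text or "").strip()
        for nested in li:
            if nested.tag in ("ul", "ol"):
                # Appended to the parent list, i.e. a sibling of the list item
                out.append(convert_list(nested))
    return out


html = ET.fromstring("<ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>")
result = convert_list(html)
print(ET.tostring(result, encoding="unicode"))
```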
Date: 2011-06-15 07:39:00 +0200
From: Sjur Nørstebø Moshagen <
Got the xml structure for nested lists correct, and got the content in there as well. Getting closer...
Still some content is missing, namely divs containing spans and a elements mixed with PCDATA (mixed content). Only the text directly within the div is converted; the rest of the text is still lost.
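The symptom matches what happens when only an element's direct text is read. A small Python illustration (not the xsl in use here, and with made-up sample text) of the difference between direct text and the full text of a mixed-content element:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    '<div>Text in the div, <span>text in a span</span> and '
    '<a href="#">a link</a> with a tail.</div>'
)

# Only the text before the first child element
direct_only = div.text

# All text nodes in document order, including child elements and their tails
full_text = "".join(div.itertext())

print(repr(direct_only))
print(repr(full_text))
```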
Date: 2011-06-15 12:31:01 +0200
From: Sjur Nørstebø Moshagen <
With help from Ciprian, the final issues with the empty element conversion were solved. The conversion now seems mostly fine, with the following notes:
File 8: italic elements within p vanish, which leads to a couple of cases of run-together words (i.e. the original does not contain any space between the tag and the following word, and after conversion there is no way to separate the two words anymore). See "guovllustivrratoaivvildit" in section 8.2 of that file for an example.
File 12: all lists are extended beyond their real content, including the following two paragraphs as regular list item content. This happens in §§ 25, 32, 47, 50, 54, 57, 62, 79.
There are some occurrences of wrong language encoding.
I'll compare with the previous version in the corpus repository, and then commit the conversion result I have at the moment if this new version is an improvement. Please note that the committed files will still have the issues above.
Date: 2011-06-15 12:55:27 +0200
From: Sjur Nørstebø Moshagen <
Sme files committed. Please check. I'll convert the nob files now.
Date: 2011-06-15 13:31:26 +0200
From: Berit Nystad Eskonsipo <
I have checked file nr 2 and I see that the lists above the text are also converted to xml. In the previous version this part was skipped. I don't think this will be a problem because the nob version also has these lists - in nob.
Here are the lists I'm talking about: Mana ohcamii Mana sisdollui Mana bajimus navigašuvdnii Mana báikkálaš navigašuvdnii Bokmål Nynorsk Sámegiella English
ráđđehus.no Stoltenberga nubbi ráđđehus Departemeanttat Stáhtaministara kantuvra Bargodepartemeanta Mánáid-, dásseárvo- ja searvadahttindepartemeanta Ruhtadandepartemeanta Guolástus- ja riddodepartemeanta Ođasmahttin-, hálddahus- ja girkodepartemeanta Suodjalusdepartemeanta Dearvvašvuođa- ja fuolahusdepartemeanta Justis- ja politidepartemeanta Gielda- ja guovlodepartemeanta Kulturdepartemeanta Máhttodepartemeanta Eanandoallo- ja biebmodepartemeanta Birasgáhttendepartemeanta Ealáhus-ja gávpedepartemeanta Oljo- ja energiijadepartemeanta Johtalusdepartemeanta Olgoriikadepartemeanta
Oza Eanandoallo- ja biebmodepartemeantta siidduin Oza olles ráđđehusa.no:as Juogat/CavgilGovčča Čálánhápmi
Fáddá A-Å Neahttakarta Hjelp Oktavuohta Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < 2 Láhkaevttohusa duogáš
(In reply to comment #6)
Sme files committed. Please check. I'll convert the nob files now.
Date: 2011-06-15 14:22:56 +0200
From: Sjur Nørstebø Moshagen <
Nob files committed. Changing the priority a bit, I have to do other things now, and the material should be good enough for others to look at.
We could consider closing this bug, since the original issue has been solved.
(In reply to comment #7)
I have checked file nr 2 and I see that the lists above the text are also converted to xml. In the previous version this part was skipped. I don't think this will be a problem because the nob version also has these lists - in nob.
Yes, the lists are converted in both versions. By default, it is an error if such lists are not converted (i.e. it is a sign that something is wrong with the conversion). Nevertheless, I discussed these lists with Trond, and we agreed to take a conservative approach: lists (and template content, often in other languages) that are clearly not part of the text, and clearly of no use to the work with Sámi language technology, should be skipped if such parts can be uniquely identified.
Here are the lists I'm talking about: Mana ohcamii Mana sisdollui Mana bajimus navigašuvdnii Mana báikkálaš navigašuvdnii
This list can be useful in a future localisation project - converted in all languages.
Bokmål Nynorsk Sámegiella English
Such lists as this one are almost never useful, and as you'll see, the language list is not converted in the newest version in stable/.
ráđđehus.no Stoltenberga nubbi ráđđehus Departemeanttat Stáhtaministara kantuvra Bargodepartemeanta Mánáid-, dásseárvo- ja searvadahttindepartemeanta Ruhtadandepartemeanta Guolástus- ja riddodepartemeanta Ođasmahttin-, hálddahus- ja girkodepartemeanta Suodjalusdepartemeanta Dearvvašvuođa- ja fuolahusdepartemeanta Justis- ja politidepartemeanta Gielda- ja guovlodepartemeanta Kulturdepartemeanta Máhttodepartemeanta Eanandoallo- ja biebmodepartemeanta Birasgáhttendepartemeanta Ealáhus-ja gávpedepartemeanta Oljo- ja energiijadepartemeanta Johtalusdepartemeanta Olgoriikadepartemeanta
This list is definitely useful, and should always be converted. The existence of such lists might be problematic in certain contexts (e.g. it will skew some types of statistics), but we'll have to deal with those issues outside of the conversion process.
Oza Eanandoallo- ja biebmodepartemeantta siidduin Oza olles ráđđehusa.no:as Juogat/CavgilGovčča Čálánhápmi
Not very useful, and skipped.
Fáddá A-Å Neahttakarta Hjelp Oktavuohta
Potentially useful, thus included.
Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < 2 Láhkaevttohusa duogáš
Potentially useful, and not uniquely identifiable, thus kept.
The decision on what to include and what to exclude might seem random, and I'm very open to discussion.
Date: 2011-06-15 15:07:57 +0200
From: Berit Nystad Eskonsipo <
I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml
In this file the lists above the text are missing:
Om lov om reindrift (reindriftsloven)
Søk hos Landbruks- og matdepartementet
Søk på hele regjeringen.no
Du er her: < < <
Ot.prp. nr. 25 (2006-2007)
Om lov om reindrift (reindriftsloven)
Bla i dokumentet: | |
Forslag til lov om reindrift (reindriftsloven)
Kapittel 1 Innledende bestemmelser
Here is the same part in the sme file:
Boazodoallolága birra
Mana ohcamii
Mana sisdollui
Mana bajimus navigašuvdnii
Mana báikkálaš navigašuvdnii
ráđđehus.no
Stoltenberga nubbi ráđđehus
Stáhtaministara kantuvra
Bargodepartemeanta
Mánáid-, dásseárvo- ja searvadahttindepartemeanta
Ruhtadandepartemeanta
Guolástus- ja riddodepartemeanta
Ođasmahttin-, hálddahus- ja girkodepartemeanta
Suodjalusdepartemeanta
Dearvvašvuođa- ja fuolahusdepartemeanta
Justis- ja politidepartemeanta
Gielda- ja guovlodepartemeanta
Kulturdepartemeanta
Máhttodepartemeanta
Eanandoallo- ja biebmodepartemeanta
Birasgáhttendepartemeanta
Ealáhus-ja gávpedepartemeanta
Oljo- ja energiijadepartemeanta
Johtalusdepartemeanta
Olgoriikadepartemeanta
Oza Eanandoallo- ja biebmodepartemeantta siidduin
Oza olles ráđđehusa.no:as
Fáddá A-Å
Neahttakarta
Hjelp
Oktavuohta
Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < Evttohus
Od.prp. nr. 25 (2006-2007)
Boazodoallolága birra
Bláđe dokumeanttas: < Ráva | | Ovdasiidu
Evttohus
Boazodoalloláhka
Kapihtal 1 Álggaheaddji mearrádusat
This bug is also in the footer:
nob:
Bla i dokumentet: | |
Skip left menu navigation
Landbruks- og matdepartementet, Akersgt. 59 (R5), Postboks 8007 Dep. 0030 Oslo l Tlf: 22 24 90 90, Faks: 22 24 95 55 E-post: l Nettredaksjonen: Ansvarlig redaktør: l Nettredaktør:
sme:
Bláđe dokumeanttas: < Ráva | | Ovdasiidu
There will be problems when we try to parallelize these documents.
Date: 2011-06-15 15:12:19 +0200
From: Ciprian Gerstenberger <
I have a suggestion wrt. this
If the result of paragraph-making procedure is an empty paragraph this should be skipped altogether.
(In reply to comment #9)
I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml [...] There will be problems when we try to parallelize these documents.
Date: 2011-06-15 15:30:22 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #9)
I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml [...] There will be problems when we try to parallelize these documents.
Cf svn commit r1564 in $GTFREE:
"* file 12 is not updated because of conversion errors."
That is, the nob file 12 is not yet in comparable shape. Hopefully I will be able to correct it later today or tomorrow.
(In reply to comment #10)
I have a suggestion wrt. this
If the result of paragraph-making procedure is an empty paragraph this should be skipped altogether.
I agree, as long as it is done carefully. Empty elements can also be a sign of conversion errors.
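One way to skip empty paragraphs "carefully" is to remove them but keep a record of what was removed, so that an unexpectedly high count can be inspected as a possible conversion error. A minimal Python sketch of that idea follows; `drop_empty_paragraphs` is a hypothetical name, the sample document is made up, and note that removing an element this way also drops any text that follows it directly.

```python
import xml.etree.ElementTree as ET


def drop_empty_paragraphs(root):
    """Remove <p> elements without text or children; return the removed elements."""
    removed = []
    for parent in list(root.iter()):
        # findall() returns a list, so removing while looping is safe
        for p in parent.findall("p"):
            if len(p) == 0 and not (p.text or "").strip():
                parent.remove(p)
                removed.append(p)
    return removed


doc = ET.fromstring(
    '<body><p type="text">Kapittel 1</p><p type="text"></p>'
    '<list><p type="listitem">  </p><p type="listitem">1. brensel,</p></list></body>'
)
removed = drop_empty_paragraphs(doc)
# Reporting the count makes suspicious conversions easier to spot
print(len(removed), "empty paragraph(s) skipped")
```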
A general comment: when quoting code or earlier comments, try to not quote more than strictly necessary to convey your point. Less is more:)
Date: 2011-06-16 12:55:55 +0200
From: Sjur Nørstebø Moshagen <
The NOB file no 12 is now also successfully converted and committed (r1567), including all lists. This bug is one step closer to being closed.
Date: 2011-06-16 14:09:49 +0200
From: Berit Nystad Eskonsipo <
This applies to the lists in nob and sme file no 12 - §§ 25, 32, 47, 50, 54, 57, 62, 79.
Also the text under the lists is converted as a part of the lists:
Here is the list in nob no 12 - § 25:
<list>
<p type="listitem">1. brensel,</p>
<p type="listitem">2. gammer, koier, buer
eller stillinger for oppbevaring av
løsøre og matvarer,</p>
<p type="listitem">3. teltstenger,
redskaper og enklere bruksting,</p>
<p type="listitem">4. arbeidsgjerder
(trøer, ringgjerder),</p>
<p type="listitem">5. garving.</p>
<p type="listitem">Friskt lauvtrevirke og
friske busker må ikke tas så fremt
det på stedet eller i nærheten finnes
annet virke som er tjenlig for
formålet.Skogeieren kan kreve
betaling for friske lauvtrær som tas
i privat skog, men ellers kan det
ikke kreves betaling for virke som
rettmessig blir tatt i medhold av
denne paragraf. Det skal uten opphold
gis melding til grunneieren om uttak
av trevirke som denne kan kreve
betaling for. Oppnås ikke enighet om
betalingen, kan beløpets størrelse
kreves fastsatt ved skjønn ved
jordskifteretten. Finnmarkseiendommen
kan ikke kreve betaling etter
bestemmelsene i leddet her.Så langt det
fremstiller seg som nødvendig av
hensyn til skogens bevaring,
foryngelse eller gjenvekst eller
fordi det er mangel på trevirke i
distriktet, kan Kongen ved forskrift
begrense eller helt forby uttak av
trevirke i nærmere bestemte områder
og derunder bl.a. bestemme at friskt
virke bare kan tas etter
utvising.</p>
</list>
Date: 2011-06-16 16:20:45 +0200
From: Sjur Nørstebø Moshagen <
Thanks for the feedback, I had forgotten about those, even though I noticed earlier. Working on them now.
Date: 2011-06-16 18:42:14 +0200
From: Sjur Nørstebø Moshagen <
Now also the lists in file 12 should be correct. Please check and report back.
Date: 2011-06-17 10:13:31 +0200
From: Berit Nystad Eskonsipo <
The lists in nob and sme No 12 do not end at the same place as in the orig file.
Converted: No 12 nob - list in § 25:
5. garving.
<p type="text">Friskt lauvtrevirke og ....
Orig html source: No 12 nob - list in § 25:
5. garving.
Friskt lauvtrevirke og
The list in the converted file should end after the last listitem, and not after the last paragraph in the section as it does now. §§ 25, 32, 47, 50, 54, 57, 62, 79 in nob and sme No 12 contain lists of this format.
Date: 2011-06-17 10:46:48 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #16)
The lists in nob and sme No 12 do not end at the same place as in the orig file.
Yes, they do.
Converted: No 12 nob - list in § 25:
5. garving.
<p type="text">Friskt lauvtrevirke og ....
Orig html source: No 12 nob - list in § 25:
5. garving.
Friskt lauvtrevirke og
The outer ul element is there, and includes the non-list paragraphs.
The list in the converted file should end after the last listitem, and not after the last paragraph in the section as it does now.
I agree, but the problem is the input document. The output follows the input exactly - just open the html file in SubEthaEdit, choose the menu item "Format > Tidy and Pretty Print HTML", and then find one of the lists. You'll see that all of them are exactly like this.
It might be possible to work around this, but it will take quite some work, and as it is now we don't lose any content. The structure isn't 100% semantically correct, but pretty close (the paragraphs in question are not marked as listitems, but as regular text).
Also, I am a bit hesitant to do too much specific processing in a general conversion routine. I'll instead try to do this in the file-specific xsl. That might actually be a lot easier.
Date: 2011-06-17 10:58:27 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #17)
(In reply to comment #16)
The lists in nob and sme No 12 do not end at the same place as in the orig file.
Yes, they do.
But thanks for noticing the problem - I hadn't.
Also, I am a bit hesitant to do too much specific processing in a general conversion routine. I'll instead try to do this in the file-specific xsl. That might actually be a lot easier.
It was actually very easy when working with the intermediate xml instead of the source html:)
Fixed, and will be committed in a few moments.
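For readers wondering what such a fix on the intermediate xml might look like, here is a rough Python sketch. It is not the file-specific xsl that was actually committed, only an illustration of the same idea: trailing paragraphs that are not list items are moved out of the list and placed right after it. The function name is hypothetical, and the markup follows the snippets quoted in this thread.

```python
import xml.etree.ElementTree as ET


def close_lists_after_last_item(root):
    """Move trailing non-listitem <p> children of each <list> to after the list."""
    for parent in list(root.iter()):
        for lst in parent.findall("list"):
            children = list(lst)
            trailing = []
            # Walk backwards over paragraphs that are not list items
            while (children and children[-1].tag == "p"
                   and children[-1].get("type") != "listitem"):
                trailing.insert(0, children.pop())
            position = list(parent).index(lst) + 1
            for offset, p in enumerate(trailing):
                lst.remove(p)
                parent.insert(position + offset, p)


doc = ET.fromstring(
    '<section><list><p type="listitem">5. garving.</p>'
    '<p type="text">Friskt lauvtrevirke og ....</p></list></section>'
)
close_lists_after_last_item(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Working on the intermediate xml keeps the general conversion routine untouched, which matches the concern about file-specific processing raised above.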
Date: 2011-06-17 11:03:27 +0200
From: Sjur Nørstebø Moshagen <
The lists are fixed in r1583. Is this bug ready to be closed?
Date: 2011-06-17 11:33:59 +0200
From: Berit Nystad Eskonsipo <
The lists are great now! For me it looks like the bug is fixed.
Date: 2011-06-17 11:48:24 +0200
From: Sjur Nørstebø Moshagen <
This bug is fixed. Similar problems with other files should be given a new bug.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1058
Date: 2011-06-14T11:28:06+02:00 From: Trond Trosterud <>
To: Sjur Nørstebø Moshagen <>
CC: berit.nystad.eskonsipo, borre.gaup, ciprian.gerstenberger, sjur.n.moshagen, tomi.k.pieski, trond.trosterud
Last updated: 2011-06-17T11:48:24+02:00