giellalt / bugzilla-dummy

Lists are not converted in r1559 of admin/depts/other_files/OTP200620070025000SE_* (Bugzilla Bug 1058) #29

Closed albbas closed 13 years ago

albbas commented 13 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1058

Date: 2011-06-14T11:28:06+02:00 From: Trond Trosterud <> To: Sjur Nørstebø Moshagen <> CC: berit.nystad.eskonsipo, borre.gaup, ciprian.gerstenberger, sjur.n.moshagen, tomi.k.pieski, trond.trosterud

Last updated: 2011-06-17T11:48:24+02:00

albbas commented 13 years ago

Comment 4486

Date: 2011-06-14 11:28:06 +0200 From: Trond Trosterud <>

Looking at the first OTP in stable (r1559), we find the following:

Files 1, 4, 6, 9, 10, 11, 12 are ok.

Files 2, 3, 5, 7, 8 contain lists of the following format:

These lists are not converted to xml, they are thus missing.

This is probably a variant of the earlier "no header bug", and if so, the solution is the same: Rewrite the conversion script to include lists as well.

albbas commented 13 years ago

Comment 4490

Date: 2011-06-14 15:59:22 +0200 From: Sjur Nørstebø Moshagen <>

I'm working on this, and I believe that I have solved the missing list issue. I still have one problem left, which is of a somewhat opposite nature: the double list problem. Nested lists appear twice, once as pure text within the parent list element, and once as a proper list within the list item. As soon as I have resolved this, I will convert the documents anew, verify that the conversion is an improvement, and commit to the stable corpus branch.

albbas commented 13 years ago

Comment 4494

Date: 2011-06-14 17:45:00 +0200 From: Sjur Nørstebø Moshagen <>

There are actually two quite fundamental issues at play here, both bugs:

1) doing paragraph-level corrections deletes nested elements like lists within lists
2) using the string-correction template duplicates the input strings

Each of these two is a quite serious bug, but they only show up if you uncomment the paragraph-level corrections commented out at the end of each xsl file.

As such, the bugs are probably not affecting a lot of files, but they of course need to be solved. I will now comment out these blocks for the files already in stable/, so that we can get a new conversion with hopefully close to acceptable quality.

I will also file separate bugs for the two issues at hand (and make this bug dependent on those new bugs - that means this one can't be closed before both those bugs are fixed).

albbas commented 13 years ago

Comment 4498

Date: 2011-06-14 18:29:04 +0200 From: Sjur Nørstebø Moshagen <>

Hm, I'm not so sure. Checking with our corpus DTD, I think it says that lists can be nested within lists, but not within listitems, as you can do in html. D****!!! I thought I was so close - now it is back to square one.

Sorry, no updated files today - at least not yet. I have to rest and resettle before I can try to redo these things again, now in accordance with our DTD.

No new bugs filed - there's no point as long as the structure is invalid (but xmllint didn't complain, which was kind of strange).
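To make the structural difference concrete, here is a minimal sketch in Python (not the XSL used in the actual conversion). It assumes the reading of the DTD given above, namely that a nested list must be a child of the parent `list` and not of the list item. The `list` and `p type="listitem"` names are taken from the converted files quoted later in this thread; the function name and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def convert_list(html_list):
    """Convert an HTML <ul>/<ol> element into a corpus-style <list> element."""
    out = ET.Element("list")
    for li in html_list.findall("li"):
        item = ET.SubElement(out, "p", {"type": "listitem"})
        # Only the leading text of the <li> is kept in this sketch;
        # inline children and tail text are ignored for brevity.
        item.text = (li.text or "").strip()
        for child in li:
            if child.tag in ("ul", "ol"):
                # HTML nests the sub-list inside the <li>; here it becomes
                # a sibling of the list item, directly under the parent list.
                out.append(convert_list(child))
    return out


html = ET.fromstring(
    "<ul><li>a<ul><li>a1</li><li>a2</li></ul></li><li>b</li></ul>"
)
result = convert_list(html)
print(ET.tostring(result, encoding="unicode"))
```

In the output the nested `list` sits between the two top-level list items and never inside a `p` element.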

albbas commented 13 years ago

Comment 4500

Date: 2011-06-15 07:39:00 +0200 From: Sjur Nørstebø Moshagen <>

Got the xml structure for nested lists correct, and got the content in there as well. Getting closer...

Still some content is missing, namely divs containing span and a elements mixed with PCDATA (mixed content). Only the text directly within the div is converted, the rest of the text is still lost.
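The symptom resembles what happens when only the first text node of an element is read. The following Python sketch illustrates the difference with `xml.etree.ElementTree`; it is an illustration of mixed content in general, not a reproduction of the conversion scripts, and the sample `div` is made up.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    '<div>Intro <span>inside span</span> and <a href="#">a link</a> tail.</div>'
)

# .text holds only the text before the first child element.
direct_only = div.text

# itertext() walks the element and all descendants, including tail text.
full_text = "".join(div.itertext())

print(repr(direct_only))
print(repr(full_text))
```

Here `direct_only` is `'Intro '`, while `full_text` is `'Intro inside span and a link tail.'`.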

albbas commented 13 years ago

Comment 4502

Date: 2011-06-15 12:31:01 +0200 From: Sjur Nørstebø Moshagen <>

With help from Ciprian, the final issues with the empty element conversion were solved. The conversion now seems mostly fine, with the following notes:

File 8: italic elements within p vanish, which leads to a couple of cases of run-together words (i.e. the original does not contain any space between the tag and the following word, and after conversion there is no way to separate the two words anymore). See "guovllustivrratoaivvildit" in section 8.2 of that file for an example.

File 12: all lists are extended beyond their real content, including the following two paragraphs as regular list item content. This happens in §§ 25, 32, 47, 50, 54, 57, 62, 79.

There are some occurrences of wrong language encoding.

I'll compare with the previous version in the corpus repository, and then commit the conversion result I have at the moment if this new version is an improvement. Please note that the committed files will still have the issues above.

albbas commented 13 years ago

Comment 4503

Date: 2011-06-15 12:55:27 +0200 From: Sjur Nørstebø Moshagen <>

Sme files committed. Please check. I'll convert the nob files now.

albbas commented 13 years ago

Comment 4504

Date: 2011-06-15 13:31:26 +0200 From: Berit Nystad Eskonsipo <>

I have checked file nr 2 and I see that the lists above the text are also converted to xml. In the previous version this part was skipped. I don't think this will be a problem because the nob version also has these lists - in nob.

Here are the lists I'm talking about: Mana ohcamii Mana sisdollui Mana bajimus navigašuvdnii Mana báikkálaš navigašuvdnii Bokmål Nynorsk Sámegiella English

ráđđehus.no Stoltenberga nubbi ráđđehus Departemeanttat Stáhtaministara kantuvra Bargodepartemeanta Mánáid-, dásseárvo- ja searvadahttindepartemeanta Ruhtadandepartemeanta Guolástus- ja riddodepartemeanta Ođasmahttin-, hálddahus- ja girkodepartemeanta Suodjalusdepartemeanta Dearvvašvuođa- ja fuolahusdepartemeanta Justis- ja politidepartemeanta Gielda- ja guovlodepartemeanta Kulturdepartemeanta Máhttodepartemeanta Eanandoallo- ja biebmodepartemeanta Birasgáhttendepartemeanta Ealáhus-ja gávpedepartemeanta Oljo- ja energiijadepartemeanta Johtalusdepartemeanta Olgoriikadepartemeanta

Oza Eanandoallo- ja biebmodepartemeantta siidduin Oza olles ráđđehusa.no:as Juogat/CavgilGovčča Čálánhápmi

Fáddá A-Å Neahttakarta Hjelp Oktavuohta Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < 2 Láhkaevttohusa duogáš

(In reply to comment #6)

Sme files committed. Please check. I'll convert the nob files now.

albbas commented 13 years ago

Comment 4506

Date: 2011-06-15 14:22:56 +0200 From: Sjur Nørstebø Moshagen <>

Nob files committed. I'm changing the priority a bit: I have to do other things now, and the material should be good enough for others to look at.

We could consider closing this bug, since the original issue has been solved.

(In reply to comment #7)

I have checked file nr 2 and I see that the lists above the text are also converted to xml. In the previous version this part was skipped. I don't think this will be a problem because the nob version also has these lists - in nob.

Yes, the lists are converted in both versions. By default, it is an error if such lists are not converted (i.e. it is a sign that something is wrong with the conversion). Nevertheless, I discussed these lists with Trond, and we agreed to take a conservative approach: lists (and template content, often in other languages) that are clearly not part of the text, and clearly of no use to the work with Sámi language technology, should be skipped if such parts can be uniquely identified.

Here are the lists I'm talking about: Mana ohcamii Mana sisdollui Mana bajimus navigašuvdnii Mana báikkálaš navigašuvdnii

This list can be useful in a future localisation project - converted in all languages.

Bokmål Nynorsk Sámegiella English

Such lists as this one are almost never useful, and as you'll see, the language list is not converted in the newest version in stable/.

ráđđehus.no Stoltenberga nubbi ráđđehus Departemeanttat Stáhtaministara kantuvra Bargodepartemeanta Mánáid-, dásseárvo- ja searvadahttindepartemeanta Ruhtadandepartemeanta Guolástus- ja riddodepartemeanta Ođasmahttin-, hálddahus- ja girkodepartemeanta Suodjalusdepartemeanta Dearvvašvuođa- ja fuolahusdepartemeanta Justis- ja politidepartemeanta Gielda- ja guovlodepartemeanta Kulturdepartemeanta Máhttodepartemeanta Eanandoallo- ja biebmodepartemeanta Birasgáhttendepartemeanta Ealáhus-ja gávpedepartemeanta Oljo- ja energiijadepartemeanta Johtalusdepartemeanta Olgoriikadepartemeanta

This list is definitely useful, and should always be converted. The existence of such lists might be problematic in certain contexts (e.g. it will skew some types of statistics), but we'll have to deal with those issues outside of the conversion process.

Oza Eanandoallo- ja biebmodepartemeantta siidduin Oza olles ráđđehusa.no:as Juogat/CavgilGovčča Čálánhápmi

Not very useful, and skipped.

Fáddá A-Å Neahttakarta Hjelp Oktavuohta

Potentially useful, thus included.

Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < 2 Láhkaevttohusa duogáš

Potentially useful, and not uniquely identifiable, thus kept.

The decision on what to include and what to exclude might seem random, and I'm very open to discussion.

albbas commented 13 years ago

Comment 4507

Date: 2011-06-15 15:07:57 +0200 From: Berit Nystad Eskonsipo <>

I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml

In this file the lists above the text are missing:

Om lov om reindrift (reindriftsloven)

Søk hos Landbruks- og matdepartementet

Søk på hele regjeringen.no

Du er her: < < <

Ot.prp. nr. 25 (2006-2007)

Om lov om reindrift (reindriftsloven)

Bla i dokumentet: | |

Forslag til lov om reindrift (reindriftsloven)

Kapittel 1 Innledende bestemmelser

Here is the same part in the sme file:

Boazodoallolága birra

Mana ohcamii

Mana sisdollui

Mana bajimus navigašuvdnii

Mana báikkálaš navigašuvdnii

ráđđehus.no

Stoltenberga nubbi ráđđehus

Stáhtaministara kantuvra

Bargodepartemeanta

Mánáid-, dásseárvo- ja searvadahttindepartemeanta

Ruhtadandepartemeanta

Guolástus- ja riddodepartemeanta

Ođasmahttin-, hálddahus- ja girkodepartemeanta

Suodjalusdepartemeanta

Dearvvašvuođa- ja fuolahusdepartemeanta

Justis- ja politidepartemeanta

Gielda- ja guovlodepartemeanta

Kulturdepartemeanta

Máhttodepartemeanta

Eanandoallo- ja biebmodepartemeanta

Birasgáhttendepartemeanta

Ealáhus-ja gávpedepartemeanta

Oljo- ja energiijadepartemeanta

Johtalusdepartemeanta

Olgoriikadepartemeanta

Oza Eanandoallo- ja biebmodepartemeantta siidduin

Oza olles ráđđehusa.no:as

Fáddá A-Å

Neahttakarta

Hjelp

Oktavuohta

Dás don leat: Eanandoallo- ja biebmodepar... < Dokumeanttat < Proposisjoner og meldinger < Evttohus

Od.prp. nr. 25 (2006-2007)

Boazodoallolága birra

Bláđe dokumeanttas: < Ráva | | Ovdasiidu

Evttohus

Boazodoalloláhka

Kapihtal 1 Álggaheaddji mearrádusat

This bug is also in the footer:

nob:

Bla i dokumentet: | |

Skip left menu navigation

Landbruks- og matdepartementet, Akersgt. 59 (R5), Postboks 8007 Dep. 0030 Oslo l  Tlf: 22 24 90 90, Faks: 22 24 95 55 E-post:   l  Nettredaksjonen: Ansvarlig redaktør: l  Nettredaktør:

sme:

Bláđe dokumeanttas: < Ráva | | Ovdasiidu

There will be problems when we try to parallel these documents.

albbas commented 13 years ago

Comment 4508

Date: 2011-06-15 15:12:19 +0200 From: Ciprian Gerstenberger <>

I have a suggestion wrt. this

If the result of paragraph-making procedure is an empty paragraph this should be skipped altogether.

(In reply to comment #9)

I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml

albbas commented 13 years ago

Comment 4510

Date: 2011-06-15 15:30:22 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #9)

I have checked the files after Sjur fixed the list problem. All files are ok, except nr 12_nob: OTP200620070025000SE_12.html.xml [...] There will be problems when we try to parallel these documents.

Cf svn commit r1564 in $GTFREE:

"* file 12 is not updated because of conversion errors."

That is, the nob file 12 is not yet in comparable shape. Hopefully I will be able to correct it later today or tomorrow.

(In reply to comment #10)

I have a suggestion wrt. this

If the result of paragraph-making procedure is an empty paragraph this should be skipped altogether.

I agree, as long as it is done carefully. Empty elements can also be a sign of conversion errors.
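One way to do this carefully is to count what gets removed, so that an unexpectedly high number can be investigated as a possible conversion error. The sketch below is in Python and not in the XSL of the actual conversion. It assumes paragraphs are `p` elements as in the converted files; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_empty_paragraphs(root):
    """Remove <p> elements without text or children; return the number removed."""
    removed = 0
    # Take a snapshot of all elements so the tree can be modified safely.
    for parent in list(root.iter()):
        for p in parent.findall("p"):
            is_empty = len(p) == 0 and not (p.text or "").strip()
            # Keep the element if text follows it, so that no content is lost.
            has_tail = bool((p.tail or "").strip())
            if is_empty and not has_tail:
                parent.remove(p)
                removed += 1
    return removed


doc = ET.fromstring(
    '<document><body>'
    '<p type="text">Tekst</p>'
    '<p type="text"> </p>'
    '<list><p type="listitem"/></list>'
    '<p type="text"><em>x</em></p>'
    '</body></document>'
)
removed = drop_empty_paragraphs(doc)
print(removed, "empty paragraphs removed")
```

The returned count can be logged per file, which keeps the cleanup visible instead of silently hiding a broken conversion.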

A general comment: when quoting code or earlier comments, try to not quote more than strictly necessary to convey your point. Less is more:)

albbas commented 13 years ago

Comment 4525

Date: 2011-06-16 12:55:55 +0200 From: Sjur Nørstebø Moshagen <>

The NOB file no 12 is now also successfully converted and committed (r1567), including all lists. This bug is one step closer to being closed.

albbas commented 13 years ago

Comment 4526

Date: 2011-06-16 14:09:49 +0200 From: Berit Nystad Eskonsipo <>

This applies to the lists in nob and sme file no 12 - §§ 25, 32, 47, 50, 54, 57, 62, 79.

Also the text under the lists is converted as a part of the lists:

Here is the list in nob no 12 - § 25:

<list>
  <p type="listitem">1. brensel,</p>
  <p type="listitem">2. gammer, koier, buer
                            eller stillinger for oppbevaring av
                            løsøre og matvarer,</p>
  <p type="listitem">3. teltstenger,
                            redskaper og enklere bruksting,</p>
  <p type="listitem">4. arbeidsgjerder
                            (trøer, ringgjerder),</p>
  <p type="listitem">5. garving.</p>
  <p type="listitem">Friskt lauvtrevirke og
                              friske busker må ikke tas så fremt
                              det på stedet eller i nærheten finnes
                              annet virke som er tjenlig for
                              formålet.Skogeieren kan kreve
                              betaling for friske lauvtrær som tas
                              i privat skog, men ellers kan det
                              ikke kreves betaling for virke som
                              rettmessig blir tatt i medhold av
                              denne paragraf. Det skal uten opphold
                              gis melding til grunneieren om uttak
                              av trevirke som denne kan kreve
                              betaling for. Oppnås ikke enighet om
                              betalingen, kan beløpets størrelse
                              kreves fastsatt ved skjønn ved
                              jordskifteretten. Finnmarkseiendommen
                              kan ikke kreve betaling etter
                              bestemmelsene i leddet her.Så langt det
                              fremstiller seg som nødvendig av
                              hensyn til skogens bevaring,
                              foryngelse eller gjenvekst eller
                              fordi det er mangel på trevirke i
                              distriktet, kan Kongen ved forskrift
                              begrense eller helt forby uttak av
                              trevirke i nærmere bestemte områder
                              og derunder bl.a. bestemme at friskt
                              virke bare kan tas etter
                              utvising.</p>
</list>

albbas commented 13 years ago

Comment 4529

Date: 2011-06-16 16:20:45 +0200 From: Sjur Nørstebø Moshagen <>

Thanks for the feedback, I had forgotten about those, even though I noticed earlier. Working on them now.

albbas commented 13 years ago

Comment 4532

Date: 2011-06-16 18:42:14 +0200 From: Sjur Nørstebø Moshagen <>

Now also the lists in file 12 should be correct. Please check and report back.

albbas commented 13 years ago

Comment 4536

Date: 2011-06-17 10:13:31 +0200 From: Berit Nystad Eskonsipo <>

The lists in nob and sme No 12 do not end at the same place as in the orig file.

Converted: No 12 nob - list in § 25:

5. garving.

  <p type="text">Friskt lauvtrevirke og ....

Orig html source: No 12 nob - list in § 25:

  • 5. garving.

  • Friskt lauvtrevirke og

The list in the converted file should end after the last listitem, and not after the last paragraph in the section as it does now. §§ 25, 32, 47, 50, 54, 57, 62, 79 in nob and sme No 12 contain lists of this format.

albbas commented 13 years ago

Comment 4538

Date: 2011-06-17 10:46:48 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #16)

The lists in nob and sme No 12 do not end at the same place as in the orig file.

Yes, they do.

Converted: No 12 nob - list in § 25:

5. garving.

  <p type="text">Friskt lauvtrevirke og ....

Orig html source: No 12 nob - list in § 25:

  • 5. garving.

  • Friskt lauvtrevirke og

The outer ul element is there, and includes the non-list paragraphs.

The list in converted file should end after the last listitem, and not after the last paragraph in the section as it does now.

I agree, but the problem is the input document. The output follows the input exactly - just open the html file in SubEthaEdit, choose the menu item "Format > Tidy and Pretty Print HTML", and then find one of the lists. You'll see that all of them are exactly like this.

It might be possible to work around this, but it will take quite some work, and as it is now we don't lose any content. The structure isn't 100% semantically correct, but pretty close (the paragraphs in question are not marked as listitems, but as regular text).

Also, I am a bit hesitant to do too much specific processing in a general conversion routine. I'll instead try to do this in the file-specific xsl. That might actually be a lot easier.
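As an illustration of the kind of file-specific post-processing discussed here, the following Python sketch moves trailing non-listitem paragraphs out of a `list` and places them directly after it. It is not the file-specific xsl mentioned above. The element and attribute names follow the converted output quoted in this thread; the function name and the shortened sample section are hypothetical.

```python
import xml.etree.ElementTree as ET


def hoist_trailing_paragraphs(root):
    """Move non-listitem <p> elements at the end of each <list> to just after it."""
    moved = 0
    # Reverse document order: nested lists are handled before their ancestors.
    for parent in reversed(list(root.iter())):
        for lst in parent.findall("list"):
            children = list(lst)
            trailing = []
            while (children and children[-1].tag == "p"
                   and children[-1].get("type") != "listitem"):
                trailing.insert(0, children.pop())
            insert_at = list(parent).index(lst) + 1
            for offset, p in enumerate(trailing):
                lst.remove(p)
                parent.insert(insert_at + offset, p)
                moved += 1
    return moved


section = ET.fromstring(
    '<section><list>'
    '<p type="listitem">4. arbeidsgjerder (trøer, ringgjerder),</p>'
    '<p type="listitem">5. garving.</p>'
    '<p type="text">Friskt lauvtrevirke og ....</p>'
    '<p type="text">Skogeieren kan kreve ....</p>'
    '</list></section>'
)
moved = hoist_trailing_paragraphs(section)
print(ET.tostring(section, encoding="unicode"))
```

Only paragraphs at the very end of a list are moved, so a regular text paragraph between two list items is left where it is.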

albbas commented 13 years ago

Comment 4539

Date: 2011-06-17 10:58:27 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #17)

(In reply to comment #16)

The lists in nob and sme No 12 does not end at the same place as in the orig file.

Yes, they do.

But thanks for noticing the problem - I hadn't.

Also, I am a bit hesitant to do too much specific processing in a general conversion routine. I'll instead try to do this in the file-specific xsl. That might actually be a lot easier.

It was actually very easy when working with the intermediate xml instead of the source html:)

Fixed, and will be committed in a few moments.

albbas commented 13 years ago

Comment 4540

Date: 2011-06-17 11:03:27 +0200 From: Sjur Nørstebø Moshagen <>

The lists are fixed in r1583. Is this bug ready to be closed?

albbas commented 13 years ago

Comment 4541

Date: 2011-06-17 11:33:59 +0200 From: Berit Nystad Eskonsipo <>

The lists are great now! For me it looks like the bug is fixed.

albbas commented 13 years ago

Comment 4543

Date: 2011-06-17 11:48:24 +0200 From: Sjur Nørstebø Moshagen <>

This bug is fixed. Similar problems with other files should be given a new bug.