impresso / impresso-pycommons

Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

missing text on articles spanning several pages #40

Closed e-maud closed 5 years ago

e-maud commented 5 years ago

@e-maud and @mromanello From Raphaël:

In the dhSegment sample, all articles that the OLR recognizes as spanning more than one page have text only on the first page in the rebuilt data.

This is often not the case for later periods, because there all articles are on one page.

mromanello commented 5 years ago

I've been investigating this, starting from GDL-1807-06-23-a-i0006. This is the entry for this item in the issue ToC:

{
 "l": {
  "id": [
   "Ar00201",
   "Ar00300",
   "Ar00400",
   "Ar00500"
  ],
  "source": [
   "103-GDL-1807-06-23-0001.pdf",
   "103-GDL-1807-06-23-0001.pdf",
   "103-GDL-1807-06-23-0001.pdf",
   "103-GDL-1807-06-23-0001.pdf"
  ]
 },
 "m": {
  "id": "GDL-1807-06-23-a-i0006",
  "l": "fr",
  "pp": [
   2,
   3,
   4,
   5
  ],
  "t": "\u2022ALLEMAGNE.",
  "tp": "article"
 }
}

The regions on page 2 are fine. The regions that belong to this article on page 3 have pOf == GDL-1807-06-23-a-i0008 instead of GDL-1807-06-23-a-i0006, etc.
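
A rough way to spot this kind of mismatch is sketched below. It is only an illustration: the function name `pages_missing_article` and the `regions_by_page` layout (page number mapped to a list of region dicts carrying a `pOf` key) are assumptions made for the example, not the actual canonical page format, and the ToC entry is trimmed to two pages.

```python
def pages_missing_article(toc_entry, regions_by_page):
    """Return the pages listed in the ToC entry on which no region
    points back to the article via its pOf value.

    `regions_by_page` is a hypothetical structure: {page_no: [region, ...]},
    where each region is a dict that may carry a "pOf" key.
    """
    article_id = toc_entry["m"]["id"]
    missing = []
    for page_no in toc_entry["m"]["pp"]:
        regions = regions_by_page.get(page_no, [])
        # A page counts as covered if at least one region belongs to the article.
        if not any(region.get("pOf") == article_id for region in regions):
            missing.append(page_no)
    return missing


# ToC entry from above, trimmed to pages 2 and 3 for illustration.
toc_entry = {"m": {"id": "GDL-1807-06-23-a-i0006", "pp": [2, 3]}}

# Assumed region data mirroring the observation: page 2 is fine,
# page 3 points to a different item.
regions_by_page = {
    2: [{"pOf": "GDL-1807-06-23-a-i0006"}],
    3: [{"pOf": "GDL-1807-06-23-a-i0008"}],
}

missing_pages = pages_missing_article(toc_entry, regions_by_page)
print(missing_pages)
```

With the data above, only page 3 is reported, which matches what I see for this article.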

So it's a problem with text ingestion rather than with the rebuild. For multipart articles we need to add a mapping between the canonical ID of the article and the canonical IDs of its parts.
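
A minimal sketch of what such a mapping could look like, assuming it is available as a plain dict. The names `parts_by_article`, `invert_parts` and `resolve_part_of` are hypothetical and not part of impresso-pycommons, and only the one part ID mentioned above is used as sample data.

```python
def invert_parts(parts_by_article):
    """Turn {article_id: [part_id, ...]} into {part_id: article_id}."""
    part_to_article = {}
    for article_id, part_ids in parts_by_article.items():
        for part_id in part_ids:
            part_to_article[part_id] = article_id
    return part_to_article


def resolve_part_of(regions, part_to_article):
    """Return copies of the regions in which a pOf that names a part
    is replaced by the canonical ID of the article the part belongs to.
    Regions without pOf, or with an unknown pOf, are copied unchanged."""
    resolved = []
    for region in regions:
        fixed = dict(region)
        if "pOf" in fixed:
            fixed["pOf"] = part_to_article.get(fixed["pOf"], fixed["pOf"])
        resolved.append(fixed)
    return resolved


# Hypothetical mapping: only the part ID observed on page 3 is listed here.
parts_by_article = {
    "GDL-1807-06-23-a-i0006": ["GDL-1807-06-23-a-i0008"],
}
part_to_article = invert_parts(parts_by_article)

page_3_regions = [{"pOf": "GDL-1807-06-23-a-i0008"}]
fixed_regions = resolve_part_of(page_3_regions, part_to_article)
print(fixed_regions)
```

Whether the lookup should be applied at ingestion time (rewriting pOf) or at rebuild time (collecting regions of all parts) is left open here; the sketch only shows the lookup itself.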