Closed by e-maud 5 years ago
I've been investigating this, starting from GDL-1807-06-23-a-i0006. This is the entry for this item in the issue ToC:
{
  "l": {
    "id": [
      "Ar00201",
      "Ar00300",
      "Ar00400",
      "Ar00500"
    ],
    "source": [
      "103-GDL-1807-06-23-0001.pdf",
      "103-GDL-1807-06-23-0001.pdf",
      "103-GDL-1807-06-23-0001.pdf",
      "103-GDL-1807-06-23-0001.pdf"
    ]
  },
  "m": {
    "id": "GDL-1807-06-23-a-i0006",
    "l": "fr",
    "pp": [
      2,
      3,
      4,
      5
    ],
    "t": "\u2022ALLEMAGNE.",
    "tp": "article"
  }
}
The regions on page 2 are fine. The regions that belong to this article on page 3 have pOf == GDL-1807-06-23-a-i0008 instead of GDL-1807-06-23-a-i0006, etc.
So it's a problem with text ingestion rather than with the rebuild. For multipart articles we need to add a mapping between the canonical ID of the article and the canonical IDs of its parts.
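A minimal sketch of what such a mapping could look like, assuming the ToC entry shape shown above (`l.id` for the legacy part IDs, `m.id` for the article's canonical ID). The function name `build_part_to_article_map` is hypothetical, and the sketch keys the mapping on the legacy part IDs because those are what the ToC entry exposes; the real ingestion code may key it differently.

```python
from typing import Dict, List


def build_part_to_article_map(toc_items: List[dict]) -> Dict[str, str]:
    """Map each legacy part ID to the canonical ID of its parent article.

    Hypothetical helper: assumes every ToC item has "l" -> "id" (legacy IDs)
    and "m" -> "id" (canonical article ID), as in the entry quoted above.
    """
    part_to_article: Dict[str, str] = {}
    for item in toc_items:
        canonical_id = item["m"]["id"]
        legacy_ids = item["l"]["id"]
        # Defensive assumption: a single-part article might store a bare string.
        if isinstance(legacy_ids, str):
            legacy_ids = [legacy_ids]
        for legacy_id in legacy_ids:
            part_to_article[legacy_id] = canonical_id
    return part_to_article


# Trimmed copy of the ToC entry from this issue.
toc_item = {
    "l": {"id": ["Ar00201", "Ar00300", "Ar00400", "Ar00500"]},
    "m": {"id": "GDL-1807-06-23-a-i0006", "pp": [2, 3, 4, 5]},
}

part_to_article = build_part_to_article_map([toc_item])
```

With a lookup like this, every part of the article, whichever page it sits on, would resolve to GDL-1807-06-23-a-i0006 when pOf is written during ingestion.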
@e-maud and @mromanello, from Raphaël:
In the dhSegment sample, all articles that the OLR recognizes as spanning more than one page have text only on the first page in the rebuild.
Article begins on page 2, ends on page 5, text only on page 2 (other articles in this newspaper also have the problem) https://impresso-project.ch/alpha/#/issue/GDL-1807-06-23-a/page/GDL-1807-06-23-a-p0002/article/GDL-1807-06-23-a-i0006
Article begins on page 2, ends on page 3, text only on page 2 https://impresso-project.ch/alpha/#/issue/GDL-1882-06-28-a/page/GDL-1882-06-28-a-p0002/article/GDL-1882-06-28-a-i0012
Article begins on page 1, ends on page 2, text only on page 1 https://impresso-project.ch/alpha/#/issue/GDL-1902-10-29-a/page/GDL-1902-10-29-a-p0001/article/GDL-1902-10-29-a-i0002
Article begins on page 2, ends on page 3, text only on page 2 https://impresso-project.ch/alpha/#/issue/EXP-1925-09-12-a/page/EXP-1925-09-12-a-p0002/article/EXP-1925-09-12-a-i0030
This is often not the case for later periods, because all articles are on one page.
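The pattern reported in these examples could be checked automatically. The sketch below lists, for one ToC item, the pages in `m.pp` on which no region has a `pOf` equal to the article's canonical ID. The function name `pages_missing_regions` and the `regions_by_page` layout (page number mapped to a list of region dicts) are assumptions made for illustration, not the actual page schema, and the data is a toy reduction of the GDL-1807-06-23-a case.

```python
from typing import Dict, List


def pages_missing_regions(
    toc_item: dict, regions_by_page: Dict[int, List[dict]]
) -> List[int]:
    """Return the pages listed in m.pp that have no region pointing at the article.

    Hypothetical helper: assumes each region dict carries a "pOf" key holding
    the canonical ID of the content item it belongs to.
    """
    article_id = toc_item["m"]["id"]
    missing: List[int] = []
    for page_number in toc_item["m"]["pp"]:
        regions = regions_by_page.get(page_number, [])
        if not any(region.get("pOf") == article_id for region in regions):
            missing.append(page_number)
    return missing


# Toy data: the article is declared on pages 2 and 3, but the region on
# page 3 points at a different content item, as described in this issue.
toy_item = {"m": {"id": "GDL-1807-06-23-a-i0006", "pp": [2, 3]}}
toy_regions = {
    2: [{"pOf": "GDL-1807-06-23-a-i0006"}],
    3: [{"pOf": "GDL-1807-06-23-a-i0008"}],
}

missing_pages = pages_missing_regions(toy_item, toy_regions)
print(missing_pages)  # [3]
```

An empty result for every multi-page item would suggest that pOf is being assigned consistently across pages.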