impresso / federal-gazette

Wrong metadata prevents correct article segmentation for supplements #11

Open aflueckiger opened 4 years ago

aflueckiger commented 4 years ago

Starting Point

Two consecutive articles may share a single page. In-page article segmentation refers to the case where an article ends on the same page on which the next article begins. In such cases, the original data may contain two instances of the same page. To set the correct boundaries between articles and avoid duplicated content, cases of in-page segmentation need to be identified.
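To make the starting point concrete, here is a minimal sketch of how a shared boundary page could be spotted. It is not the pipeline's actual code: the `Article` class, its fields, and `find_shared_pages` are hypothetical names, and it assumes each article carries an ordered list of page numbers.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the real metadata schema may differ.
    article_id: str
    pages: list[int]  # page numbers in reading order


def find_shared_pages(articles: list[Article]) -> list[tuple[str, str, int]]:
    """Return (former_id, latter_id, page) for each pair of consecutive
    articles whose boundary page is the same."""
    shared = []
    for former, latter in zip(articles, articles[1:]):
        if former.pages and latter.pages and former.pages[-1] == latter.pages[0]:
            shared.append((former.article_id, latter.article_id, latter.pages[0]))
    return shared


issue = [
    Article("a", [1, 2]),
    Article("b", [2, 3]),  # starts on the page where "a" ends
    Article("c", [4]),
]
shared = find_shared_pages(issue)
```

In this toy issue, only the pair `a`/`b` is reported, because page 2 appears at the end of `a` and at the start of `b`.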

Problem

A heuristic procedure that leverages the original metadata is in place to identify these cases. Unfortunately, the metadata for supplements does not always match the actual composition of an issue as reflected by the printed page numbers. For supplements, an alleged in-page article segmentation turns out to be wrong in many cases. While supplements mostly start on a new page at the end of an issue, a minority of them share a page with the previous article, just like regular articles of an issue.

Consequence

For this reason, the heuristic presumes that supplements never share a page with the previous article. As a result, approximately 250 cases of in-page segmentation go undetected per language. These undetected cases lead to duplicated content in the corpus: a single page is assigned twice, once to the former and once to the subsequent article. Because of the erroneous metadata, there is no easy fix that does not add considerable complexity to the pipeline. Moreover, the scope of the issue can hardly be estimated unless a sample of supplements is manually annotated and the result extrapolated to the corpus.
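The presumption described above could look roughly like the following sketch. It only illustrates the trade-off and does not reproduce the pipeline's implementation: `Article`, `is_supplement`, and `detect_in_page_segmentation` are hypothetical names, and the page lists are assumed to come straight from the (possibly erroneous) metadata.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the real metadata schema may differ.
    article_id: str
    pages: list[int]  # page numbers according to the metadata
    is_supplement: bool = False


def detect_in_page_segmentation(articles: list[Article]):
    """Split boundary-page candidates into accepted and skipped pairs.

    A candidate is skipped when the latter article is a supplement,
    mirroring the presumption that supplements start on a new page.
    """
    detected, skipped = [], []
    for former, latter in zip(articles, articles[1:]):
        if not (former.pages and latter.pages):
            continue
        if former.pages[-1] != latter.pages[0]:
            continue
        pair = (former.article_id, latter.article_id, latter.pages[0])
        if latter.is_supplement:
            skipped.append(pair)  # unreliable metadata: candidate is ignored
        else:
            detected.append(pair)
    return detected, skipped


issue = [
    Article("a", [1, 2]),
    Article("b", [2, 3]),
    Article("s", [3, 4], is_supplement=True),
    Article("c", [5]),
]
detected, skipped = detect_in_page_segmentation(issue)
```

Here the pair `a`/`b` is accepted, while `b`/`s` is skipped. If the supplement truly shares page 3 with `b`, that page ends up assigned to both articles. The skipped pairs are also the natural candidates to draw a sample from for manual annotation.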