impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

[FedGaz] - Duplicated pages #131

Open piconti opened 3 months ago

piconti commented 3 months ago

In the FedGaz data, there are many occurrences of a page being duplicated. FedGaz has a low number of content-items, each spanning several pages. It appears that when one content-item ends and the next begins on the same page (i.e. when 2 content-items share a page), that page gets duplicated in the data.

The fix for this may be very simple (i.e. removing the duplicates from the pp lists in the content-items' metadata), but it might be worth investigating the cause further.
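As a rough illustration of the simple fix mentioned above, here is a minimal sketch that removes repeated page numbers from a content-item's pp list while keeping the original order. It assumes the content-item is a plain dict whose metadata sits under an "m" key holding a "pp" list of page numbers; the function names and the example item are hypothetical and not taken from the importer code.

```python
from typing import Any


def dedupe_pages(pages: list[int]) -> list[int]:
    """Remove repeated page numbers, keeping the first-seen order."""
    # dict preserves insertion order, so this drops later repeats only.
    return list(dict.fromkeys(pages))


def fix_content_item(item: dict[str, Any]) -> bool:
    """Deduplicate item["m"]["pp"] in place; return True if it changed.

    Assumption: metadata lives under "m" and pages under "pp".
    """
    meta = item.get("m", {})
    pages = meta.get("pp")
    if not pages:
        return False
    cleaned = dedupe_pages(pages)
    changed = len(cleaned) != len(pages)
    meta["pp"] = cleaned
    return changed


# Hypothetical content-item, for illustration only.
example = {"m": {"id": "hypothetical-item-id", "pp": [1, 2, 2, 3]}}
changed = fix_content_item(example)
print(changed, example["m"]["pp"])
```

This only cleans up the symptom in the metadata. It does not address whether the page itself was written twice elsewhere in the data, which would need the further investigation suggested above.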

Examples of newspaper issues affected by this problem are listed below; more are available in this notebook.

FedGazDe:
1849: [
    "FedGazDe-1849-06-15-a",
    "FedGazDe-1849-04-18-a",
    "FedGazDe-1849-07-18-a",
    "FedGazDe-1849-06-12-a",
    "FedGazDe-1849-03-21-a",
    "FedGazDe-1849-10-06-a"
],
1852: [
    "FedGazDe-1852-06-16-a"
],
1856: [
    "FedGazDe-1856-12-26-a"
]

FedGazFr:
1849: [
    "FedGazFr-1849-06-15-a",
    "FedGazFr-1849-02-24-a",
    "FedGazFr-1849-07-18-a",
    "FedGazFr-1849-05-01-a",
    "FedGazFr-1849-06-12-a",
    "FedGazFr-1849-08-04-a",
    "FedGazFr-1849-03-26-a"
],
1850: [
    "FedGazFr-1850-08-31-a"
],