impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

[BCUL] - Reingest years 1866 and 1867 of `TouSuIl` #132

Open piconti opened 3 months ago

piconti commented 3 months ago

As already discussed with Theophile, the BCUL data contained a few duplicated issues whose duplicate identifier is no longer valid (it was removed from the API). Most of them were identified during the canonical (or rebuilt) ingestion, but three remained in TouSuIl (Le Touriste - la Suisse Illustrée). The issues are the following:

Unfortunately, they were processed after their correct counterparts, so the page files were overwritten with faulty information. Fixing them by hand is possible but error-prone. It seems more appropriate to re-ingest these two years of the BCUL data as part of the next ingestion, to ensure the problem is fixed.

For now, these issues have simply been excluded from the MySQL ingestion.
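
To illustrate the overwrite described above, here is a minimal sketch of how skipping known-invalid issue identifiers before writing page files would keep the correct pages intact. This is not the importer's actual code: `write_pages`, the record layout, and the identifiers `issue-A` / `issue-A-dup` are hypothetical placeholders, and the dictionary only stands in for wherever page files are actually written.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (issue_id, page_id, page_payload).
PageRecord = Tuple[str, str, dict]


def write_pages(
    records: Iterable[PageRecord],
    invalid_issue_ids: Set[str],
) -> Dict[str, dict]:
    """Write pages keyed by page id, skipping issues known to be invalid.

    Without the skip, a duplicate processed after its correct counterpart
    would overwrite the page stored under the same page id.
    """
    store: Dict[str, dict] = {}
    for issue_id, page_id, payload in records:
        if issue_id in invalid_issue_ids:
            # Invalid duplicate: do not let it overwrite the correct page.
            continue
        store[page_id] = payload
    return store


# Placeholder data: the duplicate comes after the correct issue,
# and both map to the same page id.
records = [
    ("issue-A", "page-1", {"source": "correct"}),
    ("issue-A-dup", "page-1", {"source": "faulty"}),
]

pages = write_pages(records, invalid_issue_ids={"issue-A-dup"})
unfiltered = write_pages(records, invalid_issue_ids=set())
```

With the filter, `pages["page-1"]` keeps the correct payload; without it, the later duplicate wins, which matches the situation described here. Since the page files were already overwritten, a filter like this only helps once the affected years are re-ingested.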