impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

incomplete ingestion when chunk-size is defined #92

Closed: aflueckiger closed this issue 4 years ago

aflueckiger commented 4 years ago

There is an issue in the generic importer when it processes data in chunks. Only full chunks are ingested correctly; the remaining years after the last full chunk are skipped at the end of the process.
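To illustrate the suspected failure mode, here is a minimal sketch of chunking that also yields the trailing partial chunk. It is not the code of `text_importer`; the helper name `chunked` and the chunk size of 10 are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `chunk_size` items, including the last partial one."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Without this final check, items after the last full chunk would be dropped.
    if chunk:
        yield chunk


# Hypothetical input: one entry per year of the range mentioned in this issue.
years = list(range(1849, 2000))
chunks = list(chunked(years, 10))
```

With 151 years and a chunk size of 10, the last chunk holds a single year. A loop that only emits a chunk once it is full would silently lose it.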

This issue emerged when I ingested FedGaz data for the years 1849 to 1999, as specified in a config file. However, only the years up to 1989 were processed and uploaded to S3. These are the last lines of the log:

2020-04-06 23:47:59,816 text_importer.importers.core INFO     Processing chunk of 1980
2020-04-06 23:47:59,851 text_importer.importers.core INFO     Start compressing and uploading issues
2020-04-06 23:58:03,784 text_importer.importers.core INFO     Done compressing and uploading
2020-04-07 00:00:40,326 text_importer.importers.core INFO     Processing chunk 0
2020-04-07 00:00:43,043 text_importer.importers.core INFO     Now compress and upload pages
2020-04-07 00:22:47,431 text_importer.importers.core INFO     Processing chunk 1
2020-04-07 00:22:47,461 text_importer.importers.core INFO     Now compress and upload pages
2020-04-07 00:28:41,683 text_importer.importers.core INFO     Processing chunk of 1990
2020-04-07 00:28:41,717 text_importer.importers.core INFO     Start compressing and uploading issues

Matteo's guess is that this is related to the filter specified in the config file. Without that filter, the import should work and not drop any data.

aflueckiger commented 4 years ago

I think I raised the alarm in error. Most likely, a faulty PDF in the last chunk caused the process to break down because of a mishandled exception. I am closing the issue as I cannot reproduce the error.
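For reference, one way to keep a single faulty input from aborting a whole chunk is to catch and record failures per item. This is only a sketch under assumptions: `process_all` and `parse` are hypothetical names, not functions of `text_importer`, and the `.bad` suffix merely stands in for an unreadable PDF.

```python
import logging
from typing import Callable, Iterable, List, Tuple, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")
R = TypeVar("R")


def process_all(
    items: Iterable[T], process_one: Callable[[T], R]
) -> Tuple[List[R], List[Tuple[T, Exception]]]:
    """Process every item; collect failures instead of stopping at the first one."""
    results: List[R] = []
    failures: List[Tuple[T, Exception]] = []
    for item in items:
        try:
            results.append(process_one(item))
        except Exception as exc:
            # Log the traceback so the failing input is visible in the log.
            logger.exception("Failed to process %r", item)
            failures.append((item, exc))
    return results, failures


def parse(name: str) -> str:
    """Hypothetical stand-in for parsing one input file."""
    if name.endswith(".bad"):
        raise ValueError(f"cannot parse {name}")
    return name.upper()


results, failures = process_all(["a.pdf", "b.bad", "c.pdf"], parse)
```

The failing item ends up in `failures` with its exception, while the other items are still processed, so a broken file shows up in the log rather than ending the run silently.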