There is an issue in the generic importer when processing chunks of data: only full chunks are ingested correctly, while the years remaining after the last full chunk are skipped at the end of the process.
This issue emerged when I ingested data for the FedGaz for the years 1849-1999, as specified in a config file. However, only the years up to 1989 were processed and uploaded to s3. These are the last lines in the log:
2020-04-06 23:47:59,816 text_importer.importers.core INFO Processing chunk of 1980
2020-04-06 23:47:59,851 text_importer.importers.core INFO Start compressing and uploading issues
2020-04-06 23:58:03,784 text_importer.importers.core INFO Done compressing and uploading
2020-04-07 00:00:40,326 text_importer.importers.core INFO Processing chunk 0
2020-04-07 00:00:43,043 text_importer.importers.core INFO Now compress and upload pages
2020-04-07 00:22:47,431 text_importer.importers.core INFO Processing chunk 1
2020-04-07 00:22:47,461 text_importer.importers.core INFO Now compress and upload pages
2020-04-07 00:28:41,683 text_importer.importers.core INFO Processing chunk of 1990
2020-04-07 00:28:41,717 text_importer.importers.core INFO Start compressing and uploading issues
Matteo's guess is that this is related to the filter specified in the config file; without that filter, the importer should process all the data without dropping any.
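For illustration, here is a minimal sketch (hypothetical code, not the actual `text_importer` implementation) of how a fixed-size chunking loop can silently drop a trailing partial chunk, which would produce exactly the symptom above:

```python
def chunk_buggy(years, size):
    """Yields only FULL chunks: the final partial chunk is lost."""
    chunk = []
    for y in years:
        chunk.append(y)
        if len(chunk) == size:
            yield chunk
            chunk = []
    # BUG: the leftover `chunk` (fewer than `size` items) is never yielded


def chunk_fixed(years, size):
    """Yields every chunk, including the final partial one."""
    for i in range(0, len(years), size):
        yield years[i:i + size]


years = list(range(1849, 2000))  # 151 years, as in the FedGaz config
print(sum(len(c) for c in chunk_buggy(years, 10)))  # 150: the last year is dropped
print(sum(len(c) for c in chunk_fixed(years, 10)))  # 151: all years kept
```

With 151 years and a chunk size of 10, the buggy variant emits 15 full chunks (150 years) and silently discards the remainder.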
I think I sounded the alarm in error. Most likely, an erroneous PDF in the last chunk caused the process to break down because an exception was mishandled. I am closing the issue as I cannot reproduce the error.
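If a single bad PDF can abort an entire chunk, wrapping the per-issue work in a try/except would make the failure visible instead of fatal. A minimal sketch, assuming a hypothetical `import_issue` callable (the function names here are not the actual importer API):

```python
import logging

logger = logging.getLogger(__name__)


def import_chunk(issues, import_issue):
    """Import every issue in a chunk, logging failures instead of crashing."""
    imported, failed = [], []
    for issue in issues:
        try:
            imported.append(import_issue(issue))
        except Exception:
            # Without this handler, one erroneous PDF would kill the
            # remaining issues in the chunk (and any later chunks).
            logger.exception("Failed to import issue %s, skipping", issue)
            failed.append(issue)
    return imported, failed
```

Returning the failed issues alongside the imported ones would also make it easier to spot in the logs which PDF was at fault.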