incomplete ingestion when chunk-size is defined

There is an issue in the generic importer when data is processed in chunks, i.e. when a chunk size is defined. In this case, only full chunks are ingested correctly; the remaining years after the last full chunk are skipped at the end of the process.

This issue emerged when I ingested the FedGaz data for the years 1849 to 1999, as specified in a config file. Only the years up to 1989 were processed and uploaded to S3. These are the last lines of the log:

2020-04-06 23:47:59,816 text_importer.importers.core INFO     Processing chunk of 1980
2020-04-06 23:47:59,851 text_importer.importers.core INFO     Start compressing and uploading issues
2020-04-06 23:58:03,784 text_importer.importers.core INFO     Done compressing and uploading
2020-04-07 00:00:40,326 text_importer.importers.core INFO     Processing chunk 0
2020-04-07 00:00:43,043 text_importer.importers.core INFO     Now compress and upload pages
2020-04-07 00:22:47,431 text_importer.importers.core INFO     Processing chunk 1
2020-04-07 00:22:47,461 text_importer.importers.core INFO     Now compress and upload pages
2020-04-07 00:28:41,683 text_importer.importers.core INFO     Processing chunk of 1990
2020-04-07 00:28:41,717 text_importer.importers.core INFO     Start compressing and uploading issues

Matteo's guess is that this is related to the filter specified in the config file: without that filter, the import should work and not drop any data.
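
For reference, the symptom described at the top (everything after the last full chunk is dropped) is what a chunking loop produces when it only emits a chunk once it is full and never flushes the remainder. The following is only a sketch of that general pattern, not the actual code in text_importer, and I have not verified that this is what happens there. Both function names are hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_dropping_remainder(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Hypothetical buggy pattern: a chunk is emitted only once it is full."""
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Nothing happens here, so a trailing partial chunk is silently lost.


def chunk_with_remainder(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Same loop, but the trailing partial chunk is flushed at the end."""
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit whatever is left after the last full chunk.
        yield chunk


items = list(range(25))  # 25 items with chunk_size=10: two full chunks plus 5 left over
buggy = [x for c in chunk_dropping_remainder(items, 10) for x in c]
fixed = [x for c in chunk_with_remainder(items, 10) for x in c]
print(len(buggy), len(fixed))
```

If the importer's chunking turns out to follow the first pattern, flushing the remainder after the loop as in the second function would be one way to address it. On Python 3.12+, `itertools.batched` also yields a final shorter batch.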

impresso / impresso-text-acquisition, issue #92