If the PDF's not being downloaded because it's already cached, and a .txt version of the file already exists, there's no reason to re-extract text. Skipping that step should speed up the scraper's sync time.
Right now, a sync of 2014 reports across our current slate of IGs takes ~50 minutes, even when not downloading m/any reports. It's not clear to me how much of that time is spent re-extracting text, but it's an easy optimization.
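
A minimal sketch of the check, not the scraper's actual code. The helper names `text_path_for` and `should_extract_text`, the `downloaded_now` flag, and the assumption that the `.txt` file sits next to the PDF with the same base name are all hypothetical:

```python
import os


def text_path_for(pdf_path):
    # Assumed convention: report.pdf -> report.txt in the same directory.
    base, _ = os.path.splitext(pdf_path)
    return base + ".txt"


def should_extract_text(pdf_path, downloaded_now):
    # downloaded_now is a hypothetical flag: True if the PDF was fetched
    # during this sync, False if it was served from the cache.
    if downloaded_now:
        # A fresh download may differ from whatever the old .txt was built from.
        return True
    # Cached PDF: extract only if there is no .txt yet.
    return not os.path.exists(text_path_for(pdf_path))
```

The caller would wrap the existing extraction step in `if should_extract_text(...)`. Timing a sync with and without the check would also show how much of the ~50 minutes is actually spent on extraction.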