Don't re-extract text when not re-downloading PDF

unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.

https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/

Creative Commons Zero v1.0 Universal


Don't re-extract text when not re-downloading PDF #73

Closed: konklone closed this issue 9 years ago

konklone commented 10 years ago

If the PDF's not being downloaded because it's already cached, and a .txt version of the file already exists, there's no reason to re-extract text. This will speed up sync time of the scraper.

Right now, a sync of 2014 reports across our current slate of IGs takes ~50 minutes, even when downloading few or no reports. It's not clear to me how much of that time goes to re-extracting text, but it's an easy optimization.
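
A rough sketch of the check described above, not the scraper's actual code. The names `text_path_for`, `should_extract`, and `sync_report` are hypothetical, as is the `downloaded` flag and the assumption that the `.txt` file sits next to the PDF with the same base name. The `extract` argument stands in for whatever text-extraction step the scraper already runs.

```python
import os


def text_path_for(pdf_path):
    """Return the assumed location of the .txt file next to a PDF."""
    base, _ = os.path.splitext(pdf_path)
    return base + ".txt"


def should_extract(pdf_path, downloaded):
    """Decide whether text extraction needs to run for this report.

    downloaded: True if the PDF was fetched during this sync,
    False if the cached copy was reused.
    """
    if downloaded:
        # A freshly downloaded PDF may not match the text already on disk.
        return True
    # Cached PDF: extraction is only needed if no .txt version exists yet.
    return not os.path.isfile(text_path_for(pdf_path))


def sync_report(pdf_path, downloaded, extract):
    """Call `extract(pdf_path, txt_path)` only when needed.

    Returns True if extraction ran, False if it was skipped.
    """
    if should_extract(pdf_path, downloaded):
        extract(pdf_path, text_path_for(pdf_path))
        return True
    return False
```

With something like this in place, a sync that hits only cached PDFs would do one file-existence check per report instead of a full extraction. Timing a sync before and after would show how much of the ~50 minutes is actually spent on text extraction.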