unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

More unique report_id fixes #213

divergentdave closed 9 years ago

divergentdave commented 9 years ago

WIP, this will fix #197 and more.

divergentdave commented 9 years ago

The dupe from the ARC scraper is a copy-paste error on the website. I've emailed the webmaster.

divergentdave commented 9 years ago

I've been looking into the USPS scraper, and it seems that the USPS document library is nondeterministic. If you download one page over and over like so...

curl 'https://uspsoig.gov/document-library?type=All&field_doc_date_value%5Bvalue%5D=1998-01-01&field_doc_cat_tid%5B1920%5D=1920&field_doc_cat_tid%5B1923%5D=1923&field_doc_cat_tid%5B1922%5D=1922&page=0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C6' > 6a.html

... you will find that reports issued on the same day appear in a different order from one page load to the next. This causes warnings for us when, for example, a report appears at the end of page 6 and again at the beginning of page 7 by the time we fetch it. Of course, when that happens, we're also missing a report at the same page boundary. I'll deal with the doubled-up reports now, and open an issue to investigate automatically retrying pages when we miss something.
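As a rough illustration of the "doubling up" handling described above, here is a minimal sketch. It assumes each scraped report is a dict with a `report_id` key; `merge_pages` is a hypothetical helper written for this example and is not the scraper's actual code. It keeps the first copy of each report and records which IDs were repeated, since a repeat at a page boundary suggests another report may have been skipped.

```python
from typing import Dict, Iterable, List, Tuple


def merge_pages(pages: Iterable[List[dict]]) -> Tuple[List[dict], List[str]]:
    """Merge paginated listings, keeping the first copy of each report_id.

    Returns (unique reports in first-seen order, report_ids seen more than once).
    """
    seen: Dict[str, dict] = {}
    repeated: List[str] = []
    for page in pages:
        for report in page:
            report_id = report["report_id"]
            if report_id in seen:
                # The same report showed up again, most likely because the
                # listing was reordered between two page requests.
                if report_id not in repeated:
                    repeated.append(report_id)
                continue
            seen[report_id] = report
    return list(seen.values()), repeated


# Made-up data: report "B" sits at the end of page 6 and the start of page 7.
page_6 = [{"report_id": "A"}, {"report_id": "B"}]
page_7 = [{"report_id": "B"}, {"report_id": "D"}]

reports, repeated = merge_pages([page_6, page_7])
print([r["report_id"] for r in reports])  # ['A', 'B', 'D']
print(repeated)  # ['B']
```

A non-empty `repeated` list does not say which report was missed, only that the page boundary shifted, so it could serve as the trigger for re-fetching the affected pages later.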

divergentdave commented 9 years ago

It's done, ready for review!

konklone commented 9 years ago

This is outstanding, comprehensive work, @divergentdave. Thank you for doing this. Integrating into the production servers now.