divergentdave closed 9 years ago
The dupe from the ARC scraper is a copy-paste error in the website. I've emailed the webmaster.
I've been looking into the USPS scraper, and it seems that the USPS document library is nondeterministic. If you download one page over and over like so...
curl 'https://uspsoig.gov/document-library?type=All&field_doc_date_value%5Bvalue%5D=1998-01-01&field_doc_cat_tid%5B1920%5D=1920&field_doc_cat_tid%5B1923%5D=1923&field_doc_cat_tid%5B1922%5D=1922&page=0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C6' > 6a.html
... you will find that reports issued on the same day show up in a different order from one page load to the next. This causes warnings for us when, for example, a report appears at the end of page 6 and again at the beginning of page 7 by the time we fetch it. Of course, if that happens, it also means we're missing a report at that same page boundary. For now I'll handle the doubled-up reports, and I'll open an issue to investigate automatically retrying pages when we miss something.
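As a rough illustration of the "doubling up" handling described above, here is a minimal sketch, not the scraper's actual code. It assumes each page has already been parsed into a list of dicts with a `report_id` key; the function name `merge_pages` and that data shape are hypothetical. It keeps the first copy of each report and records which IDs repeated, since a repeat at a page boundary hints that a different report may have been skipped.

```python
def merge_pages(pages):
    """Merge paginated results, keeping the first copy of each report.

    `pages` is an iterable of lists of report dicts, each carrying a
    "report_id" key (a hypothetical shape chosen for this sketch).
    Returns (unique_reports, duplicate_ids).
    """
    seen = set()
    unique_reports = []
    duplicate_ids = []
    for page in pages:
        for report in page:
            report_id = report["report_id"]
            if report_id in seen:
                # Same report served on two pages: skip it, but remember it,
                # because another report was probably missed at this boundary.
                duplicate_ids.append(report_id)
                continue
            seen.add(report_id)
            unique_reports.append(report)
    return unique_reports, duplicate_ids


# Example: report "B" shows up at the end of page 6 and the start of page 7.
page_6 = [{"report_id": "A"}, {"report_id": "B"}]
page_7 = [{"report_id": "B"}, {"report_id": "D"}]
reports, repeats = merge_pages([page_6, page_7])
```

A non-empty `repeats` list could serve as the trigger for re-fetching the affected pages later, which is the follow-up issue mentioned above.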
It's done, ready for review!
This is outstanding, comprehensive work, @divergentdave. Thank you for doing this. Integrating into the production servers now.
WIP, this will fix #197 and more.