unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

[usps] [sba] Reports are falling through the cracks between pages #214

Closed divergentdave closed 9 years ago

divergentdave commented 9 years ago

As described in #213, the USPS document library uses an unstable sort algorithm. If several reports with the same date straddle a pagination boundary, we may see one report on both pages while missing another report entirely. We could probably detect this and re-fetch the offending pages until we make up the difference.
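
A minimal sketch of the detection idea, not the scraper's actual code. It assumes the listing reports a total result count (`expected_total`, a hypothetical parameter); if a site does not expose one, the duplicate IDs alone are still a warning sign. The report IDs and the helper name `find_pagination_gaps` are made up for illustration.

```python
from collections import Counter


def find_pagination_gaps(pages, expected_total):
    """Return (duplicated_ids, missing_count) for a list of fetched pages.

    pages: one list of report IDs per page, in the order they were fetched.
    expected_total: total number of reports the listing claims to have
    (an assumption; not every site exposes this).
    """
    counts = Counter(report_id for page in pages for report_id in page)
    duplicated = sorted(rid for rid, n in counts.items() if n > 1)
    missing_count = expected_total - len(counts)
    return duplicated, missing_count


# Hypothetical data: reports B and C share a date and sit on the page boundary.
page_1 = ["A", "B"]  # the server ordered the tie as B, C for this request
page_2 = ["B", "D"]  # ...and as C, B for this one, so C is never seen

duplicated, missing = find_pagination_gaps([page_1, page_2], expected_total=4)
print(duplicated, missing)  # ['B'] 1
```

A repeated ID together with a positive `missing_count` is the signature of the problem described above: one report seen twice, another not at all.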

divergentdave commented 9 years ago

The SBA scraper has the same problem.

konklone commented 9 years ago

Really? I would not expect this to be a common issue.

The easiest way to solve both issues would seem to be to specify an explicit sort order.
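
For illustration only, here is what an explicit, fully determined sort order looks like on data we control: the date is the primary key and a unique ID breaks ties. Whether the USPS or SBA sites accept a sort parameter that achieves this is not established in this thread, and the field names below are hypothetical.

```python
from datetime import date

# Hypothetical reports; R-1 and R-3 share a publication date.
reports = [
    {"id": "R-3", "published": date(2015, 3, 2)},
    {"id": "R-1", "published": date(2015, 3, 2)},
    {"id": "R-2", "published": date(2015, 3, 1)},
]


def total_order_key(report):
    # Newest first (negated ordinal), then ascending ID as a unique tie-breaker.
    return (-report["published"].toordinal(), report["id"])


def paginate(rows, page_size):
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


ordered = sorted(reports, key=total_order_key)
pages = paginate([r["id"] for r in ordered], page_size=2)
print(pages)  # [['R-1', 'R-3'], ['R-2']]

# The result does not depend on the order the rows arrived in.
assert sorted(reversed(reports), key=total_order_key) == ordered
```

Because no two rows compare equal under this key, every request sees the same sequence and page boundaries always fall in the same place.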

divergentdave commented 9 years ago

I can certainly see how it could happen: writing ORDER BY datetime DESC or ORDER BY year DESC, month DESC, date DESC seems like the sensible thing to do, but neither gives a defined order to rows that share a date. I have a plan for retrying pages when we miss rows, and I'm going to try it out on SBA first.
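
A rough sketch of what retrying could look like, under stated assumptions; it is not the code from #223, which is not shown in this thread. It assumes the total number of reports is known up front and simply re-fetches every page until enough unique IDs have been collected. `scrape_with_retries` and `FlakyListing` are hypothetical names, and the fake listing stands in for a site whose tie order changes between requests.

```python
def scrape_with_retries(fetch_page, page_count, expected_total, max_rounds=5):
    """Fetch all pages, repeating the pass until enough unique IDs are seen."""
    seen = set()
    for _ in range(max_rounds):
        for page_number in range(page_count):
            seen.update(fetch_page(page_number))
        if len(seen) >= expected_total:
            return seen
    missing = expected_total - len(seen)
    raise RuntimeError(f"still missing {missing} reports after {max_rounds} rounds")


class FlakyListing:
    """Hypothetical site: reports B and C share a date and swap places."""

    def __init__(self):
        self.calls = 0

    def fetch_page(self, page_number):
        self.calls += 1
        # Only the very first request orders the tie as B, C.
        order = ["A", "B", "C", "D"] if self.calls == 1 else ["A", "C", "B", "D"]
        return order[page_number * 2:page_number * 2 + 2]


listing = FlakyListing()
found = scrape_with_retries(listing.fetch_page, page_count=2, expected_total=4)
print(sorted(found), listing.calls)  # ['A', 'B', 'C', 'D'] 4
```

The first pass sees A, B, B, D and misses C; the second pass picks it up. Re-fetching only the pages around the duplicate would save requests, at the cost of more bookkeeping.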

divergentdave commented 9 years ago

Closed by #213 and #223.