unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

GAO's own reports #269

Open konklone opened 8 years ago

konklone commented 8 years ago

Not the GAO IG, but the GAO itself, which publishes an amazing number of excellent reports.

There are four interesting datasets, with two known existing scrapers:

For both, in their current state I'd recommend porting them over here, rather than adding a wrapper around them or something. Perhaps we can convince @vzvenyach to move his efforts here too!

divergentdave commented 8 years ago

What's the third dataset?

konklone commented 8 years ago

Whoops! I updated the issue with it. It's the restricted reports.

vdavez commented 8 years ago

@konklone happy to port it over to /unitedstates. :us:

lukerosiak commented 8 years ago

I'm working on a scraper that will do GAO reports and restricted reports.

There is some stuff dealing with citations in the Ruby parser. I'm assuming that can be omitted.

GAO usually provides "accessible text" .txt versions, which the Ruby parser uses to avoid running pdftotext. I will include the .txt URL in the JSON, but I don't think inspectors-general provides a way to manually supply the text that gets indexed in Elasticsearch, so it can just process the PDFs as normal.
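
For illustration, here is a rough sketch of what carrying the .txt URL alongside the PDF URL in a report record might look like. The field names (`report_id`, `url`, `text_url`, and so on), the `build_report` helper, and the example.com URLs are all assumptions made up for this sketch, not the project's confirmed schema or real GAO links.

```python
import json
from datetime import date


def build_report(report_id, title, pdf_url, published, text_url=None):
    """Assemble a metadata record for one report (hypothetical field names)."""
    report = {
        "report_id": report_id,
        "title": title,
        "url": pdf_url,  # the PDF that would still be processed as normal
        "published": published.isoformat(),
    }
    # Only record the accessible-text URL when the report actually has one.
    if text_url:
        report["text_url"] = text_url
    return report


# Placeholder values only; none of these identify a real report.
example = build_report(
    report_id="GAO-00-000",
    title="Placeholder report title",
    pdf_url="https://example.com/GAO-00-000.pdf",
    published=date(2015, 1, 2),
    text_url="https://example.com/GAO-00-000.txt",
)
print(json.dumps(example, indent=2, sort_keys=True))
```

Keeping the .txt URL as an optional extra field means the PDF pipeline does not have to change, and the text URL is still there if a way to feed text in directly is added later.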