Open konklone opened 8 years ago
What's the third dataset?
Whoops! I updated the issue with it. It's the restricted reports.
@konklone happy to port it over to /unitedstates. :us:
I'm working on a scraper that will do GAO reports and restricted reports.
There is some stuff dealing with citations in the Ruby parser. I'm assuming that can be omitted.
GAO usually provides "accessible text" .txt versions, which the Ruby parser uses to avoid running pdftotext. I will include the .txt URL in the JSON, but I don't think inspectors-general provides a way to manually supply the text that should go to Elasticsearch, so it can just process the PDFs as normal.
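To make that concrete, here is a rough sketch of what a report record carrying the .txt URL could look like. The field names (`report_id`, `title`, `url`, `text_url`), the `build_report` helper, and the example ID and URLs are all made up for illustration; none of them are taken from the inspectors-general code or the Ruby parser.

```python
import json


def build_report(report_id, title, pdf_url, text_url=None):
    """Assemble one report record. All field names here are illustrative."""
    report = {
        "report_id": report_id,
        "title": title,
        "url": pdf_url,  # the PDF is still what gets processed as normal
    }
    # Only record the accessible-text URL when GAO actually provides one.
    if text_url:
        report["text_url"] = text_url
    return report


# Hypothetical report ID and placeholder URLs.
example = build_report(
    "GAO-00-000",
    "Example report title",
    "https://example.com/report.pdf",
    text_url="https://example.com/report.txt",
)
print(json.dumps(example, indent=2))
```

Under this sketch the extra field is purely informational: downstream text extraction would ignore it and keep working from the PDF.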
Not the GAO IG, but the GAO itself, who publishes an amazing number of excellent reports.
There are three interesting datasets, with two known existing scrapers:
For both, in their current state I'd recommend porting them over here, rather than adding a wrapper around them or something. Perhaps we can convince @vzvenyach to move his efforts here too!