unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Adding FOIA'd IG reports via GovernmentAttic.org scraper #276

Closed lukerosiak closed 8 years ago

lukerosiak commented 8 years ago

Adding FOIA'd IG reports is good because they are usually more scandalous than the ones published online. Some FOIA experts have created GovernmentAttic.org, which houses 2,000+ government reports obtained through records requests. Usually they FOIA for an index of all reports by an IG, then submit individual ones that seem especially interesting. The site is regularly updated.

So I think it makes sense to piggyback off them (and I have gotten permission from them) since it is the only way to incorporate FOIA'd IG reports in an automated way.

This scraper first narrows GovAttic's reports down to only those from IGs, and then within that, to only IGs already tracked by oversight.garden. This leaves you with about 420 documents right now. Some of these are actually multiple related IG reports in one document--sometimes eight or more. The date is the date it was obtained under FOIA/uploaded to GovAttic, not the date it was written by the IG. The inspector slug is set to the IG's slug, meaning it saves the documents alongside PDFs produced by the actual IG site scrapers, rather than a folder called govattic.
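The two-stage narrowing described above could be sketched roughly as follows. This is only an illustration, not the scraper's actual code: the `KNOWN_SLUGS` mapping, the `IG_PATTERN` regex, and the listing dictionary keys are all assumptions made up for the example.

```python
import re

# Hypothetical subset of the inspector slugs the project already tracks.
KNOWN_SLUGS = {
    "Department of Justice": "doj",
    "Department of Defense": "dod",
}

# Assumed heuristic for spotting IG material in a listing title.
IG_PATTERN = re.compile(r"\b(inspector general|OIG)\b", re.IGNORECASE)


def select_ig_reports(listings):
    """Keep only IG reports whose agency maps to an already-tracked slug."""
    selected = []
    for item in listings:
        # Stage 1: drop anything that is not an Inspector General document.
        if not IG_PATTERN.search(item["title"]):
            continue
        # Stage 2: drop IGs the project does not track yet.
        slug = KNOWN_SLUGS.get(item["agency"])
        if slug is None:
            continue
        selected.append({
            # Filed under the IG's own slug, not a separate "govattic" folder.
            "inspector": slug,
            "title": item["title"],
            # Date posted to GovAttic, not the date the IG wrote the report.
            "published_on": item["posted"],
            "url": item["url"],
        })
    return selected
```

Filing each document under the IG's slug is what keeps the FOIA'd PDFs next to the ones collected by the per-IG scrapers.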

Comments from Eric: "I definitely welcome that contribution too, though it will be a bit more complicated. In small part because it's an unofficial source, but in large part because the quality of the documents I've seen there tends to be really poor and will need a lot of OCRing. But it's also a huge trove of super relevant documents (including the names of a ton of unreleased IG reports), so it's definitely worth including here if you're going to write it."

Response: a) GovernmentAttic OCRs all documents they receive using high-quality OCR software, so we can extract text with pdftotext. The resulting text layer appears to be surprisingly accurate, even for PDFs that are poorly scanned image files. But image quality can always be a problem when dealing with FOIA'd docs. Usually the government scanned them that way before putting them on the CD, so the requester isn't responsible for the quality loss. It is becoming more common for agencies to send native PDFs on CDs, but because of their redaction processes and other reasons, they usually don't.

b) To your point about official vs. unofficial: FOIA'd documents, which the README asks for, are always going to be unofficial. If there were a government resource for them, we wouldn't need FOIAs. Given that, GovAttic is as good as it gets, because its judgement about what to request mirrors what the average person might find interesting; its requests aren't limited to some specialty niche or bias.
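For point (a), pulling the OCR text layer out of a downloaded PDF might look something like the sketch below. It assumes the poppler `pdftotext` command-line tool is installed and on the PATH; the function names are invented for illustration and are not the project's own helpers.

```python
import subprocess


def build_command(pdf_path):
    """Build a pdftotext invocation that writes the text to stdout ("-")."""
    # -layout asks pdftotext to preserve the physical page layout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the embedded text layer of a PDF (assumes pdftotext is installed)."""
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Decode defensively, since OCR layers sometimes contain stray bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

Because GovAttic has already run OCR, no separate OCR step is needed here; the quality of the output depends entirely on the text layer embedded in the file.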

konklone commented 8 years ago

@lukerosiak This is outstanding, and my thanks for writing this, and for working with Government Attic to get their :+1: on including the work this way.

I want to give this some real review, and incorporation of the results into a local copy of oversight.garden, before merging -- but offhand this looks very thorough.

I'm guessing governmentattic.org is updated by hand as static files, and doesn't have the ability to easily add an RSS feed?

lukerosiak commented 8 years ago

Thank you! You'll be surprised how good the .json files look because the PDFs have really good metadata of keywords, etc. embedded in them. You're right that there's no CMS behind the site, so we can't get RSS.

konklone commented 8 years ago

Just mentioning that I'm not dead, and this is still in my todo list -- I should be able to get to it by the end of next weekend. Others on the project are welcome to perform the review as well.

divergentdave commented 8 years ago

First off, thanks a ton for putting this together, @lukerosiak! This will be a great addition to the data set. I'm going to do my review changes as another PR to your branch, in part so I can test what I'm saying, and in part because the character set and date parsing bits are getting hairy.

In looking at remove_non_ascii(), it appears that utils.download() is not detecting the correct character set for some pages. The governmentattic.org server doesn't provide a character set in the Content-Type, but the HTML does include a meta tag that specifies utf-8. This gets lost in the composition of requests and BeautifulSoup, resulting in mojibake when requests guesses wrong. My plan is to add a special case to utils.download() telling requests to use utf-8. Between that and the inspector.sanitize() method, it looks like we'll only have center dots and trademark symbols left, which is fine for our purposes.
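To make the failure mode concrete, here is a small stdlib-only sketch of how a missing charset leads to mojibake. It assumes the HTTP client falls back to ISO-8859-1 for text responses that carry no charset parameter, which is my understanding of what requests does; `decode_body` and `charset_from_content_type` are made-up names, not functions from `utils`.

```python
def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type header, or None."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("\"'")
    return None


def decode_body(body, content_type, override=None):
    """Decode a response body, letting the caller force an encoding."""
    # Assumed fallback: ISO-8859-1 when the header names no charset.
    encoding = override or charset_from_content_type(content_type) or "iso-8859-1"
    return body.decode(encoding)


# A page served as UTF-8 but labelled only "text/html".
body = "Report · Part 1".encode("utf-8")

garbled = decode_body(body, "text/html")                  # wrong guess
fixed = decode_body(body, "text/html", override="utf-8")  # special case
```

With requests itself, the equivalent special case would be setting `response.encoding = "utf-8"` before reading `response.text`.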

Following #273, I'd like to avoid having a default date for reports. I'm going to add a call to the new inspector.log_no_date() function, and add a couple more heuristic tweaks to parse more dates.
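As a rough illustration of parsing dates without a default, the sketch below tries a few formats and returns `None` when nothing matches, leaving it to the caller to record the gap. The formats, the regex, the sample text, and the `missing_dates` list are assumptions for the example; the real code would call `inspector.log_no_date()`, whose signature is not shown in this thread.

```python
import re
from datetime import datetime

# Hypothetical set of date styles that might appear in a listing.
DATE_FORMATS = ("%d-%B-%Y", "%d-%b-%Y", "%B %d, %Y", "%m/%d/%Y")
DATE_PATTERN = re.compile(
    r"\d{1,2}-[A-Za-z]+-\d{4}"      # 12-Mar-2016
    r"|[A-Za-z]+ \d{1,2}, \d{4}"    # March 12, 2016
    r"|\d{1,2}/\d{1,2}/\d{4}"       # 3/12/2016
)


def parse_date(text):
    """Return the first parseable date in text, or None if there is none."""
    for candidate in DATE_PATTERN.findall(text):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(candidate, fmt).date()
            except ValueError:
                continue
    return None  # no default date; the caller decides what to do


missing_dates = []  # stand-in for a call to inspector.log_no_date()


def date_or_log(report_id, text):
    """Parse a date, recording the report id when none can be found."""
    parsed = parse_date(text)
    if parsed is None:
        missing_dates.append(report_id)
    return parsed
```

Returning `None` instead of a placeholder keeps undated reports visible so that new heuristics can be added for them later.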

I have opened an issue over at konklone/oversight.garden#99 to index and search the PDF keywords.

Thanks again, and expect a meta-PR from me shortly.


lukerosiak commented 8 years ago

@divergentdave Looks great to me, thank you for making those improvements! I will keep them in mind for the future. Do you need me to do anything or are we good to go on GovAttic?

I noticed that OSC reports are not online even though my prior PR has been merged in ( #236 ). Do I need to do anything on that?

After OSC and GovAttic are online, I will move on to GAO per #269 .

divergentdave commented 8 years ago

Hmm, #236 probably didn't get deployed. I'll look into that tonight. Thanks again!

lukerosiak commented 8 years ago

I believe this still needs to be deployed as well to add OSC to the list of inspectors on the oversight.garden repo:

https://github.com/konklone/oversight.garden/pull/93

On Fri, Apr 8, 2016 at 12:57 PM, David Cook notifications@github.com wrote:

Hmm, #236 https://github.com/unitedstates/inspectors-general/issues/236 probably didn't get deployed. I'll look into that tonight. Thanks again!


divergentdave commented 8 years ago

Both are properly deployed now; indexing is still in progress.

https://oversight.garden/reports?query=governmentattic.org
https://oversight.garden/reports?inspector=osc