unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Add Treasury #78

Closed spulec closed 10 years ago

spulec commented 10 years ago

This is not a pretty one, but it seems to work. There are a few different report listing pages, which all have slightly different markup.

While working on this, I also learned about the Treasury Inspector General for Tax Administration which seems to be a separate IG that is also part of Treasury, but specifically tasked with IRS oversight. I think this should probably be treated as a separate scraper, but wanted to make the note here so it doesn't fall through the cracks.

konklone commented 10 years ago

Nice catch -- and actually, what do you know, the thorough @MRumsey totally caught the TIGTA in his dragnet spreadsheet. So yeah, that'll be its own scraper.

One thing I noticed, not a merge-breaker: a dash prepended to a title: `"title": "- SAR Data Quality Requires FinCEN's Continued Attention"`
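That stray dash could be stripped at scrape time. A minimal sketch (not code from the actual scraper; `clean_title` is a hypothetical helper):

```python
def clean_title(title):
    # lstrip takes a set of characters, so this removes any run of
    # leading dashes and spaces, then trims remaining whitespace.
    return title.lstrip("- ").strip()

print(clean_title("- SAR Data Quality Requires FinCEN's Continued Attention"))
```

Titles without a leading dash pass through unchanged, so it would be safe to apply to every scraped title.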

Also, I noticed some agency fields are `oig`, but I don't think it's the OIG auditing itself. (I could be wrong.)

Anyway, this otherwise looks great, and I'm happy to merge in.

konklone commented 10 years ago

The scraper got a 500 overnight because it requested this URL: http://www.treasury.gov/about/organizational-structure/ig/Pages/by-date-{}.aspx. Not clear how it happened. Happy to take a look at it when I have a sec.

spulec commented 10 years ago

I ran this locally and on a test server and wasn't able to reproduce it.

I'm sure it is a result of https://github.com/unitedstates/inspectors-general/blob/master/inspectors/treasury.py#L142, and `link.get('href')` must be returning `None` somewhere. Maybe this is another issue caused by differing versions of BeautifulSoup?

I'm not exactly sure what else I can do to debug. I can add an assertion that `link.get('href')` is truthy.
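That assertion might look like the sketch below. This is illustrative only, not the actual treasury.py code, and the plain dicts stand in for BeautifulSoup tags (both expose `.get()`):

```python
# Assert that each anchor actually carries an href before building a
# report URL, so a missing attribute fails loudly instead of producing
# a malformed request like the by-date-{}.aspx one above.
def extract_href(link):
    href = link.get("href")
    assert href, "Expected an href on link: %r" % (link,)
    return href

good = {"href": "/about/organizational-structure/ig/Pages/by-date-2014.aspx"}
print(extract_href(good))
```

With the guard in place, a tag lacking an `href` raises an `AssertionError` with the offending tag in the message, which would make the next overnight failure much easier to diagnose.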

konklone commented 10 years ago

It probably means the site returns some other HTML or string for a server error (though I think it's probably returning an actual 200 status code, or it would trigger the scrapelib error clause), and this is how it manifests in the scraper. I guess the Right Way would be to check that the returned HTML is valid. It's not an urgent issue.
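A rough sketch of that check: treat a 200 response as valid only if the body contains a marker the real listing pages have. The `"by-date"` marker here is an assumption, not verified against the live Treasury site:

```python
def looks_like_listing(html):
    # A 200 response can still be a soft error page; require both an
    # <html> tag and a marker string specific to the report listings.
    lowered = html.lower()
    return "<html" in lowered and "by-date" in lowered

listing = '<html><body><a href="/ig/Pages/by-date-2014.aspx">2014</a></body></html>'
error_page = "<html><body>An error occurred.</body></html>"
print(looks_like_listing(listing), looks_like_listing(error_page))
```

The scraper could skip (or retry) any page that fails this check instead of parsing it and propagating bad URLs.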