Closed spulec closed 10 years ago
Nice catch -- and actually, what do you know, the thorough @MRumsey totally caught the TIGTA in his dragnet spreadsheet. So yeah, that'll be its own scraper.
One thing I noticed (not a merge-breaker): there's a dash prepended to one title, `"title": "- SAR Data Quality Requires FinCEN's Continued Attention",`
Also, I noticed some `agency` fields are `oig`, but I don't think it's the OIG auditing itself. (I could be wrong.)
Anyway, this otherwise looks great, and I'm happy to merge in.
The scraper got a 500 overnight because it requested this URL: `http://www.treasury.gov/about/organizational-structure/ig/Pages/by-date-{}.aspx`. Not clear how it happened. Happy to take a look at it when I have a sec.
I ran this locally and on a test server and wasn't able to reproduce it.
I'm sure it is a result of https://github.com/unitedstates/inspectors-general/blob/master/inspectors/treasury.py#L142 and `link.get('href')` must be returning `None` somewhere. Maybe another issue of different versions of BeautifulSoup?
I'm not exactly sure what else I can do to debug. I can add an assertion that `link.get('href')` is truthy.
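For what it's worth, here is a rough sketch of what that check could look like. The helper name `require_href` and the example URLs are made up for illustration, and plain dicts stand in for the parsed link tags (they expose the same `.get('href')` call); this is not the code in `treasury.py`.

```python
def require_href(link, page_url):
    """Return the link's href, or fail loudly with some context.

    `link` is anything with a dict-style .get(), such as a parsed <a> tag.
    `page_url` is only used to make the error message easier to trace.
    """
    href = link.get("href")
    if not href:
        # Covers both a missing attribute (None) and an empty string.
        raise ValueError(
            "Link without an href on {}: {!r}".format(page_url, link)
        )
    return href


# Hypothetical stand-ins for parsed links on a listing page.
listing_url = "http://example.com/reports/by-date-2014.aspx"
good_link = {"href": "/reports/example.pdf"}
bad_link = {"name": "anchor-only"}

print(require_href(good_link, listing_url))

try:
    require_href(bad_link, listing_url)
except ValueError as exc:
    print("caught:", exc)
```

Raising with the page URL in the message, rather than a bare `assert`, should make the next overnight failure point at the page that produced the bad link.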
It probably means the site returns some other HTML or string for a server error (though I think it's probably returning an actual 200 status code, or it would trigger the scrapelib error clause), and this is how it manifests in the scraper. I guess the Right Way would be to check that the HTML returned is valid. It's not an urgent issue.
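A minimal sketch of that kind of sanity check, using only the standard library so it doesn't depend on the BeautifulSoup version. The names `LinkCollector` and `check_listing_page` are hypothetical, and "a valid listing page has at least one link" is my assumption about what "valid" would mean here, not something the scraper does today.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has a non-empty one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def check_listing_page(url, html):
    """Raise if the URL or the returned body looks wrong; else return the hrefs."""
    # A leftover "{}" means the URL template was never filled in.
    if "{" in url or "}" in url:
        raise ValueError("URL still has an unfilled placeholder: " + url)

    parser = LinkCollector()
    parser.feed(html)
    # Assumption: an error page served with a 200 will have no report links.
    if not parser.hrefs:
        raise ValueError("No links found in response from " + url)
    return parser.hrefs


# Hypothetical usage with made-up URLs and markup.
page = '<ul><li><a href="/reports/example.pdf">Example report</a></li></ul>'
print(check_listing_page("http://example.com/by-date-2014.aspx", page))
```

Checking the URL before the request and the body after it would catch both halves of what happened overnight: the unformatted `{}` and a 200 response that isn't really a listing page.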
This is not a pretty one, but it seems to work. There are a few different report listing pages, which all have slightly different markup.
While working on this, I also learned about the Treasury Inspector General for Tax Administration, which seems to be a separate IG that is also part of Treasury but specifically tasked with IRS oversight. I think this should probably be treated as a separate scraper, but I wanted to make the note here so it doesn't fall through the cracks.